Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-08-07 Thread SCHAER Frederic

From: Jake Young [mailto:jak3...@gmail.com]
Sent: Wednesday, July 29, 2015 17:13
To: SCHAER Frederic
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic <frederic.sch...@cea.fr> wrote:
>
> Hi again,
>
> So I have tried
> - changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
> - changing the memory configuration, from "advanced ecc mode" to "performance 
> mode", boosting the memory bandwidth from 35GB/s to 40GB/s
> - plugged a second 10GB/s link and setup a ceph internal network
> - tried various "tuned-adm profile" such as "throughput-performance"
>
> This changed about nothing.
>
> If
> - the CPUs are not maxed out, and lowering the frequency doesn't change a 
> thing
> - the network is not maxed out
> - the memory doesn't seem to have an impact
> - network interrupts are spread across all 8 cpu cores and receive queues are 
> OK
> - disks are not used at their maximum potential (iostat shows my dd commands 
> produce much more tps than the 4MB ceph transfers...)
>
> Where can I possibly find a bottleneck ?
>
> I'm /(almost) out of ideas/ ... :'(
>
> Regards
>
>
Frederic,

I was trying to optimize my ceph cluster as well and I looked at all of the 
same things you described, which didn't help my performance noticeably.

The following network kernel tuning settings did help me significantly.

This is my /etc/sysctl.conf file on all of  my hosts: ceph mons, ceph osds and 
any client that connects to my ceph cluster.

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
#net.core.rmem_max = 56623104
#net.core.wmem_max = 56623104
# Use 128M buffers
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.optmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.somaxconn = 1024
# Increase the length of the processor input queue
net.core.netdev_max_backlog = 25
net.ipv4.tcp_max_syn_backlog = 3
net.ipv4.tcp_max_tw_buckets = 200
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0

# Recommended when jumbo frames are enabled
net.ipv4.tcp_mtu_probing = 1
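
For reference, a minimal sketch of applying and verifying these settings (assuming
they live in /etc/sysctl.conf; the interface name below is only an example) looks
something like:

# reload the file and spot-check a couple of values
sysctl -p /etc/sysctl.conf
sysctl net.core.rmem_max net.ipv4.tcp_rmem
# confirm jumbo frames are actually enabled on the fast interface
ip link show dev eth0 | grep -o 'mtu [0-9]*'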

I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything else.

Let me know if that helps.

Jake
[>- FS : -<]
Hi,

Thanks for suggesting these :]

I finally got some time to try your kernel parameters… but that doesn’t seem to 
help at least for the EC pools.
I’ll need to re-add all the disk OSDs to be really sure, especially with the 
replicated pools – I’d like to see if at least the replicated pools are better, 
so that I can use them as frontend pools…

Regards



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-29 Thread Jake Young
On Wed, Jul 29, 2015 at 11:23 AM, Mark Nelson  wrote:

> On 07/29/2015 10:13 AM, Jake Young wrote:
>
>> On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic <frederic.sch...@cea.fr> wrote:
>>  >
>>  > Hi again,
>>  >
>>  > So I have tried
>>  > - changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
>>  > - changing the memory configuration, from "advanced ecc mode" to
>> "performance mode", boosting the memory bandwidth from 35GB/s to 40GB/s
>>  > - plugged a second 10GB/s link and setup a ceph internal network
>>  > - tried various "tuned-adm profile" such as "throughput-performance"
>>  >
>>  > This changed about nothing.
>>  >
>>  > If
>>  > - the CPUs are not maxed out, and lowering the frequency doesn't
>> change a thing
>>  > - the network is not maxed out
>>  > - the memory doesn't seem to have an impact
>>  > - network interrupts are spread across all 8 cpu cores and receive
>> queues are OK
>>  > - disks are not used at their maximum potential (iostat shows my dd
>> commands produce much more tps than the 4MB ceph transfers...)
>>  >
>>  > Where can I possibly find a bottleneck ?
>>  >
>>  > I'm /(almost) out of ideas/ ... :'(
>>  >
>>  > Regards
>>  >
>>  >
>> Frederic,
>>
>> I was trying to optimize my ceph cluster as well and I looked at all of
>> the same things you described, which didn't help my performance
>> noticeably.
>>
>> The following network kernel tuning settings did help me significantly.
>>
>> This is my /etc/sysctl.conf file on all of  my hosts: ceph mons, ceph
>> osds and any client that connects to my ceph cluster.
>>
>>  # Increase Linux autotuning TCP buffer limits
>>  # Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104)
>> for 10GE
>>  # Don't set tcp_mem itself! Let the kernel scale it based on RAM.
>>  #net.core.rmem_max = 56623104
>>  #net.core.wmem_max = 56623104
>>  # Use 128M buffers
>>  net.core.rmem_max = 134217728
>>  net.core.wmem_max = 134217728
>>  net.core.rmem_default = 67108864
>>  net.core.wmem_default = 67108864
>>  net.core.optmem_max = 134217728
>>  net.ipv4.tcp_rmem = 4096 87380 67108864
>>  net.ipv4.tcp_wmem = 4096 65536 67108864
>>
>>  # Make room for more TIME_WAIT sockets due to more clients,
>>  # and allow them to be reused if we run out of sockets
>>  # Also increase the max packet backlog
>>  net.core.somaxconn = 1024
>>  # Increase the length of the processor input queue
>>  net.core.netdev_max_backlog = 25
>>  net.ipv4.tcp_max_syn_backlog = 3
>>  net.ipv4.tcp_max_tw_buckets = 200
>>  net.ipv4.tcp_tw_reuse = 1
>>  net.ipv4.tcp_tw_recycle = 1
>>  net.ipv4.tcp_fin_timeout = 10
>>
>>  # Disable TCP slow start on idle connections
>>  net.ipv4.tcp_slow_start_after_idle = 0
>>
>>  # If your servers talk UDP, also up these limits
>>  net.ipv4.udp_rmem_min = 8192
>>  net.ipv4.udp_wmem_min = 8192
>>
>>  # Disable source routing and redirects
>>  net.ipv4.conf.all.send_redirects = 0
>>  net.ipv4.conf.all.accept_redirects = 0
>>  net.ipv4.conf.all.accept_source_route = 0
>>
>>  # Recommended when jumbo frames are enabled
>>  net.ipv4.tcp_mtu_probing = 1
>>
>> I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything
>> else.
>>
>> Let me know if that helps.
>>
>
> Hi Jake,
>
> Could you talk a little bit about what scenarios you've seen tuning this
> help?  I noticed improvement in RGW performance in some cases with similar
> TCP tunings, but it would be good to understand what other folks are seeing
> and in what situations.
>
>
>> Jake
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>  ___
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Hey Mark,

I'm only using RBD.  My clients are all VMware, so I have a few iSCSI proxy
VMs (using rbd-enabled tgt).  My workload is typically light random
read/write, except for the periodic eager zeroing of multi-terabyte
volumes.  Since there is no VAAI in tgt, this turns into heavy sequential
writing.

I found the network tuning above helped to "open up" the connection from a
single iSCSI proxy VM to the cluster.
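
For anyone curious about the proxy side, a rough sketch of exposing an rbd image
through tgt (assuming a tgt build compiled with RBD support; the target name and
the rbd/vmware-lun0 image are made-up examples, not my actual config):

# create a target and attach an RBD image as its backing store
tgtadm --lld iscsi --mode target --op new --tid 1 --targetname iqn.2015-07.com.example:rbd-lun0
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --bstype rbd --backing-store rbd/vmware-lun0
tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL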

Note that my osd nodes have both a public network interface and a
dedicated private network interface, both 40G.  I believe the
network tuning also improves the performance of the
cluster network (which carries the replication traffic), because
initially I had only applied the kernel tuning to the osd nodes and saw a
performance improvement before I implemented it on the iSCSI proxy VMs.
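
The public/cluster split itself is just the standard ceph.conf setting; a sketch
(the subnets are made up) would be:

# /etc/ceph/ceph.conf on the osd nodes
[global]
    public network  = 10.0.10.0/24    # client-facing interface
    cluster network = 10.0.20.0/24    # osd replication traffic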

I should m

Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-29 Thread Mark Nelson

On 07/29/2015 10:13 AM, Jake Young wrote:

On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic <frederic.sch...@cea.fr> wrote:
 >
 > Hi again,
 >
 > So I have tried
 > - changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
 > - changing the memory configuration, from "advanced ecc mode" to
"performance mode", boosting the memory bandwidth from 35GB/s to 40GB/s
 > - plugged a second 10GB/s link and setup a ceph internal network
 > - tried various "tuned-adm profile" such as "throughput-performance"
 >
 > This changed about nothing.
 >
 > If
 > - the CPUs are not maxed out, and lowering the frequency doesn't
change a thing
 > - the network is not maxed out
 > - the memory doesn't seem to have an impact
 > - network interrupts are spread across all 8 cpu cores and receive
queues are OK
 > - disks are not used at their maximum potential (iostat shows my dd
commands produce much more tps than the 4MB ceph transfers...)
 >
 > Where can I possibly find a bottleneck ?
 >
 > I'm /(almost) out of ideas/ ... :'(
 >
 > Regards
 >
 >
Frederic,

I was trying to optimize my ceph cluster as well and I looked at all of
the same things you described, which didn't help my performance noticeably.

The following network kernel tuning settings did help me significantly.

This is my /etc/sysctl.conf file on all of  my hosts: ceph mons, ceph
osds and any client that connects to my ceph cluster.

 # Increase Linux autotuning TCP buffer limits
 # Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104)
for 10GE
 # Don't set tcp_mem itself! Let the kernel scale it based on RAM.
 #net.core.rmem_max = 56623104
 #net.core.wmem_max = 56623104
 # Use 128M buffers
 net.core.rmem_max = 134217728
 net.core.wmem_max = 134217728
 net.core.rmem_default = 67108864
 net.core.wmem_default = 67108864
 net.core.optmem_max = 134217728
 net.ipv4.tcp_rmem = 4096 87380 67108864
 net.ipv4.tcp_wmem = 4096 65536 67108864

 # Make room for more TIME_WAIT sockets due to more clients,
 # and allow them to be reused if we run out of sockets
 # Also increase the max packet backlog
 net.core.somaxconn = 1024
 # Increase the length of the processor input queue
 net.core.netdev_max_backlog = 25
 net.ipv4.tcp_max_syn_backlog = 3
 net.ipv4.tcp_max_tw_buckets = 200
 net.ipv4.tcp_tw_reuse = 1
 net.ipv4.tcp_tw_recycle = 1
 net.ipv4.tcp_fin_timeout = 10

 # Disable TCP slow start on idle connections
 net.ipv4.tcp_slow_start_after_idle = 0

 # If your servers talk UDP, also up these limits
 net.ipv4.udp_rmem_min = 8192
 net.ipv4.udp_wmem_min = 8192

 # Disable source routing and redirects
 net.ipv4.conf.all.send_redirects = 0
 net.ipv4.conf.all.accept_redirects = 0
 net.ipv4.conf.all.accept_source_route = 0

 # Recommended when jumbo frames are enabled
 net.ipv4.tcp_mtu_probing = 1

I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything else.

Let me know if that helps.


Hi Jake,

Could you talk a little bit about what scenarios you've seen tuning this 
help?  I noticed improvement in RGW performance in some cases with 
similar TCP tunings, but it would be good to understand what other folks 
are seeing and in what situations.




Jake


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-29 Thread Jake Young
On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic 
wrote:
>
> Hi again,
>
> So I have tried
> - changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
> - changing the memory configuration, from "advanced ecc mode" to
"performance mode", boosting the memory bandwidth from 35GB/s to 40GB/s
> - plugged a second 10GB/s link and setup a ceph internal network
> - tried various "tuned-adm profile" such as "throughput-performance"
>
> This changed about nothing.
>
> If
> - the CPUs are not maxed out, and lowering the frequency doesn't change a
thing
> - the network is not maxed out
> - the memory doesn't seem to have an impact
> - network interrupts are spread across all 8 cpu cores and receive queues
are OK
> - disks are not used at their maximum potential (iostat shows my dd
commands produce much more tps than the 4MB ceph transfers...)
>
> Where can I possibly find a bottleneck ?
>
> I'm /(almost) out of ideas/ ... :'(
>
> Regards
>
>
Frederic,

I was trying to optimize my ceph cluster as well and I looked at all of the
same things you described, which didn't help my performance noticeably.

The following network kernel tuning settings did help me significantly.

This is my /etc/sysctl.conf file on all of  my hosts: ceph mons, ceph osds
and any client that connects to my ceph cluster.

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for
10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
#net.core.rmem_max = 56623104
#net.core.wmem_max = 56623104
# Use 128M buffers
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.optmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.somaxconn = 1024
# Increase the length of the processor input queue
net.core.netdev_max_backlog = 25
net.ipv4.tcp_max_syn_backlog = 3
net.ipv4.tcp_max_tw_buckets = 200
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0

# Recommended when jumbo frames are enabled
net.ipv4.tcp_mtu_probing = 1

I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything else.

Let me know if that helps.

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-28 Thread SCHAER Frederic
Hi again,

So I have tried 
- changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
- changing the memory configuration, from "advanced ecc mode" to "performance 
mode", boosting the memory bandwidth from 35GB/s to 40GB/s
- plugged a second 10GB/s link and setup a ceph internal network
- tried various "tuned-adm profile" such as "throughput-performance"

This changed about nothing.

If 
- the CPUs are not maxed out, and lowering the frequency doesn't change a thing
- the network is not maxed out
- the memory doesn't seem to have an impact
- network interrupts are spread across all 8 cpu cores and receive queues are OK
- disks are not used at their maximum potential (iostat shows my dd commands 
produce much more tps than the 4MB ceph transfers...)

Where can I possibly find a bottleneck ?

I'm /(almost) out of ideas/ ... :'(

Regards

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER 
Frederic
Sent: Friday, July 24, 2015 16:04
To: Christian Balzer; ceph-users@lists.ceph.com
Subject: [PROVENANCE INTERNET] Re: [ceph-users] Ceph 0.94 (and lower) 
performance on >1 hosts ??

Hi,

Thanks.
I did not know about atop, nice tool... and I don't seem to be IRQ overloaded - 
I can reach 100% CPU in IRQs, but that's shared across all 8 physical cores.
I also discovered "turbostat", which showed me the R510s were not configured for 
"performance" in the BIOS (but dbpm - demand based power management), and were 
not bumping the CPU frequency to 2.4GHz as they should... apparently 
remaining at 1.6GHz...

But changing that did not improve things unfortunately. I now have CPUs using 
their Xeon turbo frequency, but no throughput improvement.

Looking at RPS/RSS, it looks like our Broadcom cards are configured correctly 
according to Red Hat, i.e. one receive queue per physical core, spreading the 
IRQ load everywhere.
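
For the record, checks along these lines are enough to see it (p2p1 is simply the
interface name on these boxes, and 82 is one example IRQ number taken from the
/proc/interrupts output):

# one receive queue per physical core ?
ls -d /sys/class/net/p2p1/queues/rx-*
# which IRQ lines belong to the card, and which CPUs may service one of them ?
grep p2p1 /proc/interrupts
cat /proc/irq/82/smp_affinity_list
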
One thing I noticed though is that the Dell BIOS allows changing IRQs... but 
once you change the network card IRQ, it also changes the RAID card IRQ as well 
as many others, all sharing the same BIOS IRQ (so that's apparently a 
useless option). Weird.

Still attempting to determine the bottleneck ;)

Regards
Frederic

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, July 23, 2015 14:18
To: ceph-users@lists.ceph.com
Cc: Gregory Farnum; SCHAER Frederic
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

On Thu, 23 Jul 2015 11:14:22 +0100 Gregory Farnum wrote:

> Your note that dd can do 2GB/s without networking makes me think that
> you should explore that. As you say, network interrupts can be
> problematic in some systems. The only thing I can think of that's been
> really bad in the past is that some systems process all network
> interrupts on cpu 0, and you probably want to make sure that it's
> splitting them across CPUs.
>

An IRQ overload would be very visible with atop.

Splitting the IRQs will help, but it is likely to need some smarts.

As in, irqbalance may spread things across NUMA nodes.

A card with just one IRQ line will need RPS (Receive Packet Steering),
irqbalance can't help it.

For example, I have a compute node with such a single line card and Quad
Opterons (64 cores, 8 NUMA nodes).

The default is all interrupt handling on CPU0 and that is very little,
except for eth2. So this gets a special treatment:
---
echo 4 >/proc/irq/106/smp_affinity_list
---
Pinning the IRQ for eth2 to CPU 4 by default

---
echo f0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
---
giving RPS CPUs 4-7 to work with. At peak times it needs more than 2
cores, otherwise with this architecture just using 4 and 5 (same L2 cache)
would be better.
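
For completeness, finding the IRQ number for a given card (106 above is simply
what it happens to be on this box) and checking where its interrupts and RPS mask
currently point is just:

grep eth2 /proc/interrupts
cat /proc/irq/106/smp_affinity_list
cat /sys/class/net/eth2/queues/rx-0/rps_cpus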

Regards,

Christian
-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com               Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-24 Thread SCHAER Frederic
Hi,

Thanks.
I did not know about atop, nice tool... and I don't seem to be IRQ overloaded - 
I can reach 100% CPU in IRQs, but that's shared across all 8 physical cores.
I also discovered "turbostat", which showed me the R510s were not configured for 
"performance" in the BIOS (but dbpm - demand based power management), and were 
not bumping the CPU frequency to 2.4GHz as they should... apparently 
remaining at 1.6GHz...

But changing that did not improve things unfortunately. I now have CPUs using 
their Xeon turbo frequency, but no throughput improvement.

Looking at RPS/RSS, it looks like our Broadcom cards are configured correctly 
according to Red Hat, i.e. one receive queue per physical core, spreading the 
IRQ load everywhere.
One thing I noticed though is that the Dell BIOS allows changing IRQs... but 
once you change the network card IRQ, it also changes the RAID card IRQ as well 
as many others, all sharing the same BIOS IRQ (so that's apparently a 
useless option). Weird.

Still attempting to determine the bottleneck ;)

Regards
Frederic

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, July 23, 2015 14:18
To: ceph-users@lists.ceph.com
Cc: Gregory Farnum; SCHAER Frederic
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

On Thu, 23 Jul 2015 11:14:22 +0100 Gregory Farnum wrote:

> Your note that dd can do 2GB/s without networking makes me think that
> you should explore that. As you say, network interrupts can be
> problematic in some systems. The only thing I can think of that's been
> really bad in the past is that some systems process all network
> interrupts on cpu 0, and you probably want to make sure that it's
> splitting them across CPUs.
>

An IRQ overload would be very visible with atop.

Splitting the IRQs will help, but it is likely to need some smarts.

As in, irqbalance may spread things across NUMA nodes.

A card with just one IRQ line will need RPS (Receive Packet Steering),
irqbalance can't help it.

For example, I have a compute node with such a single line card and Quad
Opterons (64 cores, 8 NUMA nodes).

The default is all interrupt handling on CPU0 and that is very little,
except for eth2. So this gets a special treatment:
---
echo 4 >/proc/irq/106/smp_affinity_list
---
Pinning the IRQ for eth2 to CPU 4 by default

---
echo f0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
---
giving RPS CPUs 4-7 to work with. At peak times it needs more than 2
cores, otherwise with this architecture just using 4 and 5 (same L2 cache)
would be better.

Regards,

Christian
-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com               Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-23 Thread Christian Balzer
On Thu, 23 Jul 2015 11:14:22 +0100 Gregory Farnum wrote:

> Your note that dd can do 2GB/s without networking makes me think that
> you should explore that. As you say, network interrupts can be
> problematic in some systems. The only thing I can think of that's been
> really bad in the past is that some systems process all network
> interrupts on cpu 0, and you probably want to make sure that it's
> splitting them across CPUs.
>

An IRQ overload would be very visible with atop.

Splitting the IRQs will help, but it is likely to need some smarts.

As in, irqbalance may spread things across NUMA nodes.

A card with just one IRQ line will need RPS (Receive Packet Steering),
irqbalance can't help it.

For example, I have a compute node with such a single line card and Quad
Opterons (64 cores, 8 NUMA nodes).

The default is all interrupt handling on CPU0 and that is very little,
except for eth2. So this gets a special treatment:
---
echo 4 >/proc/irq/106/smp_affinity_list
---
Pinning the IRQ for eth2 to CPU 4 by default

---
echo f0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
---
giving RPS CPUs 4-7 to work with. At peak times it needs more than 2
cores, otherwise with this architecture just using 4 and 5 (same L2 cache)
would be better.

Regards,

Christian
-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com               Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-23 Thread SCHAER Frederic
Hi,

Well, I think the journaling would still appear in the dstat output, as those are 
still IOs: even if the user-side bandwidth is indeed cut in half, that should 
not be the case for the disk IO.
For instance, I just tried a replicated pool for the test, and got around 
1300MiB/s in dstat for about 600MiB/s in the rados bench - I take it that, 
with replication/size=2, there are 2 replicas, each written twice (data + 
journal) and spread over the hosts, so that's 600*2*2/2 = 1200MiB/s of disk 
IO per host (+/- the approximations)...
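
(A quick shell sanity check of that estimate, with the assumptions above - 600MiB/s
of client writes, size=2, data written twice because of the journal, 2 hosts:)

echo $(( 600 * 2 * 2 / 2 ))   # expected MiB/s of raw disk IO per host -> 1200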

Using the dd flag "oflag=sync" indeed lowers the dstat values down to 
1100-1300MiB/s. Still above what ceph uses with EC pools .

I have tried to identify/watch interrupt issues (using the watch command), but 
I have to say I failed until know.
The Broadcom card is indeed spreading the load on the cpus:

# egrep 'CPU|p2p' /proc/interrupts
CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   
CPU6   CPU7   CPU8   CPU9   CPU10  CPU11  CPU12  
CPU13  CPU14  CPU15
  80: 881646372   1508 30  97328  0  
10459270   2715   8753  0  12765   5100   
9148   9420  0   PCI-MSI-edge  p2p1
  82: 179710 165107  94684 334842 210219  47403 
270330 166877   3516 229043  709844660  16512   5088   
2456312  12302   PCI-MSI-edge  p2p1-fp-0
  83:  12454  14073   5571  15196   5282  22301  
11522  21299 4092581302069   1303  79810  705953243   
1836  15190 883683   PCI-MSI-edge  p2p1-fp-1
  84:   6463  13994  57006  16200  16778 374815 
558398  11902  695554360  94228   1252  18649 825684   
7555 731875 190402   PCI-MSI-edge  p2p1-fp-2
  85: 163228 259899 143625 121326 107509 798435 
168027 144088  75321  89962  55297  715175665 784356  
53961  92153  92959   PCI-MSI-edge  p2p1-fp-3
  86:233267453226792070827220797122540051748938
39492831684674 65008514098872704778 140711 160954 
5910372981286  672487805   PCI-MSI-edge  p2p1-fp-4
  87:  33772 233318 136341  58163 506773 183451   
18269706  52425 226509  22150  17026 176203   5942  
681346619 270341  87435   PCI-MSI-edge  p2p1-fp-5
  88:   65103573  105514146   51193688   51330824   41771147   61202946   
41053735   49301547 181380   73028922  39525 172439 155778 
108065  154750931   26348797   PCI-MSI-edge  p2p1-fp-6
  89:   59287698  120778879   43446789   47063897   39634087   39463210   
46582805   48786230 342778   82670325 135397 438041 318995
3642955  179107495 833932   PCI-MSI-edge  p2p1-fp-7
  90:   1804   4453   2434  19885  11527   9771  
12724   2392840  12721439   1166   3354
560  69386   9233   PCI-MSI-edge  p2p2
  92:6455149433007258203245273513   115645711838476
22200494039978 977482   15351931 9494511685983 772531
271810175312351954224   PCI-MSI-edge  p2p2-fp-0

I don't know yet how to check whether there are memory bandwidth/latency/whatever 
issues...

Regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-23 Thread Gregory Farnum
I'm not sure. It looks like Ceph and your disk controllers are doing
basically the right thing since you're going from 1GB/s to 420MB/s
when moving from dd to Ceph (the full data journaling cuts it in
half), but just fyi that dd task is not doing nearly the same thing as
Ceph does — you'd need to use directio or similar; the conv=fsync flag
means it will fsync the written data at the end of the run but not at
any intermediate point.
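
A rough illustration of the difference (paths and sizes are only examples):
conv=fsync flushes once at the end, whereas oflag=direct / oflag=dsync bypass or
flush the page cache on every write, which is closer to what the OSD journal does:

dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/ddtest bs=4M count=256 oflag=direct
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/ddtest bs=4M count=256 oflag=dsync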

The change from 1 node to 2 cutting your performance so much is a bit
odd. I do note that
1 node: 420 MB/s each
2 nodes: 320 MB/s each
5 nodes: 275 MB/s each
so you appear to be reaching some kind of bound.

Your note that dd can do 2GB/s without networking makes me think that
you should explore that. As you say, network interrupts can be
problematic in some systems. The only thing I can think of that's been
really bad in the past is that some systems process all network
interrupts on cpu 0, and you probably want to make sure that it's
splitting them across CPUs.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-22 Thread SCHAER Frederic
Hi Gregory,



Thanks for your replies.

Let's take the 2-host config setup (3 MONs + 3 idle MDSes on the same hosts).



2 Dell R510 servers, CentOS 7.0.1406, dual Xeon 5620 (8 
cores + hyperthreading), 16GB RAM, 2 or 1 x 10Gbit/s Ethernet (same results with 
and without a private 10Gbit network), PERC H700 + 12 2TB SAS disks, and PERC 
H800 + 11 2TB SAS disks (one unused SSD...)

The EC pool is defined with k=4, m=1

I set the failure domain to OSD for the test

The OSDs are set up with XFS and a 10GB journal as the 1st partition (the single 
doomed Dell SSD was a bottleneck for 23 disks…)

All disks are presently configured as single-disk RAID0 volumes because the 
H700/H800 do not support JBOD.
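
(For reference, the pool was created with something like the following - the 
profile name is a placeholder and the PG count is just what I happened to pick:)

ceph osd erasure-code-profile set ec41profile k=4 m=1 ruleset-failure-domain=osd
ceph osd pool create testec 1024 1024 erasure ec41profile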



I have 5 clients (CentOS 7.1), 10gbits/s ethernet, all running this command :

rados -k ceph.client.admin.keyring -p testec bench 120 write -b 4194304 -t 32 
--run-name "bench_`hostname -s`" --no-cleanup

I'm aggregating the average bandwidth at the end of the tests.

I'm monitoring the Ceph servers stats live with this dstat command: dstat -N 
p2p1,p2p2,total

The network MTU is 9000 on all nodes.



With this, the average client throughput is around 130MiB/s, i.e. 650MiB/s for 
the whole 2-node Ceph cluster across the 5 clients.

I have since tried removing (ceph osd out / ceph osd crush reweight 0) either the 
H700 or the H800 disks, thus only using 11 or 12 disks per server, and I get 
either 550MiB/s or 590MiB/s of aggregated client bandwidth. Not much less, 
considering I removed half the disks!
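
(Per OSD, that amounts to something like the following, osd.12 being just an
example id:)

ceph osd out 12
ceph osd crush reweight osd.12 0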

I'm therefore starting to think I am CPU/memory bandwidth limited... ?



That's not, however, what I am tempted to conclude (for the CPU at least) when I 
look at the dstat output, as it shows the CPUs still sitting idle or waiting on IO:



total-cpu-usage -dsk/total- --net/p2p1net/p2p2---net/total- 
---paging-- ---system--

usr sys idl wai hiq siq| read  writ| recv  send: recv  send: recv  send|  in   
out | int   csw

  1   1  97   0   0   0| 586k 1870k|   0 0 :   0 0 :   0 0 |  49B  
455B|816715k

29  17  24  27   0   3| 128k  734M| 367M  870k:   0 0 : 367M  870k|   0 
0 |  61k   61k

30  17  34  16   0   3| 432k  750M| 229M  567k: 199M  168M: 427M  168M|   0 
0 |  65k   68k

25  14  38  20   0   3|  16k  634M| 232M  654k: 162M  133M: 393M  134M|   0 
0 |  56k   64k

19  10  46  23   0   2| 232k  463M| 244M  670k: 184M  138M: 428M  139M|   0 
0 |  45k   55k

15   8  46  29   0   1| 368k  422M| 213M  623k: 149M  110M: 362M  111M|   0 
0 |  35k   41k

25  17  37  19   0   3|  48k  584M| 139M  394k: 137M   90M: 276M   91M|   0 
0 |  54k   53k



Could it be the interrupts or system context switches that cause this 
relatively poor performance per node?

PCI-E interactions with the PERC cards?

I know I can get way more disk throughput with dd (command below)

total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--

usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw

  1   1  97   0   0   0| 595k 2059k|   0 0 | 634B 2886B|797115k

  1  93   0   3   0   3|   0  1722M|  49k   78k|   0 0 |  40k   47k

  1  93   0   3   0   3|   0  1836M|  40k   69k|   0 0 |  45k   57k

  1  95   0   2   0   2|   0  1805M|  40k   69k|   0 0 |  38k   34k

  1  94   0   3   0   2|   0  1864M|  37k   38k|   0 0 |  35k   24k

(…)



Dd command (# use at your own risk #):

FS_THR=64 ; FILE_MB=8 ; N_FS=`mount|grep ceph|wc -l`
time (for i in `mount|grep ceph|awk '{print $3}'` ; do
        echo "writing $FS_THR times (threads) " $[ 4 * FILE_MB ] " mb on $i..."
        for j in `seq 1 $FS_THR` ; do
          dd conv=fsync if=/dev/zero of=$i/test.zero.$j bs=4M count=$[ FILE_MB / 4 ] &
        done
      done ; wait)
echo "wrote $[ N_FS * FILE_MB * FS_THR ] MB on $N_FS FS with $FS_THR threads"
rm -f /var/lib/ceph/osd/*/test.zero*





Hope this gives you more insight into what I’m trying to achieve, and where it’s 
failing.



Regards





-Original Message-
From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: Wednesday, July 22, 2015 16:01
To: Florent MONTHEL
Cc: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

We might also be able to help you improve or better understand your
results if you can tell us exactly what tests you're conducting that
are giving you these numbers.
-Greg

On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL <fmont...@flox-arts.net> wrote:
> Hi Frederic,
>
> When you have a Ceph cluster with 1 node you don't experience the network and
> communication overhead due to the distributed model
> With 2 nodes and EC 4+1 you will have communication between 2 nodes but you
> will keep internal communication (2 chunks on first node and 3 chunks on
> second node)
> On your configuration the EC pool is set up with 4+1 so you will have, for each
> write, overhead due to write spread

Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-22 Thread Gregory Farnum
We might also be able to help you improve or better understand your
results if you can tell us exactly what tests you're conducting that
are giving you these numbers.
-Greg

On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL  wrote:
> Hi Frederic,
>
> When you have a Ceph cluster with 1 node you don’t experience the network and
> communication overhead due to the distributed model
> With 2 nodes and EC 4+1 you will have communication between 2 nodes but you
> will keep internal communication (2 chunks on first node and 3 chunks on
> second node)
> On your configuration the EC pool is set up with 4+1 so you will have, for each
> write, overhead due to write spreading on 5 nodes (for 1 customer IO, you
> will experience 5 Ceph IOs due to EC 4+1)
> It’s the reason why I think you’re reaching performance stability with
> 5 nodes and more in your cluster
>
>
> On Jul 20, 2015, at 10:35 AM, SCHAER Frederic 
> wrote:
>
> Hi,
>
> As I explained in various previous threads, I’m having a hard time getting
> the most out of my test ceph cluster.
> I’m benching things with rados bench.
> All Ceph hosts are on the same 10GB switch.
>
> Basically, I know I can get about 1GB/s of disk write performance per host,
> when I bench things with dd (hundreds of dd threads) +iperf 10gbit
> inbound+iperf 10gbit outbound.
> I also can get 2GB/s or even more if I don’t bench the network at the same
> time, so yes, there is a bottleneck between disks and network, but I can’t
> identify which one, and it’s not relevant for what follows anyway
> (Dell R510 + MD1200 + PERC H700 + PERC H800 here, if anyone has hints about
> this strange bottleneck though…)
>
> My hosts each are connected through a single 10Gbits/s link for now.
>
> My problem is the following. Please note I see the same kind of poor
> performance with replicated pools...
> When testing EC pools, I ended putting a 4+1 pool on a single node in order
> to track down the ceph bottleneck.
> On that node, I can get approximately 420MB/s write performance using rados
> bench, but that’s fair enough since the dstat output shows that real data
> throughput on disks is about 800+MB/s (that’s the ceph journal effect, I
> presume).
>
> I tested Ceph on my other standalone nodes : I can also get around 420MB/s,
> since they’re identical.
> I’m testing things with 5 10Gbits/s clients, each running rados bench.
>
> But what I really don’t get is the following :
>
> -  With 1 host : throughput is 420MB/s
> -  With 2 hosts : I get 640MB/s. That’s surely not 2x420MB/s.
> -  With 5 hosts : I get around 1375MB/s . That’s far from the
> expected 2GB/s.
>
> The network never is maxed out, nor are the disks or CPUs.
> The hosts throughput I see with rados bench seems to match the dstat
> throughput.
> That’s as if each additional host was only capable of adding 220MB/s of
> throughput. Compare this to the 1GB/s they are capable of (420MB/s with
> journals)…
>
> I’m therefore wondering what could possibly be so wrong with my setup ??
> Why would it impact so much the performance to add hosts ?
>
> On the hardware side, I have Broadcom BCM57711 10-Gigabit PCIe cards.
> I know, not perfect, but not THAT bad either… ?
>
> Any hint would be greatly appreciated !
>
> Thanks
> Frederic Schaer
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-21 Thread Florent MONTHEL
Hi Frederic,

When you have a Ceph cluster with 1 node, you don’t experience the network and 
communication overhead due to the distributed model.
With 2 nodes and EC 4+1 you will have communication between the 2 nodes, but you 
will keep internal communication (2 chunks on the first node and 3 chunks on the 
second node).
On your configuration the EC pool is set up with 4+1, so you will have, for each 
write, overhead due to write spreading on 5 nodes (for 1 customer IO, you will 
experience 5 Ceph IOs due to EC 4+1).
It’s the reason why I think you’re reaching performance stability with 5 
nodes and more in your cluster.


> On Jul 20, 2015, at 10:35 AM, SCHAER Frederic  wrote:
> 
> Hi,
>  
> As I explained in various previous threads, I’m having a hard time getting 
> the most out of my test ceph cluster.
> I’m benching things with rados bench.
> All Ceph hosts are on the same 10GB switch.
>  
> Basically, I know I can get about 1GB/s of disk write performance per host, 
> when I bench things with dd (hundreds of dd threads) +iperf 10gbit 
> inbound+iperf 10gbit outbound.
> I also can get 2GB/s or even more if I don’t bench the network at the same 
> time, so yes, there is a bottleneck between disks and network, but I can’t 
> identify which one, and it’s not relevant for what follows anyway
> (Dell R510 + MD1200 + PERC H700 + PERC H800 here, if anyone has hints about 
> this strange bottleneck though…)
>  
> My hosts each are connected through a single 10Gbits/s link for now.
>  
> My problem is the following. Please note I see the same kind of poor 
> performance with replicated pools...
> When testing EC pools, I ended putting a 4+1 pool on a single node in order 
> to track down the ceph bottleneck.
> On that node, I can get approximately 420MB/s write performance using rados 
> bench, but that’s fair enough since the dstat output shows that real data 
> throughput on disks is about 800+MB/s (that’s the ceph journal effect, I 
> presume).
>  
> I tested Ceph on my other standalone nodes : I can also get around 420MB/s, 
> since they’re identical.
> I’m testing things with 5 10Gbits/s clients, each running rados bench.
>  
> But what I really don’t get is the following :
>  
> -  With 1 host : throughput is 420MB/s
> -  With 2 hosts : I get 640MB/s. That’s surely not 2x420MB/s.
> -  With 5 hosts : I get around 1375MB/s . That’s far from the 
> expected 2GB/s.
>  
> The network never is maxed out, nor are the disks or CPUs.
> The hosts throughput I see with rados bench seems to match the dstat 
> throughput.
> That’s as if each additional host was only capable of adding 220MB/s of 
> throughput. Compare this to the 1GB/s they are capable of (420MB/s with 
> journals)…
>  
> I’m therefore wondering what could possibly be so wrong with my setup ??
> Why would it impact so much the performance to add hosts ?
>  
> On the hardware side, I have Broadcom BCM57711 10-Gigabit PCIe cards.
> I know, not perfect, but not THAT bad either… ?
>  
> Any hint would be greatly appreciated !
>  
> Thanks
> Frederic Schaer
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-07-20 Thread SCHAER Frederic
Hi,

As I explained in various previous threads, I'm having a hard time getting the 
most out of my test ceph cluster.
I'm benching things with rados bench.
All Ceph hosts are on the same 10GB switch.

Basically, I know I can get about 1GB/s of disk write performance per host, 
when I bench things with dd (hundreds of dd threads) +iperf 10gbit 
inbound+iperf 10gbit outbound.
I also can get 2GB/s or even more if I don't bench the network at the same 
time, so yes, there is a bottleneck between disks and network, but I can't 
identify which one, and it's not relevant for what follows anyway
(Dell R510 + MD1200 + PERC H700 + PERC H800 here, if anyone has hints about 
this strange bottleneck though...)

My hosts each are connected through a single 10Gbits/s link for now.

My problem is the following. Please note I see the same kind of poor 
performance with replicated pools...
When testing EC pools, I ended putting a 4+1 pool on a single node in order to 
track down the ceph bottleneck.
On that node, I can get approximately 420MB/s write performance using rados 
bench, but that's fair enough since the dstat output shows that real data 
throughput on disks is about 800+MB/s (that's the ceph journal effect, I 
presume).

I tested Ceph on my other standalone nodes : I can also get around 420MB/s, 
since they're identical.
I'm testing things with 5 10Gbits/s clients, each running rados bench.

But what I really don't get is the following :


-  With 1 host : throughput is 420MB/s

-  With 2 hosts : I get 640MB/s. That's surely not 2x420MB/s.

-  With 5 hosts : I get around 1375MB/s . That's far from the expected 
2GB/s.

The network never is maxed out, nor are the disks or CPUs.
The hosts throughput I see with rados bench seems to match the dstat throughput.
That's as if each additional host was only capable of adding 220MB/s of 
throughput. Compare this to the 1GB/s they are capable of (420MB/s with 
journals)...

I'm therefore wondering what could possibly be so wrong with my setup ??
Why would it impact so much the performance to add hosts ?

On the hardware side, I have Broadcom BCM57711 10-Gigabit PCIe cards.
I know, not perfect, but not THAT bad either... ?

Any hint would be greatly appreciated !

Thanks
Frederic Schaer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com