Re: [ceph-users] IRQ balancing, distribution

Mark Nelson Mon, 22 Sep 2014 06:56:08 -0700

On 09/22/2014 01:55 AM, Christian Balzer wrote:


Hello,

not really specific to Ceph, but since one of the default questions by the
Ceph team when people are facing performance problems seems to be
"Have you tried turning it off and on again?" ^o^ err,
"Are all your interrupts on one CPU?"
I'm going to wax on about this for a bit and hope for some feedback from
others with different experiences and architectures than me.

This may be a result of me harping about this after a customer's
clusters had mysterious performance issues and where irqbalance didn't
appear to be working properly. :)


Now firstly that question if all your IRQ handling is happening on the
same CPU is a valid one, as depending on a bewildering range of factors
ranging from kernel parameters to actual hardware one often does indeed
wind up with that scenario, usually with all on CPU0.
Which certainly is the case with all my recent hardware and Debian
kernels.

Yes, there are certainly a lot of scenarios where this can happen. I
think the hope has been that with MSI-X, interrupts will get evenly
distributed by default and that is typically better than throwing them
all at core 0, but things are still quite complicated.
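
For anyone wanting to answer the "are all your interrupts on one CPU?" question on their own nodes, a minimal sketch follows. It is not from the thread: the function names are made up for illustration, and it assumes the usual /proc/interrupts layout (a header row of CPU names, then one row per interrupt with one counter per CPU).

```python
def per_cpu_irq_totals(text):
    """Sum interrupt counts per CPU from /proc/interrupts-style text.

    Assumes the first line lists the CPU columns and each following
    row looks like "<label>: <count per CPU>... <description>".
    """
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()
    totals = [0] * len(cpus)
    for line in lines[1:]:
        _label, _, rest = line.partition(":")
        counts = rest.split()[:len(cpus)]
        # Rows such as "ERR:" carry a single global counter rather
        # than one per CPU; skip anything without a full set of counts.
        if len(counts) != len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for i, count in enumerate(counts):
            totals[i] += int(count)
    return dict(zip(cpus, totals))


def irq_share(totals):
    """Turn per-CPU totals into fractions of all counted interrupts."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {cpu: 0.0 for cpu in totals}
    return {cpu: n / grand_total for cpu, n in totals.items()}
```

On a live host, feeding it the contents of /proc/interrupts (for example `irq_share(per_cpu_irq_totals(open("/proc/interrupts").read()))`) gives a rough picture of the distribution. The counters are cumulative since boot, so comparing two snapshots taken under load is more telling than a single reading.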


I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and
thus feedback from Intel users is very much sought after, as I'm
considering Intel based storage nodes in the future.
It's vaguely amusing that Ceph storage nodes seem to have more CPU
(individual core performance, not necessarily # of cores) and similar RAM
requirements than my VM hosts. ^o^

It might be reasonable to say that Ceph is a pretty intensive piece of
software. With lots of OSDs on a system there are hundreds if not
thousands of threads. Under heavy load conditions the CPUs, network
cards, HBAs, memory, socket interconnects, possibly SAS expanders are
all getting worked pretty hard and possibly in unusual ways where both
throughput and latency are important. At the cluster scale things like
switch bisection bandwidth and network topology become issues too. High
performance clustered storage is imho one of the most complicated
performance subjects in computing.

The good news is that much of this can be avoided by sticking to simple
designs with fewer OSDs per node. The more OSDs you try to stick in 1
system, the more you need to worry about all of this if you care about
high performance.


So the common wisdom is that all IRQs on one CPU is a bad thing, lest it
gets overloaded and for example drop network packets because of this.
And while that is true, I'm hard pressed to generate any load on my
clusters where the IRQ ratio on CPU0 goes much beyond 50%.

Thus it should come as no surprise that spreading out IRQs with irqbalance
or more accurately by manually setting the /proc/irq/xx/smp_affinity mask
doesn't give me any discernible differences when it comes to benchmark
results.

Ok, that's fine, but this is pretty subjective. Without knowing the
load and the hardware setup I don't think we can really draw any
conclusions other than that in your test on your hardware this wasn't
the bottleneck.
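
Since the /proc/irq/xx/smp_affinity mask comes up above, here is a hedged sketch of how such a mask can be computed. The helper names are hypothetical. It assumes the kernel's usual representation: a hexadecimal bitmask in which bit n stands for CPU n, shown as comma-separated 32-bit words with the most significant word first.

```python
def cpus_to_affinity_mask(cpus):
    """Build an smp_affinity-style hex mask from CPU numbers.

    Bit n of the mask selects CPU n. The result is split into
    comma-separated 32-bit words, most significant word first.
    """
    value = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        value |= 1 << cpu
    if value == 0:
        raise ValueError("at least one CPU is required")
    words = []
    while value:
        words.append(f"{value & 0xFFFFFFFF:08x}")
        value >>= 32
    return ",".join(reversed(words))


def affinity_mask_to_cpus(mask):
    """Inverse of the above: list the CPU numbers set in a hex mask.

    Assumes every word after the first is zero-padded to 8 digits.
    """
    value = int(mask.strip().replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]
```

For example, `cpus_to_affinity_mask([0, 1])` describes an IRQ allowed on CPU0 and CPU1. Writing the resulting string to /proc/irq/xx/smp_affinity needs root, and a running irqbalance may later rewrite the setting.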


With irqbalance spreading things out willy-nilly w/o any regards or
knowledge about the hardware and what IRQ does what it's definitely
something I won't be using out of the box. This goes especially for systems
with different NUMA regions without proper policyscripts for irqbalance.

I believe irqbalance takes PCI topology into account when making mapping
decisions. See:


http://dcs.nac.uci.edu/support/sysadmin/security/archive/msg09707.html


So for my current hardware I'm going to keep IRQs on CPU0 and CPU1 which
are the same Bulldozer module and thus sharing L2 and L3 cache.
In particular the AHCI (journal SSDs) and HBA or RAID controller IRQs on
CPU0 and the network (Infiniband) on CPU1.
That should give me sufficient reserves in processing power and keep intra
core (module) and NUMA (additional physical CPUs) traffic to a minimum.
This also will (within a certain load range) allow these 2 CPUs (module)
to be ramped up to full speed while other cores can remain at a lower
frequency.

So it's been a while since I looked at AMD CPU interconnect topology,
but back in the Magny-Cours era I drew up some diagrams:


2 socket:

https://docs.google.com/drawings/d/1_egexLqN14k9bhoN2nkv3iTgAbbPcwuwJmhwWAmakwo/edit?usp=sharing

4 socket:

https://docs.google.com/drawings/d/1V5sFSInKq3uuKRbETx1LVOURyYQF_9Z4zElPrl1YIrw/edit?usp=sharing

I think Interlagos looks somewhat similar from a HyperTransport
perspective. My gut instinct is that you really want to keep
everything you can local to the socket on these kinds of systems. So if
your HBA is on the first socket, you want your processing and interrupt
handling there too. In the 4-socket configuration this is especially
true. It's entirely possible that you may have to go through both an
on-die and an inter-socket HT link before you get to a neighbour CPU.
With the 2-socket configuration it's not quite as bad.

Intel CPUs in some ways are nicer because you have fewer cores that are
faster and often have much more straightforward interconnect topologies
(though at the high-end sometimes bizarre tradeoffs get made for memory
like "flexmem bridges" and such.) Better to just stick with a simpler
and straightforward architecture imho.


Now with Intel some PCIe lanes are handled by a specific CPU (that's why
you often see the need for adding a 2nd CPU to use all slots) and in that
case pinning the IRQ handling for those slots on a specific CPU might
actually make a lot of sense. Especially if not all the traffic generated
by that card will have to be transferred to the other CPU anyway.

You need to think about that on just about any multi-socket system
except possibly those that have full-throughput links to an external IO
hub from every socket.
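
To find out which CPUs are local to a given card before pinning anything, Linux exposes `numa_node` and `local_cpulist` files for each PCI device under sysfs. The sketch below is an illustration only, not something from the thread: the function names and the PCI address in the usage note are made up, and a `numa_node` value of -1 means the platform reported no locality information.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as "0-5,12-17" into CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        first, sep, last = part.partition("-")
        end = int(last) if sep else int(first)
        cpus.extend(range(int(first), end + 1))
    return cpus


def local_cpus_for_pci_device(pci_address, sysfs_root="/sys"):
    """Return (numa_node, local CPUs) for one PCI device.

    sysfs_root is a parameter only so the function can be pointed
    at a test directory; on a real host the default is what you want.
    """
    device = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_address
    numa_node = int((device / "numa_node").read_text())
    cpus = parse_cpulist((device / "local_cpulist").read_text())
    return numa_node, cpus
```

Calling `local_cpus_for_pci_device("0000:03:00.0")` with the address of an HBA or NIC (the one here is a placeholder) returns the node and the CPUs attached to it, which is the set worth considering for that device's interrupt handling.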



Christian


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
