Page reclamation in Linux is NUMA-aware, so page reclamation itself is not the issue here.
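For what it's worth, the kernel knob that decides whether an allocation reclaims locally or spills over to the other node is vm.zone_reclaim_mode. Below is a rough sketch (my own, not anything Ceph ships) that reads that setting and shows how much page cache currently sits on each node; the /proc and /sys paths are the standard Linux interfaces:

#!/usr/bin/env python3
# Rough check of NUMA reclamation behaviour on an OSD host.
# zone_reclaim_mode = 0 means the kernel prefers allocating from a remote
# node over reclaiming page cache on the local node.
import glob

with open("/proc/sys/vm/zone_reclaim_mode") as f:
    print("vm.zone_reclaim_mode =", f.read().strip())

# How much page cache currently sits on each NUMA node.
for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
    node = path.split("/")[-2]
    for line in open(path):
        if "FilePages" in line:
            print(node, "FilePages:", line.split(":", 1)[1].strip())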

You will only see performance improvements if all the components of a given IO
complete on a single core. That is hard to achieve in Ceph, because a single IO
goes through multiple thread switches and the threads are not bound to any
core. Starting an OSD with numactl and binding it to one core might actually
aggravate the problem, since all the threads spawned by that OSD will then
compete for that single core; an OSD with the default configuration has 20+
threads. Binding an already-running OSD process to one core with taskset does
not help either, as some memory (especially the heap) may already have been
allocated on the other NUMA node.
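To illustrate the taskset/heap point, here is a sketch (pid and node number are placeholders; the numa_maps and cpulist files are standard kernel interfaces) that first shows where a running OSD's pages already live and then pins every existing thread to all the cores of one NUMA node instead of a single core:

#!/usr/bin/env python3
# Sketch only: show where an OSD's pages currently live, then pin all of its
# existing threads to the cores of one NUMA node (not a single core).
# Usage: ./pin_osd.py <osd-pid> <numa-node>   (needs root for a ceph-osd pid)
import os, re, sys

pid, node = int(sys.argv[1]), int(sys.argv[2])

# /proc/<pid>/numa_maps lists per-mapping page counts as "N0=..." "N1=...".
pages = {}
with open(f"/proc/{pid}/numa_maps") as f:
    for line in f:
        for n, cnt in re.findall(r"N(\d+)=(\d+)", line):
            pages[int(n)] = pages.get(int(n), 0) + int(cnt)
print("pages already resident per node:", pages)

# CPUs that belong to the chosen node, from the kernel's cpulist file.
def parse_cpulist(path):
    cpus = set()
    for part in open(path).read().strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

cpus = parse_cpulist(f"/sys/devices/system/node/node{node}/cpulist")

# sched_setaffinity() acts per thread, so cover every TID the OSD already has.
for tid in os.listdir(f"/proc/{pid}/task"):
    os.sched_setaffinity(int(tid), cpus)
print(f"pinned {pid} and its threads to node {node} CPUs {sorted(cpus)}")

Pinning to a whole node at least keeps the threads and their heap on the same NUMA domain without starving them the way a single-core binding would.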

It looks like the design principle is to fan out by spawning multiple threads
at each pipeline stage in order to utilize all the available cores in the
system. Because an IO does not complete on the same core it was issued on, a
lot of cycles are lost to cache-coherency traffic.
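You can see that fan-out (and the lack of any binding) directly from /proc; a small sketch of mine (pid is a placeholder) that lists each OSD thread and the CPU it last ran on:

#!/usr/bin/env python3
# Sketch: list an OSD's threads and the CPU each one last ran on.
# Field 39 of /proc/<tid>/stat ("processor") is the CPU the task last used.
import os, sys

pid = int(sys.argv[1])
for tid in sorted(os.listdir(f"/proc/{pid}/task"), key=int):
    with open(f"/proc/{pid}/task/{tid}/stat") as f:
        # split after the ")" that closes the comm field, which may contain spaces
        rest = f.read().rsplit(")", 1)[1].split()
    print(f"tid {tid}: last ran on CPU {rest[36]}")

Running it a few times during a 4K workload shows the threads wandering across both sockets.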

Regards,
Anand



-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Stijn 
De Weirdt
Sent: Monday, September 22, 2014 2:36 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] IRQ balancing, distribution

>> but another issue is the OSD processes: do you pin those as well? and
>> how much data do they actually handle. to checksum, the OSD process
>> needs all data, so that can also cause a lot of NUMA traffic, esp if
>> they are not pinned.
>>
> That's why all my (production) storage nodes have only a single 6 or 8
> core CPU. Unfortunately that also limits the amount of RAM in there,
> 16GB modules have just recently become an economically viable
> alternative to 8GB ones.
>
> Thus I don't pin OSD processes, given that on my 8 core nodes with 8
> OSDs and 4 journal SSDs I can make Ceph eat babies and nearly all CPU
> (not
> IOwait!) resources with the right (or is that wrong) tests, namely 4K
> FIOs.
>
> The linux scheduler usually is quite decent in keeping processes where
> the action is, thus you see for example a clear preference of DRBD or
> KVM vnet processes to be "near" or on the CPU(s) where the IRQs are.
The scheduler has improved recently, but I don't know since which version
(certainly not backported to the RHEL6 kernel).

Pinning the OSDs might actually be a bad idea, unless the page cache is flushed
before each OSD restart. The kernel VM has this nice "feature" where allocating
memory in a NUMA domain does not trigger freeing of cache memory in that domain;
instead it will first try to allocate memory on another NUMA domain. Although
the VM cache will typically be maxed out on OSD boxes, I'm not sure the cache
clearing itself is NUMA-aware, so who knows where the memory ends up when it is
allocated.


stijn

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
