hi,
Page reclamation in Linux is NUMA-aware, so page reclamation is not
an issue.

Except for the first min_free_kbytes? Those can come from anywhere, no? Or is the reclamation such that it tries to free an equal portion from each NUMA domain? If the OSD allocates memory in chunks smaller than that value, you might be lucky.
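For reference, a quick way to check the watermark and the per-node free memory (standard sysctl/procfs paths; the numbers are of course machine-specific):

    # global low watermark the kernel tries to keep free
    cat /proc/sys/vm/min_free_kbytes
    # per-NUMA-node memory totals and free pages
    numactl --hardware
    # more detail for one node
    cat /sys/devices/system/node/node0/meminfo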

You can see performance improvements only if all the components of a
given IO complete on a single core. This is hard to achieve in Ceph,
as a single IO goes through multiple thread switches and the threads
are not bound to any core. Starting an OSD with numactl and binding
it to one core might aggravate the problem, as all the threads spawned
by that OSD will compete for CPU on a single core. An OSD with the
default configuration has 20+ threads. Binding the OSD process to
one core using taskset does not help, as some memory (especially the
heap) may already be allocated on the other NUMA node.
That is not true if you start the process under numactl, is it?

But binding an OSD to a NUMA domain makes sense.
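Something like this sketch, assuming NUMA node 0 and OSD id 0 (both just placeholders):

    # run one OSD with both its threads and its memory allocations
    # restricted to NUMA node 0; --membind keeps heap pages node-local
    numactl --cpunodebind=0 --membind=0 ceph-osd -i 0 -c /etc/ceph/ceph.conf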


It looks like the design principle followed is to fan out by spawning
multiple threads at each pipeline stage to utilize the available
cores in the system. Because IOs don't complete on the same core they
were issued on, many cycles are lost to cache-coherency traffic.
Is Intel HT a solution/help for this? Turn on HT and start the OSD bound to a single L2 cache (e.g. with hwloc-bind).
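A rough sketch of what I mean (on Intel, the two hardware threads of a core share L1/L2, so binding to the core gives you both PUs on the same L2; core 0 and OSD id 0 are placeholders):

    # bind the OSD to physical core 0, i.e. both hyperthreads
    # that share that core's L2 cache
    hwloc-bind core:0 -- ceph-osd -i 0 -c /etc/ceph/ceph.conf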

As a more general question: the recommendation for Ceph is one CPU core per OSD; can these be HT cores, or must they be actual physical cores?



stijn


Regards, Anand



-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Stijn De Weirdt
Sent: Monday, September 22, 2014 2:36 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] IRQ balancing, distribution

But another issue is the OSD processes: do you pin those as well?
And how much data do they actually handle? To checksum, the OSD
process needs all the data, so that can also cause a lot of NUMA
traffic, especially if they are not pinned.
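One way to see whether an unpinned OSD's memory ends up spread over both nodes (numastat ships with the numactl package):

    # per-NUMA-node memory breakdown of a running OSD
    numastat -p $(pidof -s ceph-osd)
    # system-wide NUMA hit/miss counters
    numastat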

That's why all my (production) storage nodes have only a single 6-
or 8-core CPU. Unfortunately that also limits the amount of RAM in
there; 16GB modules have only recently become an economically
viable alternative to 8GB ones.

Thus I don't pin OSD processes, given that on my 8-core nodes with
8 OSDs and 4 journal SSDs I can make Ceph eat babies and nearly all
CPU (not IOwait!) resources with the right (or is that wrong)
tests, namely 4K FIOs.
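Something like the following fio invocation (parameters are illustrative, not the exact job file; careful, it writes to the raw device and destroys its contents — /dev/sdX is a placeholder):

    # 4K random writes, direct I/O, queue depth 32, against one data disk
    fio --name=4ktest --filename=/dev/sdX --rw=randwrite --bs=4k \
        --ioengine=libaio --direct=1 --iodepth=32 \
        --runtime=60 --time_based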

The Linux scheduler is usually quite decent at keeping processes
where the action is; thus you see, for example, a clear preference
for DRBD or KVM vnet processes to be "near" or on the CPU(s) where
the IRQs are.
The scheduler has improved recently, but I don't know since what
version (certainly not backported to the RHEL6 kernel).

Pinning the OSDs might actually be a bad idea, unless the page cache
is flushed before each OSD restart. The kernel VM has this nice
"feature" where allocating memory in a NUMA domain does not trigger
freeing of cache memory in that domain; instead it will first try to
allocate memory on another NUMA domain. Although typically the VM
cache will be maxed out on OSD boxes, I'm not sure the cache clearing
itself is NUMA-aware, so who knows where the memory ends up when it's
allocated.
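If you do pin, flushing the cache before the restart would look something like this (zone_reclaim_mode controls whether the kernel reclaims node-locally before falling back to the other node; 0 is the usual default):

    # write back dirty data and drop the page cache before restarting
    # the OSD, so its fresh allocations aren't pushed to the remote node
    sync
    echo 3 > /proc/sys/vm/drop_caches
    # 0 = fall back to other nodes, 1 = reclaim locally first
    cat /proc/sys/vm/zone_reclaim_mode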


stijn
