Re: [ceph-users] NUMA and ceph ... zone_reclaim_mode

2015-01-13 Thread Mark Nelson

On 01/12/2015 07:47 AM, Dan van der Ster wrote:

(resending to list)

Hi Kyle,
I'd like to +10 this old proposal of yours. Let me explain why...

A couple months ago we started testing a new use-case with radosgw --
this new user is writing millions of small files and has been causing
us some headaches. Since starting these tests, the relevant OSDs have
been randomly freezing for up to ~60s at a time. We have dedicated
servers for this use-case, so it doesn't affect our important RBD
users, and the OSDs always came back anyway ("wrongly marked me
down"...). So I didn't give this problem much attention, though I
guessed that we must be suffering from some network connectivity
problem.

But last week I started looking into this problem in more detail. With
increased debug_osd logs I saw that when these OSDs are getting marked
down, even the osd tick message is not printed for >30s. I also
correlated these outages with massive drops in cached memory -- it
looked as if an admin was running drop_caches on our live machines.
Here is what we saw:

 
https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, each OSD
has ~1 million files totalling around 40GB (~40kB objects). Compare
that with a different OSD host, one that's used for Cinder RBD volumes
(and doesn't suffer from the freezing OSD problem):

 
https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20
OSDs each hold around 100k files totalling ~400GB (~4MB objects).

Clearly the 10x increase in num files on the radosgw OSDs appears to
be causing a problem. In fact, since the servers are pretty idle most
of the time, it appears that the _scrubbing_ of these 20 million files
per server is causing the problem. It seems that scrubbing is creating
quite some memory pressure (via the inode cache, especially), so I
started testing different vfs_cache_pressure values (1,10,1000,1).
The only value that sort of helped was vfs_cache_pressure = 1, but
keeping all the inodes cached is a pretty extreme measure, and it
won't scale up when these OSDs are more full (they're only around 1%
full now!).

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and
this old thread. And I read a bit more, e.g.

 
http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
 http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB
communities regard this option as very bad for servers -- MongoDB even
prints a warning message at startup if zone_reclaim_mode is enabled.
And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is
disabled by default. The vm doc now says:

 zone_reclaim_mode is disabled by default. For file servers or
workloads that benefit from having their data cached,
zone_reclaim_mode should be left disabled as the caching effect is
likely to be more important than data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the
freezing OSD problem has gone away. Here's a plot of a server that had
zone_reclaim_mode set to zero late on Jan 9th:

 
https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all on one host, but
it doesn't appear to make a huge difference beyond disabling numa zone
reclaim.

Moving forward, I think it would be good for Ceph to at least document
this behaviour, but better would be to also detect when
zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This
line from the commit which disables it in the kernel is pretty wise,
IMHO: "On current machines and workloads it is often the case that
zone_reclaim_mode destroys performance but not all users know how to
detect this. Favour the common case and disable it by default."



Interestingly I was seeing behavior that looked like this as well, 
though it manifested as OSDs going down with internal heartbeat timeouts 
during heavy 4K read/write benchmarks.  I was able to observe major page 
faults on the OSD nodes correlated with the "frozen" period despite 
significant amounts of memory used for buffer cache.  I also went down 
the vfs_cache_pressure path.  Changing it to 1 fixed the issue.  I didn't 
think to go back and look at zone_reclaim_mode though!  Since then the 
system has been upgraded to fedora 21 with a 3.17 kernel and the issue 
no longer manifested itself.  I suspect now that this is due to the new 
default behavior.  Perhaps this solves the puzzle.  What is interesting 
is that at least on our test node this behavior didn't occur prior to 
firefly.  It may be that some change we made exacerbated the problem. 
Anyway, thank you for the excellent analysis Dan!


Mark

Re: [ceph-users] NUMA and ceph ... zone_reclaim_mode

2015-01-12 Thread Dan van der Ster
(resending to list)

Hi Kyle,
I'd like to +10 this old proposal of yours. Let me explain why...

A couple months ago we started testing a new use-case with radosgw --
this new user is writing millions of small files and has been causing
us some headaches. Since starting these tests, the relevant OSDs have
been randomly freezing for up to ~60s at a time. We have dedicated
servers for this use-case, so it doesn't affect our important RBD
users, and the OSDs always came back anyway ("wrongly marked me
down"...). So I didn't give this problem much attention, though I
guessed that we must be suffering from some network connectivity
problem.

But last week I started looking into this problem in more detail. With
increased debug_osd logs I saw that when these OSDs are getting marked
down, even the osd tick message is not printed for >30s. I also
correlated these outages with massive drops in cached memory -- it
looked as if an admin was running drop_caches on our live machines.
Here is what we saw:


https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, each OSD
has ~1 million files totalling around 40GB (~40kB objects). Compare
that with a different OSD host, one that's used for Cinder RBD volumes
(and doesn't suffer from the freezing OSD problem):


https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20
OSDs each hold around 100k files totalling ~400GB (~4MB objects).

Clearly the 10x increase in num files on the radosgw OSDs appears to
be causing a problem. In fact, since the servers are pretty idle most
of the time, it appears that the _scrubbing_ of these 20 million files
per server is causing the problem. It seems that scrubbing is creating
quite some memory pressure (via the inode cache, especially), so I
started testing different vfs_cache_pressure values (1,10,1000,1).
The only value that sort of helped was vfs_cache_pressure = 1, but
keeping all the inodes cached is a pretty extreme measure, and it
won't scale up when these OSDs are more full (they're only around 1%
full now!).
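For anyone who wants to check their own nodes, reading the knob back is easy (the sysctl path is standard; changing it needs root, so that part is only shown as a comment):

```shell
# Read the current dentry/inode cache reclaim tendency. Lower values make
# the kernel hold on to cached inodes/dentries longer; 100 is the default.
show_cache_pressure() {
  f=/proc/sys/vm/vfs_cache_pressure
  if [ -r "$f" ]; then
    echo "vm.vfs_cache_pressure=$(cat "$f")"
  else
    echo "vm.vfs_cache_pressure=unavailable"
  fi
}
show_cache_pressure
# To experiment (as root): sysctl -w vm.vfs_cache_pressure=1
```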

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and
this old thread. And I read a bit more, e.g.


http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB
communities regard this option as very bad for servers -- MongoDB even
prints a warning message at startup if zone_reclaim_mode is enabled.
And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is
disabled by default. The vm doc now says:

zone_reclaim_mode is disabled by default. For file servers or
workloads that benefit from having their data cached,
zone_reclaim_mode should be left disabled as the caching effect is
likely to be more important than data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the
freezing OSD problem has gone away. Here's a plot of a server that had
zone_reclaim_mode set to zero late on Jan 9th:


https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0
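Concretely, the change was along these lines (the sysctl is real; persisting it via sysctl.conf is the usual route):

```shell
# Read the current setting; prints "unknown" if the kernel doesn't expose it.
current_zone_reclaim() {
  f=/proc/sys/vm/zone_reclaim_mode
  if [ -r "$f" ]; then cat "$f"; else echo "unknown"; fi
}
current_zone_reclaim
# To disable (as root), at runtime and across reboots:
#   sysctl -w vm.zone_reclaim_mode=0
#   echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf
```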

I also used numactl --interleave=all on one host, but
it doesn't appear to make a huge difference beyond disabling numa zone
reclaim.

Moving forward, I think it would be good for Ceph to at least document
this behaviour, but better would be to also detect when
zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This
line from the commit which disables it in the kernel is pretty wise,
IMHO: "On current machines and workloads it is often the case that
zone_reclaim_mode destroys performance but not all users know how to
detect this. Favour the common case and disable it by default."
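A MongoDB-style startup check could be as small as the sketch below (the warning text is illustrative, not an existing Ceph message):

```shell
# Warn when NUMA zone reclaim is enabled; pass in the value read from
# /proc/sys/vm/zone_reclaim_mode.
warn_zone_reclaim() {
  mode="$1"
  if [ "$mode" != "0" ]; then
    echo "WARNING: vm.zone_reclaim_mode=$mode can stall OSDs; consider setting it to 0"
    return 1
  fi
  return 0
}
f=/proc/sys/vm/zone_reclaim_mode
if [ -r "$f" ]; then warn_zone_reclaim "$(cat "$f")" || true; fi
```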

Cheers, Dan


On Thu, Dec 12, 2013 at 11:30 PM, Kyle Bader  wrote:
> It seems that NUMA can be problematic for ceph-osd daemons in certain
> circumstances. Namely it seems that if a NUMA zone is running out of
> memory due to uneven allocation it is possible for a NUMA zone to
> enter reclaim mode when threads/processes are scheduled on a core in
> that zone and those processes request memory allocations greater
> than the zone's remaining memory. In order for the kernel to satisfy
> the memory allocation for those processes it needs to page out some of
> the contents of the contentious zone, which can have dramatic
> performance implications due to cache misses, etc. I see two ways an
> operator could alleviate these issues:
>
> Set the vm.zone_reclaim_mode sysctl setting to 0, along with prefixing
> ceph-osd daemons with "numactl --interleave=all". This should probably
> be activated by a flag in /etc/default/ceph and modifying the
> ceph-osd.conf upstart script, along with adding a depend to the ceph
> package's debian/rules file on the "numactl" package.

Re: [ceph-users] NUMA and ceph

2013-12-12 Thread Mark Nelson

On 12/12/2013 04:30 PM, Kyle Bader wrote:

It seems that NUMA can be problematic for ceph-osd daemons in certain
circumstances. Namely it seems that if a NUMA zone is running out of
memory due to uneven allocation it is possible for a NUMA zone to
enter reclaim mode when threads/processes are scheduled on a core in
that zone and those processes request memory allocations greater
than the zone's remaining memory. In order for the kernel to satisfy
the memory allocation for those processes it needs to page out some of
the contents of the contentious zone, which can have dramatic
performance implications due to cache misses, etc. I see two ways an
operator could alleviate these issues:


Yes, quite possibly I think, though I'd be curious to see what impact 
testing would show on modern dual socket Intel boxes.  I suspect this 
could especially be an issue on quad socket AMD boxes though, especially 
Magny-Cours era.




Set the vm.zone_reclaim_mode sysctl setting to 0, along with prefixing
ceph-osd daemons with "numactl --interleave=all". This should probably
be activated by a flag in /etc/default/ceph and modifying the
ceph-osd.conf upstart script, along with adding a depend to the ceph
package's debian/rules file on the "numactl" package.

The alternative is to use a cgroup for each ceph-osd daemon, pinning
each one to cores in the same NUMA zone using cpuset.cpus and
cpuset.mems. This would probably also live in /etc/default/ceph and
the upstart scripts.



Seems reasonable unless we are testing OSDs that we (eventually?) want 
to have utilize cores on multiple sockets.  If possible, pinning the OSD 
to whatever CPU has the associated PCIE bus and NIC would be ideal, 
though there's no real good automated way to do that yet afaik.
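For what it's worth, the kernel does expose the NUMA node behind a NIC's PCIe slot, so a rough mapping is at least scriptable (eth0 below is a placeholder; a value of -1 means single-node or unknown):

```shell
# Report the NUMA node a network device's PCIe slot is attached to.
nic_numa_node() {
  f="/sys/class/net/$1/device/numa_node"
  if [ -r "$f" ]; then cat "$f"; else echo "unknown"; fi
}
nic_numa_node eth0
```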


Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NUMA and ceph

2013-12-12 Thread Kyle Bader
It seems that NUMA can be problematic for ceph-osd daemons in certain
circumstances. Namely it seems that if a NUMA zone is running out of
memory due to uneven allocation it is possible for a NUMA zone to
enter reclaim mode when threads/processes are scheduled on a core in
that zone and those processes request memory allocations greater
than the zone's remaining memory. In order for the kernel to satisfy
the memory allocation for those processes it needs to page out some of
the contents of the contentious zone, which can have dramatic
performance implications due to cache misses, etc. I see two ways an
operator could alleviate these issues:
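The uneven per-zone allocation described above is visible with the standard NUMA tools; a starting point (numactl/numastat must be installed for the commands mentioned in the comments):

```shell
# Enumerate the NUMA nodes the kernel knows about. On a NUMA box you'd
# then compare per-node free memory with `numactl --hardware` and watch
# the numa_hit/numa_miss counters with `numastat`.
list_numa_nodes() {
  for d in /sys/devices/system/node/node[0-9]*; do
    [ -d "$d" ] && basename "$d"
  done
  return 0
}
list_numa_nodes
```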

Set the vm.zone_reclaim_mode sysctl setting to 0, along with prefixing
ceph-osd daemons with "numactl --interleave=all". This should probably
be activated by a flag in /etc/default/ceph and modifying the
ceph-osd.conf upstart script, along with adding a depend to the ceph
package's debian/rules file on the "numactl" package.
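A sketch of how that flag could be wired up (the variable name and paths are made up for illustration, not an existing Ceph interface):

```shell
# /etc/default/ceph would gain something like:
#   CEPH_OSD_WRAPPER="numactl --interleave=all"
# and the upstart exec line would prepend it when set:
build_exec_line() {
  wrapper="$1"; osd_id="$2"
  echo "${wrapper:+$wrapper }/usr/bin/ceph-osd --cluster ceph -i $osd_id"
}
build_exec_line "numactl --interleave=all" 0
build_exec_line "" 0
```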

The alternative is to use a cgroup for each ceph-osd daemon, pinning
each one to cores in the same NUMA zone using cpuset.cpus and
cpuset.mems. This would probably also live in /etc/default/ceph and
the upstart scripts.
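For the cgroup route, the writes would look roughly like this (cgroup v1 cpuset paths; the function only prints the commands, since actually running them needs root, and the OSD name/CPU range/pid are examples):

```shell
# Emit the commands that pin a process to the CPUs and memory of one NUMA
# node via a cpuset cgroup.
cpuset_pin_cmds() {
  name="$1"; cpus="$2"; mems="$3"; pid="$4"
  echo "mkdir -p /sys/fs/cgroup/cpuset/$name"
  echo "echo $cpus > /sys/fs/cgroup/cpuset/$name/cpuset.cpus"
  echo "echo $mems > /sys/fs/cgroup/cpuset/$name/cpuset.mems"
  echo "echo $pid > /sys/fs/cgroup/cpuset/$name/tasks"
}
cpuset_pin_cmds osd.0 0-5 0 12345
```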

-- 
Kyle