Re: [ceph-users] NUMA and ceph ... zone_reclaim_mode

2015-01-13 Thread Mark Nelson

On 01/12/2015 07:47 AM, Dan van der Ster wrote:

(resending to list)

Hi Kyle,
I'd like to +10 this old proposal of yours. Let me explain why...

A couple of months ago we started testing a new use case with radosgw --
this new user is writing millions of small files and has been causing
us some headaches. Since starting these tests, the relevant OSDs have
been randomly freezing for up to ~60s at a time. We have dedicated
servers for this use case, so it doesn't affect our important RBD
users, and the OSDs always came back anyway ("wrongly marked me
down..."). So I didn't give this problem much attention, though I
guessed that we must be suffering from some network connectivity
problem.

But last week I started looking into this problem in more detail. With
increased debug_osd logs I saw that when these OSDs are getting marked
down, even the osd tick message is not printed for 30s. I also
correlated these outages with massive drops in cached memory -- it
looked as if an admin was running drop_caches on our live machines.
Here is what we saw:

 
https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth of cached pages. That server has 20 OSDs; each OSD
has ~1 million files totalling around 40GB (~40kB objects). Compare
that with a different OSD host, one that's used for Cinder RBD volumes
(and doesn't suffer from the freezing OSD problem):

 
https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20
OSDs each hold around 100k files totalling ~400GB (~4MB objects).

The 10x increase in the number of files on the radosgw OSDs clearly
appears to be causing a problem. In fact, since the servers are pretty
idle most of the time, it appears that the _scrubbing_ of these 20
million files per server is causing the problem. It seems that
scrubbing creates quite some memory pressure (via the inode cache,
especially), so I started testing different vfs_cache_pressure values
(1, 10, 1000, 1). The only value that sort of helped was
vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty
extreme measure, and it won't scale up when these OSDs are fuller
(they're only around 1% full now!!).
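
For reference, a minimal sketch of how such a value can be tried on a
live box via sysctl (the persistence file name below is just an
example):

    # check the current value
    cat /proc/sys/vm/vfs_cache_pressure

    # try a new value at runtime (takes effect immediately)
    sysctl -w vm.vfs_cache_pressure=1

    # make it persistent across reboots (example file name)
    echo 'vm.vfs_cache_pressure = 1' > /etc/sysctl.d/90-vfs-cache.conf
    sysctl --system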

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and
this old thread. And I read a bit more, e.g.

 
http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
 http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB
communities regard this option as very bad for servers -- MongoDB even
prints a warning message at startup if zone_reclaim_mode is enabled.
And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is
disabled by default. The vm doc now says:

 zone_reclaim_mode is disabled by default. For file servers or
workloads that benefit from having their data cached,
zone_reclaim_mode should be left disabled as the caching effect is
likely to be more important than data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the
freezing OSD problem has gone away. Here's a plot of a server that had
zone_reclaim_mode set to zero late on Jan 9th:

 
https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0
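
For completeness, the change itself is just a sysctl; roughly (with an
example file name for persistence):

    # 1 (or higher) means zone reclaim is enabled
    cat /proc/sys/vm/zone_reclaim_mode

    # disable it at runtime
    sysctl -w vm.zone_reclaim_mode=0

    # and keep it disabled across reboots (example file name)
    echo 'vm.zone_reclaim_mode = 0' > /etc/sysctl.d/99-zone-reclaim.conf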

I also ran the ceph command under 'numactl --interleave=all' on one
host, but it doesn't appear to make a huge difference beyond disabling
NUMA zone reclaim.

Moving forward, I think it would be good for Ceph to at least document
this behaviour, but better still would be to also detect when
zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This
line from the commit which disables it in the kernel is pretty wise,
IMHO: "On current machines and workloads it is often the case that
zone_reclaim_mode destroys performance but not all users know how to
detect this. Favour the common case and disable it by default."
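
Something as simple as the following startup check (a rough sketch in
shell, not actual Ceph code) would probably already be enough:

    # warn if NUMA zone reclaim is enabled, in the spirit of MongoDB's
    # startup check
    zrm=$(cat /proc/sys/vm/zone_reclaim_mode 2>/dev/null || echo 0)
    if [ "$zrm" -ne 0 ]; then
        echo "WARNING: vm.zone_reclaim_mode = $zrm; this is known to hurt" >&2
        echo "         performance on NUMA machines, consider setting it to 0" >&2
    fi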



Interestingly I was seeing behavior that looked like this as well,
though it manifested as OSDs going down with internal heartbeat timeouts
during heavy 4K read/write benchmarks. I was able to observe major page
faults on the OSD nodes correlated with the frozen periods, despite
significant amounts of memory being used for buffer cache. I also went
down the vfs_cache_pressure path. Changing it to 1 fixed the issue. I
didn't think to go back and look at zone_reclaim_mode though! Since then
the system has been upgraded to Fedora 21 with a 3.17 kernel and the
issue no longer manifests itself. I suspect now that this is due to the
new default behavior. Perhaps this solves the puzzle. What is
interesting is that, at least on our test node, this behavior didn't
occur prior to firefly. It may be that some change we made exacerbated
the problem. Anyway, thank you for the excellent analysis Dan!


Mark

[ceph-users] ssd osd fails often with FAILED assert(soid < scrubber.start || soid >= scrubber.end)

2015-01-13 Thread Udo Lembke
Hi,
since last Thursday we have had an ssd-pool (cache tier) in front of an
ec-pool and have been filling the pools with data via rsync (approx.
50MB/s).
The ssd-pool has three disks, and one of them (a DC S3700) has failed
four times since then.
I simply start the osd again, and the pool rebuilds and works again
for some hours up to some days.

I switched the ceph node and the ssh-adapter, but this didn't solve
the issue.
There weren't any messages in syslog/messages, and an fsck ran without
trouble, so I guess the problem is not OS-related.

I found this issue http://tracker.ceph.com/issues/8747 but my
ceph version is newer (debian: ceph version 0.80.7
(6c0127fcb58008793d3c8b62d925bc91963672a3)),
and it looks like I can reproduce this issue within 1-3 days.

The OSD is ext4-formatted. All other OSDs (62) run without trouble.

# more ceph-osd.61.log
2015-01-13 16:29:26.494458 7fedf9a3d700  0 log [INF] : 17.0 scrub ok
2015-01-13 17:29:03.988530 7fedf823a700  0 log [INF] : 17.16 scrub ok
2015-01-13 17:30:31.901032 7fedf8a3b700  0 log [INF] : 17.18 scrub ok
2015-01-13 17:31:58.983736 7fedf823a700  0 log [INF] : 17.9 scrub ok
2015-01-13 17:32:30.780308 7fedf9a3d700  0 log [INF] : 17.c scrub ok
2015-01-13 17:32:33.311433 7fedf8a3b700  0 log [INF] : 17.11 scrub ok
2015-01-13 17:37:22.237214 7fedf9a3d700  0 log [INF] : 17.7 scrub ok
2015-01-13 20:15:07.874376 7fedf6236700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int,
bool)' thread 7fedf6236700 time 2015-01-13 20:15:07.853440
osd/ReplicatedPG.cc: 5306: FAILED assert(soid < scrubber.start || soid
>= scrubber.end)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int,
bool)+0x1320) [0x9296b0]
 2: (ReplicatedPG::try_flush_mark_clean(boost::shared_ptr<ReplicatedPG::FlushOp>)+0x5f6) [0x92b076]
 3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x296)
[0x92b876]
 4: (C_Flush::finish(int)+0x86) [0x986226]
 5: (Context::complete(int)+0x9) [0x78f449]
 6: (Finisher::finisher_thread_entry()+0x1c8) [0xad5a18]
 7: (()+0x6b50) [0x7fee152f6b50]
 8: (clone()+0x6d) [0x7fee13f047bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

--- begin dump of recent events ---
   -70 2015-01-11 19:54:47.962164 7fee15dd4780  5 asok(0x2f56230)
register_command perfcounters_dump hook 0x2f44010
   -69 2015-01-11 19:54:47.962190 7fee15dd4780  5 asok(0x2f56230)
register_command 1 hook 0x2f44010
   -68 2015-01-11 19:54:47.962195 7fee15dd4780  5 asok(0x2f56230)
register_command perf dump hook 0x2f44010
   -67 2015-01-11 19:54:47.962201 7fee15dd4780  5 asok(0x2f56230)
register_command perfcounters_schema hook 0x2f44010
   -66 2015-01-11 19:54:47.962203 7fee15dd4780  5 asok(0x2f56230)
register_command 2 hook 0x2f44010
   -65 2015-01-11 19:54:47.962207 7fee15dd4780  5 asok(0x2f56230)
register_command perf schema hook 0x2f44010
   -64 2015-01-11 19:54:47.962209 7fee15dd4780  5 asok(0x2f56230)
register_command config show hook 0x2f44010
   -63 2015-01-11 19:54:47.962214 7fee15dd4780  5 asok(0x2f56230)
register_command config set hook 0x2f44010
   -62 2015-01-11 19:54:47.962219 7fee15dd4780  5 asok(0x2f56230)
register_command config get hook 0x2f44010
   -61 2015-01-11 19:54:47.962223 7fee15dd4780  5 asok(0x2f56230)
register_command log flush hook 0x2f44010
   -60 2015-01-11 19:54:47.962226 7fee15dd4780  5 asok(0x2f56230)
register_command log dump hook 0x2f44010
   -59 2015-01-11 19:54:47.962229 7fee15dd4780  5 asok(0x2f56230)
register_command log reopen hook 0x2f44010
   -58 2015-01-11 19:54:47.965000 7fee15dd4780  0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-osd, pid 11735
   -57 2015-01-11 19:54:47.967362 7fee15dd4780  1 finished global_init_daemonize
   -56 2015-01-11 19:54:47.971666 7fee15dd4780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features: FIEMAP ioctl is supported and appears to work
   -55 2015-01-11 19:54:47.971682 7fee15dd4780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
   -54 2015-01-11 19:54:47.973281 7fee15dd4780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features: syscall(SYS_syncfs, fd) fully supported
   -53 2015-01-11 19:54:47.975393 7fee15dd4780  0 filestore(/var/lib/ceph/osd/ceph-61) limited size xattrs
   -52 2015-01-11 19:54:48.013905 7fee15dd4780  0 filestore(/var/lib/ceph/osd/ceph-61) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
   -51 2015-01-11 19:54:49.245360 7fee15dd4780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features: FIEMAP ioctl is supported and appears to work
   -50 2015-01-11 19:54:49.245370 7fee15dd4780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
   -49 2015-01-11 19:54:49.247017 7fee15dd4780  0
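
Side note: to map the raw frame addresses in a backtrace like the one
above to source lines, something along these lines usually works,
assuming the matching ceph-osd binary and its debug symbols are
installed (the binary path is just an example):

    # disassemble with interleaved source, as the NOTE in the log suggests
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.dis

    # or resolve a single frame address, e.g. frame 1 of the backtrace
    addr2line -Cfe /usr/bin/ceph-osd 0x9296b0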

Re: [ceph-users] CRUSH question - failing to rebalance after failure test

2015-01-13 Thread Sage Weil
On Mon, 12 Jan 2015, Christopher Kunz wrote:
 Hi,
 
 [redirecting back to list]
  Oh, it could be that... can you include the output from 'ceph osd tree'?  
  That's a more concise view that shows up/down, weight, and in/out.
  
  Thanks!
  sage
  
 
 root@cepharm17:~# ceph osd tree
 # id    weight  type name               up/down reweight
 -1      0.52    root default
 -21     0.16            chassis board0
 -2      0.032                   host cepharm11
 0       0.032                           osd.0   up      1
 -3      0.032                   host cepharm12
 1       0.032                           osd.1   up      1
 -4      0.032                   host cepharm13
 2       0.032                           osd.2   up      1
 -5      0.032                   host cepharm14
 3       0.032                           osd.3   up      1
 -6      0.032                   host cepharm16
 4       0.032                           osd.4   up      1
 -22     0.18            chassis board1
 -7      0.03                    host cepharm18
 5       0.03                            osd.5   up      1
 -8      0.03                    host cepharm19
 6       0.03                            osd.6   up      1
 -9      0.03                    host cepharm20
 7       0.03                            osd.7   up      1
 -10     0.03                    host cepharm21
 8       0.03                            osd.8   up      1
 -11     0.03                    host cepharm22
 9       0.03                            osd.9   up      1
 -12     0.03                    host cepharm23
 10      0.03                            osd.10  up      1
 -23     0.18            chassis board2
 -13     0.03                    host cepharm25
 11      0.03                            osd.11  up      1
 -14     0.03                    host cepharm26
 12      0.03                            osd.12  up      1
 -15     0.03                    host cepharm27
 13      0.03                            osd.13  up      1
 -16     0.03                    host cepharm28
 14      0.03                            osd.14  up      1
 -17     0.03                    host cepharm29
 15      0.03                            osd.15  up      1
 -18     0.03                    host cepharm30
 16      0.03                            osd.16  up      1

Okay, it sounds like something is not quite right then.  Can you attach 
the OSDMap once it is in the not-quite-repaired state?  And/or try 
setting 'ceph osd crush tunables optimal' and see if that has any 
effect?
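
Roughly, something like this should do (a sketch; the output path is
just an example):

    # dump the current osdmap so it can be attached / inspected
    ceph osd getmap -o /tmp/osdmap
    osdmaptool /tmp/osdmap --print

    # switch to the optimal CRUSH tunables (this will trigger data movement)
    ceph osd crush tunables optimal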

 I am working on one of these boxes:
 http://www.ambedded.com.tw/pt_spec.php?P_ID=20141109001
 So, each chassis is one 7-node board (with a shared 1gbe switch and
 shared electrical supply), and I figured each board is definitely a
 separate failure domain.

Cute!  That kind of looks like 3 sleds of 7 in one chassis though?  Or am 
I looking at the wrong thing?

sage


[ceph-users] any workaround for FAILED assert(p != snapset.clones.end())

2015-01-13 Thread Luke Kao
Hello community,
We have a cluster using v0.80.5, and recently several OSDs go down with
the following error when removing an rbd snapshot:
osd/ReplicatedPG.cc: 2352: FAILED assert(p != snapset.clones.end())

After restarting those OSDs, they soon go down again with the same error.
It looks like this is linked to BUG #8629, but before upgrading to the
patched version, is there any workaround other than reformatting the
disks and recreating the OSDs?

Also a side question: I don't find this bug fix in the release notes of
v0.80.6 or v0.80.7, so should I assume the patch is not yet released?
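
One way to check is against the ceph git tree -- a sketch, where
<commit-sha> is a placeholder for the fix commit referenced in the
tracker issue:

    # list the release tags that already contain the fix commit
    git clone https://github.com/ceph/ceph.git && cd ceph
    git tag --contains <commit-sha> | grep '^v0.80'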

Thanks

BR,
Luke Kao
MYCOM-OSI







Re: [ceph-users] error adding OSD to crushmap

2015-01-13 Thread Jason King
Hi Luis,

Could you show us the output of *ceph osd tree*?

Jason

2015-01-12 20:45 GMT+08:00 Luis Periquito periqu...@gmail.com:

 Hi all,

 I've been trying to add a few new OSDs, and as I manage everything with
 puppet, it was adding them manually via the CLI.

 At one point it adds the OSD to the crush map using:

 # ceph osd crush add 6 0.0 root=default

 but I get
 Error ENOENT: osd.6 does not exist.  create it before updating the crush
 map

 If I read correctly, this should be the right command to add the
 OSD to the crush map...

 is this a bug? I'm running the latest firefly 0.80.7.

 thanks

 PS: I just edited the crushmap, but it would be a lot easier to do it
 via the CLI commands...
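
For what it's worth, the error message above seems to be asking for the
OSD id to be registered in the cluster map first; a sketch of that
sequence (the id and weight are just the values from the example):

    # allocate/register the osd id in the cluster first
    ceph osd create            # prints the newly allocated id, e.g. 6

    # only then add it to the CRUSH map with an initial weight
    ceph osd crush add 6 0.0 root=default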



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com