Re: [ceph-users] NUMA and ceph ... zone_reclaim_mode
On 01/12/2015 07:47 AM, Dan van der Ster wrote:

(resending to list)

Hi Kyle,

I'd like to +10 this old proposal of yours. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down..."). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logs I saw that when these OSDs are getting marked down, even the osd tick message is not printed for 30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin were running drop_caches on our live machines. Here is what we saw:

https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth of cached pages. That server has 20 OSDs; each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that's used for Cinder RBD volumes (and doesn't suffer from the freezing OSD problem):

https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly the 10x increase in the number of files on the radosgw OSDs appears to be causing a problem. In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem.
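The cache-drop correlation described above is easy to watch for with a trivial sampler (not from the original post; a minimal sketch -- the suggested log path is made up):

```shell
# Minimal sketch: log cached memory over time so sudden drops can be
# matched against OSD freeze windows in the ceph logs.
sample_cached() {
    # one line per call: epoch seconds, Cached in kB (from /proc/meminfo)
    printf '%s %s\n' "$(date +%s)" "$(awk '/^Cached:/ {print $2}' /proc/meminfo)"
}

# e.g. run from cron or a loop: sample_cached >> /var/log/cached-mem.log
sample_cached
```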
It seems that scrubbing is creating quite some memory pressure (via the inode cache, especially), so I started testing different vfs_cache_pressure values (1, 10, 1000, 1). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale up when these OSDs are more full (they're only around 1% full now!!)

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.:

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed, all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm doc now says:

    zone_reclaim_mode is disabled by default. For file servers or workloads
    that benefit from having their data cached, zone_reclaim_mode should be
    left disabled as the caching effect is likely to be more important than
    data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away. Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th:

https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used the numactl --interleave=all ceph command on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.

Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does).
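The MongoDB-style startup check suggested above could be sketched in a few lines of shell (hypothetical -- this is not anything Ceph ships; the warning text is made up):

```shell
# Sketch of a startup check like MongoDB's: warn if zone_reclaim_mode is
# enabled. The sysctl file path is a parameter so the check is testable.
check_zone_reclaim() {
    zr_file="$1"
    if [ -r "$zr_file" ] && [ "$(cat "$zr_file")" != "0" ]; then
        echo "WARNING: zone_reclaim_mode != 0; OSDs may freeze under memory pressure"
        echo "fix: sysctl -w vm.zone_reclaim_mode=0"
        return 1
    fi
    return 0
}

# On a real OSD host:
check_zone_reclaim /proc/sys/vm/zone_reclaim_mode || true
```

To make the fix persistent across reboots, vm.zone_reclaim_mode = 0 would go in /etc/sysctl.conf.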
This line from the commit which disables it in the kernel is pretty wise, IMHO:

    On current machines and workloads it is often the case that
    zone_reclaim_mode destroys performance but not all users know how to
    detect this. Favour the common case and disable it by default.

Interestingly, I was seeing behavior that looked like this as well, though it manifested as OSDs going down with internal heartbeat timeouts during heavy 4K read/write benchmarks. I was able to observe major page faults on the OSD nodes correlated with the frozen period, despite significant amounts of memory used for buffer cache. I also went down the vfs_cache_pressure path; changing it to 1 fixed the issue. I didn't think to go back and look at zone_reclaim_mode though!

Since then the system has been upgraded to Fedora 21 with a 3.17 kernel and the issue no longer manifested itself. I suspect now that this is due to the new default behavior. Perhaps this solves the puzzle. What is interesting is that, at least on our test node, this behavior didn't occur prior to firefly. It may be that some change we made exacerbated the problem.

Anyway, thank you for the excellent analysis Dan!

Mark
[ceph-users] ssd osd fails often with FAILED assert(soid < scrubber.start || soid >= scrubber.end)
Hi,
since last Thursday we have had an ssd-pool (cache tier) in front of an ec-pool, and we are filling the pools with data via rsync (approx. 50MB/s). The ssd-pool has three disks, and one of them (a DC S3700) has failed four times since then. I simply start the osd again, the pool gets rebuilt, and it works again for some hours up to some days. I switched the ceph-node and the ssh-adapter, but this didn't solve the issue. There weren't any messages in syslog/messages and an fsck ran without trouble, so I guess the problem is not OS-related.

I found this issue http://tracker.ceph.com/issues/8747 but my ceph version is newer (debian: ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)), and it looks like I can reproduce this issue within 1-3 days. The osd is ext4-formatted. All other OSDs (62) run without trouble.

# more ceph-osd.61.log
2015-01-13 16:29:26.494458 7fedf9a3d700  0 log [INF] : 17.0 scrub ok
2015-01-13 17:29:03.988530 7fedf823a700  0 log [INF] : 17.16 scrub ok
2015-01-13 17:30:31.901032 7fedf8a3b700  0 log [INF] : 17.18 scrub ok
2015-01-13 17:31:58.983736 7fedf823a700  0 log [INF] : 17.9 scrub ok
2015-01-13 17:32:30.780308 7fedf9a3d700  0 log [INF] : 17.c scrub ok
2015-01-13 17:32:33.311433 7fedf8a3b700  0 log [INF] : 17.11 scrub ok
2015-01-13 17:37:22.237214 7fedf9a3d700  0 log [INF] : 17.7 scrub ok
2015-01-13 20:15:07.874376 7fedf6236700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7fedf6236700 time 2015-01-13 20:15:07.853440
osd/ReplicatedPG.cc: 5306: FAILED assert(soid < scrubber.start || soid >= scrubber.end)
 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x1320) [0x9296b0]
 2: (ReplicatedPG::try_flush_mark_clean(boost::shared_ptr<ReplicatedPG::FlushOp>)+0x5f6) [0x92b076]
 3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x296) [0x92b876]
 4: (C_Flush::finish(int)+0x86) [0x986226]
 5: (Context::complete(int)+0x9)
[0x78f449]
 6: (Finisher::finisher_thread_entry()+0x1c8) [0xad5a18]
 7: (()+0x6b50) [0x7fee152f6b50]
 8: (clone()+0x6d) [0x7fee13f047bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -70> 2015-01-11 19:54:47.962164 7fee15dd4780  5 asok(0x2f56230) register_command perfcounters_dump hook 0x2f44010
   -69> 2015-01-11 19:54:47.962190 7fee15dd4780  5 asok(0x2f56230) register_command 1 hook 0x2f44010
   -68> 2015-01-11 19:54:47.962195 7fee15dd4780  5 asok(0x2f56230) register_command perf dump hook 0x2f44010
   -67> 2015-01-11 19:54:47.962201 7fee15dd4780  5 asok(0x2f56230) register_command perfcounters_schema hook 0x2f44010
   -66> 2015-01-11 19:54:47.962203 7fee15dd4780  5 asok(0x2f56230) register_command 2 hook 0x2f44010
   -65> 2015-01-11 19:54:47.962207 7fee15dd4780  5 asok(0x2f56230) register_command perf schema hook 0x2f44010
   -64> 2015-01-11 19:54:47.962209 7fee15dd4780  5 asok(0x2f56230) register_command config show hook 0x2f44010
   -63> 2015-01-11 19:54:47.962214 7fee15dd4780  5 asok(0x2f56230) register_command config set hook 0x2f44010
   -62> 2015-01-11 19:54:47.962219 7fee15dd4780  5 asok(0x2f56230) register_command config get hook 0x2f44010
   -61> 2015-01-11 19:54:47.962223 7fee15dd4780  5 asok(0x2f56230) register_command log flush hook 0x2f44010
   -60> 2015-01-11 19:54:47.962226 7fee15dd4780  5 asok(0x2f56230) register_command log dump hook 0x2f44010
   -59> 2015-01-11 19:54:47.962229 7fee15dd4780  5 asok(0x2f56230) register_command log reopen hook 0x2f44010
   -58> 2015-01-11 19:54:47.965000 7fee15dd4780  0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-osd, pid 11735
   -57> 2015-01-11 19:54:47.967362 7fee15dd4780  1 finished global_init_daemonize
   -56> 2015-01-11 19:54:47.971666 7fee15dd4780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features: FIEMAP ioctl is supported and appears to work
   -55> 2015-01-11 19:54:47.971682 7fee15dd4780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-61)
detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
   -54> 2015-01-11 19:54:47.973281 7fee15dd4780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features: syscall(SYS_syncfs, fd) fully supported
   -53> 2015-01-11 19:54:47.975393 7fee15dd4780  0 filestore(/var/lib/ceph/osd/ceph-61) limited size xattrs
   -52> 2015-01-11 19:54:48.013905 7fee15dd4780  0 filestore(/var/lib/ceph/osd/ceph-61) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
   -51> 2015-01-11 19:54:49.245360 7fee15dd4780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features: FIEMAP ioctl is supported and appears to work
   -50> 2015-01-11 19:54:49.245370 7fee15dd4780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-61) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
   -49> 2015-01-11 19:54:49.247017 7fee15dd4780  0
Re: [ceph-users] CRUSH question - failing to rebalance after failure test
On Mon, 12 Jan 2015, Christopher Kunz wrote:

Hi,

[redirecting back to list]

Oh, it could be that... can you include the output from 'ceph osd tree'? That's a more concise view that shows up/down, weight, and in/out. Thanks!
sage

root@cepharm17:~# ceph osd tree
# id    weight  type name               up/down reweight
-1      0.52    root default
-21     0.16            chassis board0
-2      0.032                   host cepharm11
0       0.032                           osd.0   up      1
-3      0.032                   host cepharm12
1       0.032                           osd.1   up      1
-4      0.032                   host cepharm13
2       0.032                           osd.2   up      1
-5      0.032                   host cepharm14
3       0.032                           osd.3   up      1
-6      0.032                   host cepharm16
4       0.032                           osd.4   up      1
-22     0.18            chassis board1
-7      0.03                    host cepharm18
5       0.03                            osd.5   up      1
-8      0.03                    host cepharm19
6       0.03                            osd.6   up      1
-9      0.03                    host cepharm20
7       0.03                            osd.7   up      1
-10     0.03                    host cepharm21
8       0.03                            osd.8   up      1
-11     0.03                    host cepharm22
9       0.03                            osd.9   up      1
-12     0.03                    host cepharm23
10      0.03                            osd.10  up      1
-23     0.18            chassis board2
-13     0.03                    host cepharm25
11      0.03                            osd.11  up      1
-14     0.03                    host cepharm26
12      0.03                            osd.12  up      1
-15     0.03                    host cepharm27
13      0.03                            osd.13  up      1
-16     0.03                    host cepharm28
14      0.03                            osd.14  up      1
-17     0.03                    host cepharm29
15      0.03                            osd.15  up      1
-18     0.03                    host cepharm30
16      0.03                            osd.16  up      1

Okay, it sounds like something is not quite right then. Can you attach the OSDMap once it is in the not-quite-repaired state? And/or try setting 'ceph osd crush tunables optimal' and see if that has any effect?

I am working on one of these boxes: http://www.ambedded.com.tw/pt_spec.php?P_ID=20141109001 So, each chassis is one 7-node board (with a shared 1GbE switch and shared electrical supply), and I figured each board is definitely a separate failure domain.

Cute! That kind of looks like 3 sleds of 7 in one chassis though? Or am I looking at the wrong thing?

sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
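For reference, Sage's two suggestions map onto commands roughly like these (a sketch only -- both need a live cluster, the output path is arbitrary, and switching tunables profiles can trigger significant data movement):

```shell
# dump the osdmap in its not-quite-repaired state, for attaching to the thread
ceph osd getmap -o /tmp/osdmap

# try the optimal CRUSH tunables profile (may rebalance a lot of data)
ceph osd crush tunables optimal
```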
[ceph-users] any workaround for FAILED assert(p != snapset.clones.end())
Hello community,

We have a cluster using v0.80.5, and recently several OSDs go down with this error when removing an rbd snapshot:

osd/ReplicatedPG.cc: 2352: FAILED assert(p != snapset.clones.end())

and after restarting those OSDs, they go down again soon with the same error. It looks like it is linked to bug #8629, but before upgrading to the patched version, is there any workaround other than reformatting the disk and recreating the OSDs?

Also a side question: I don't find this bug fix in the release notes of v0.80.6 or v0.80.7, so should I assume the patch is not yet released?

Thanks

BR,
Luke Kao
MYCOM-OSI
Re: [ceph-users] error adding OSD to crushmap
Hi Luis,

Could you show us the output of *ceph osd tree*?

Jason

2015-01-12 20:45 GMT+08:00 Luis Periquito periqu...@gmail.com:

Hi all,

I've been trying to add a few new OSDs, and as I manage everything with puppet, I was adding them manually via the CLI. At one point it adds the OSD to the crush map using:

# ceph osd crush add 6 0.0 root=default

but I get

Error ENOENT: osd.6 does not exist. create it before updating the crush map

If I read correctly, this command should be the correct one to add the OSD to the crush map... is this a bug? I'm running the latest firefly, 0.80.7.

thanks

PS: I just edited the crushmap, but it would make it a lot easier to do it via the CLI commands...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
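For what it's worth, the ENOENT message itself hints at the expected order: the OSD id has to be registered in the cluster map before the CRUSH step. A sketch of the sequence (run against a live cluster, so shown here untested):

```shell
# register a new OSD id in the cluster map first; prints the allocated id (e.g. 6)
ceph osd create

# only then place it in the CRUSH map -- the command from the post above
ceph osd crush add osd.6 0.0 root=default
```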