Hello Lukas, Unfortunately, I'm all out of ideas at the moment. There are some memory profiling techniques which can help identify what is causing the memory utilization, but it's a bit beyond what I typically work on. Others on the list may have experience with this (or otherwise have ideas) and may chip in...
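For reference, the memory profiling mentioned here is typically done with Ceph's built-in tcmalloc heap profiler, driven through 'ceph tell'. A rough sketch only, assuming osd.10 is the suspect OSD and that ceph-osd is linked against tcmalloc (the usual default); the dump file name and path below are illustrative and will vary:

    # start the heap profiler on the suspect OSD
    ceph tell osd.10 heap start_profiler

    # once memory has grown, write a heap dump and print a short summary
    ceph tell osd.10 heap dump
    ceph tell osd.10 heap stats

    # stop profiling when finished
    ceph tell osd.10 heap stop_profiler

    # inspect the dump with the gperftools pprof tool (example path)
    google-pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.10.profile.0001.heap

The profiler adds some overhead of its own, so it is usually left running only long enough to capture one growth episode.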
Wish I could be more help.. Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat On Thu, Oct 30, 2014 at 11:00 AM, Lukáš Kubín <lukas.ku...@gmail.com> wrote: > Thanks Michael, still no luck. > > Taking the problematic OSD.10 down has no effect. Within minutes, more > OSDs fail with the same issue after consuming ~50GB of memory. Also, I can see > two of those cache-tier OSDs on separate hosts which remain at almost > 200% CPU utilization all the time. > > I've upgraded the whole cluster to 0.80.7. It did not help. > > I have also tried to unset the norecover+nobackfill flags to let the > recovery complete. No luck; several OSDs fail with the same issue, > preventing the recovery from completing. I've performed your fix steps from the > start again and currently I'm past the "unset noin" step. > > I could temporarily get some of the pools to a state with no degraded objects. > Then (within minutes) some OSD fails and it's degraded again. > > I have also tried to let the OSD processes get restarted automatically to > keep them up as much as possible. > > I'm considering disabling the tiering pool "volumes-cache", as that's something > I can do without:
>
> pool name         category     KB           objects      clones    degraded
> backups           -            0            0            0         0
> data              -            0            0            0         0
> images            -            777989590    95027        0         8883
> metadata          -            0            0            0         0
> rbd               -            0            0            0         0
> volumes           -            115608693    25965        179       3307
> volumes-cache     -            649577103    16708730     9894      1144650
>
> Can I just switch it into forward mode and let it empty > (cache-flush-evict-all) to see if that changes anything? > > Could you or any of your colleagues provide anything else to try? > > Thank you, > > Lukas > > > On Thu, Oct 30, 2014 at 3:05 PM, Michael J. Kidd <michael.k...@inktank.com > > wrote: > >> Hello Lukas, >> The 'slow request' logs are expected while the cluster is in such a >> state.. the OSD processes simply aren't able to respond quickly to client >> IO requests. >> >> I would recommend trying to recover without the most problematic disk ( >> seems to be OSD.10? ).. Simply shut it down and see if the other OSDs >> settle down. You should also take a look at the kernel logs for any >> indications of a problem with the disks themselves, or possibly do an FIO >> test against the drive with the OSD shut down (to a file on the OSD >> filesystem, not the raw drive.. this would be destructive). >> >> Also, you could upgrade to 0.80.7. There are some bug fixes, but I'm not >> sure if any would specifically help this situation.. not likely to hurt, >> though. >> >> The desired state is for the cluster to be steady-state before the next >> move (unsetting the next flag). Hopefully this can be achieved without >> needing to take down OSDs on multiple hosts. >> >> I'm also unsure about the cache tiering and how it could relate to the >> load being seen. >> >> Hope this helps... >> >> Michael J. Kidd >> Sr.
Storage Consultant >> Inktank Professional Services >> - by Red Hat >> >> On Thu, Oct 30, 2014 at 4:00 AM, Lukáš Kubín <lukas.ku...@gmail.com> >> wrote: >> >>> Hi, >>> I've noticed that the following messages always accumulate in the OSD log before >>> it exhausts all memory:
>>>
>>> 2014-10-30 08:48:42.994190 7f80a2019700 0 log [WRN] : slow request 38.901192 seconds old, received at 2014-10-30 08:48:04.092889: osd_op(osd.29.3076:207644827 rbd_data.2e4ee3ba663be.000000000000363b@17 [copy-get max 8388608] 7.af87e887 ack+read+ignore_cache+ignore_overlay+map_snap_clone e3359) v4 currently reached pg
>>>
>>> Note this is always from the most frequently failing osd.10 (sata tier), >>> referring to osd.29 (ssd cache tier). That osd.29 is consuming huge CPU and >>> memory resources, but keeps running without failures. >>> >>> Could this be e.g. a bug? Or some erroneous I/O request which initiated >>> this behaviour? Can I e.g. attempt to upgrade Ceph to a more recent >>> release while the cluster is in its current unhealthy state? Can I e.g. try >>> disabling the caching tier? Or just somehow evacuate the problematic OSD? >>> >>> I'd welcome any ideas. Currently, I'm keeping osd.10 in an >>> automatic restart loop with a 60-second pause before starting it again. >>> >>> Thanks and greetings, >>> >>> Lukas >>> >>> On Wed, Oct 29, 2014 at 8:04 PM, Lukáš Kubín <lukas.ku...@gmail.com> >>> wrote: >>> >>>> I should have figured that out myself since I did that recently. Thanks. >>>> >>>> Unfortunately, I'm still at the step "ceph osd unset noin". After >>>> setting all the OSDs in, the original issue reappears, preventing me from >>>> proceeding with recovery. It now appears mostly at a single OSD - osd.10 - which >>>> consumes ~200% CPU and all memory within 45 seconds and is then killed by Linux:
>>>>
>>>> Oct 29 18:24:38 q09 kernel: Out of memory: Kill process 17202 (ceph-osd) score 912 or sacrifice child
>>>> Oct 29 18:24:38 q09 kernel: Killed process 17202, UID 0, (ceph-osd) total-vm:62713176kB, anon-rss:62009772kB, file-rss:328kB
>>>>
>>>> I've tried to restart it several times with the same result. Similar >>>> situation with OSDs 0 and 13. >>>> >>>> Also, I've noticed one of the SSD cache tier's OSDs - osd.29 - generating high >>>> CPU utilization, around 180%. >>>> >>>> All the problematic OSDs have been the same ones all the time - OSDs >>>> 0, 8, 10, 13 and 29 - they are the ones which I found to be down this morning.
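Regarding the question earlier in the thread about switching the cache tier to forward mode and emptying it: the usual sequence is roughly the one sketched below, using the pool names from this thread (volumes as the base pool, volumes-cache as the cache tier). This is only a sketch, not validated against this cluster's current state, and flushing roughly 16 million cached objects will itself generate significant load:

    # stop caching new writes; client IO is forwarded to the base pool
    ceph osd tier cache-mode volumes-cache forward

    # flush dirty objects and evict everything from the cache pool
    rados -p volumes-cache cache-flush-evict-all

    # only once the cache pool is empty, optionally detach it completely
    ceph osd tier remove-overlay volumes
    ceph osd tier remove volumes volumes-cache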
>>>> >>>> There is some minor load coming from client - Openstack instances, I >>>> preferred not to kill them: >>>> >>>> [root@q04 ceph-recovery]# ceph -s >>>> cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99 >>>> health HEALTH_ERR 31 pgs backfill; 241 pgs degraded; 62 pgs down; >>>> 193 pgs incomplete; 13 pgs inconsistent; 62 pgs peering; 12 pgs recovering; >>>> 205 pgs recovery_wait; 93 pgs stuck inactive; 608 pgs stuck unclean; 381138 >>>> requests are blocked > 32 sec; recovery 1162468/35207488 objects degraded >>>> (3.302%); 466/17112963 unfound (0.003%); 13 scrub errors; 1/34 in osds are >>>> down; nobackfill,norecover,noscrub,nodeep-scrub flag(s) set >>>> monmap e2: 3 mons at {q03= >>>> 10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0}, >>>> election epoch 92, quorum 0,1,2 q03,q04,q05 >>>> osdmap e2782: 34 osds: 33 up, 34 in >>>> flags nobackfill,norecover,noscrub,nodeep-scrub >>>> pgmap v7440374: 5632 pgs, 7 pools, 1449 GB data, 16711 kobjects >>>> 3148 GB used, 15010 GB / 18158 GB avail >>>> 1162468/35207488 objects degraded (3.302%); 466/17112963 >>>> unfound (0.003%) >>>> 13 active >>>> 22 active+recovery_wait+remapped >>>> 1 active+recovery_wait+inconsistent >>>> 4794 active+clean >>>> 193 incomplete >>>> 62 down+peering >>>> 9 active+degraded+remapped+wait_backfill >>>> 182 active+recovery_wait >>>> 74 active+remapped >>>> 12 active+recovering >>>> 12 active+clean+inconsistent >>>> 22 active+remapped+wait_backfill >>>> 4 active+clean+replay >>>> 232 active+degraded >>>> client io 0 B/s rd, 1048 kB/s wr, 184 op/s >>>> >>>> >>>> Below I'm sending the requested output. >>>> >>>> Do you have any other ideas how to recover from this? >>>> >>>> Thanks a lot. >>>> >>>> Lukas >>>> >>>> >>>> >>>> >>>> [root@q04 ceph-recovery]# ceph osd crush rule dump >>>> [ >>>> { "rule_id": 0, >>>> "rule_name": "replicated_ruleset", >>>> "ruleset": 0, >>>> "type": 1, >>>> "min_size": 1, >>>> "max_size": 10, >>>> "steps": [ >>>> { "op": "take", >>>> "item": -1, >>>> "item_name": "default"}, >>>> { "op": "chooseleaf_firstn", >>>> "num": 0, >>>> "type": "host"}, >>>> { "op": "emit"}]}, >>>> { "rule_id": 1, >>>> "rule_name": "ssd", >>>> "ruleset": 1, >>>> "type": 1, >>>> "min_size": 1, >>>> "max_size": 10, >>>> "steps": [ >>>> { "op": "take", >>>> "item": -5, >>>> "item_name": "ssd"}, >>>> { "op": "chooseleaf_firstn", >>>> "num": 0, >>>> "type": "host"}, >>>> { "op": "emit"}]}, >>>> { "rule_id": 2, >>>> "rule_name": "sata", >>>> "ruleset": 2, >>>> "type": 1, >>>> "min_size": 1, >>>> "max_size": 10, >>>> "steps": [ >>>> { "op": "take", >>>> "item": -4, >>>> "item_name": "sata"}, >>>> { "op": "chooseleaf_firstn", >>>> "num": 0, >>>> "type": "host"}, >>>> { "op": "emit"}]}] >>>> >>>> [root@q04 ceph-recovery]# ceph osd dump | grep pool >>>> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 2 object_hash >>>> rjenkins pg_num 512 pgp_num 512 last_change 630 flags hashpspool >>>> crash_replay_interval 45 stripe_width 0 >>>> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 2 >>>> object_hash rjenkins pg_num 512 pgp_num 512 last_change 632 flags >>>> hashpspool stripe_width 0 >>>> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash >>>> rjenkins pg_num 512 pgp_num 512 last_change 634 flags hashpspool >>>> stripe_width 0 >>>> pool 7 'volumes' replicated size 2 min_size 2 crush_ruleset 0 >>>> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1517 flags >>>> hashpspool tiers 14 read_tier 14 write_tier 14 stripe_width 0 >>>> pool 8 'images' replicated 
size 2 min_size 2 crush_ruleset 0 >>>> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1519 flags >>>> hashpspool stripe_width 0 >>>> pool 12 'backups' replicated size 2 min_size 1 crush_ruleset 0 >>>> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 862 flags >>>> hashpspool stripe_width 0 >>>> pool 14 'volumes-cache' replicated size 2 min_size 1 crush_ruleset 1 >>>> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1517 flags >>>> hashpspool tier_of 7 cache_mode writeback target_bytes 1000000000000 >>>> hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} >>>> 3600s x1 stripe_width 0 >>>> >>>> On Wed, Oct 29, 2014 at 6:43 PM, Michael J. Kidd < >>>> michael.k...@inktank.com> wrote: >>>> >>>>> Ah, sorry... since they were set out manually, they'll need to be set >>>>> in manually.. >>>>> >>>>> for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd >>>>> in $i; done >>>>> >>>>> >>>>> >>>>> Michael J. Kidd >>>>> Sr. Storage Consultant >>>>> Inktank Professional Services >>>>> - by Red Hat >>>>> >>>>> On Wed, Oct 29, 2014 at 12:33 PM, Lukáš Kubín <lukas.ku...@gmail.com> >>>>> wrote: >>>>> >>>>>> I've ended up at the step "ceph osd unset noin". My OSDs are up, but not >>>>>> in, even after an hour: >>>>>> >>>>>> [root@q04 ceph-recovery]# ceph osd stat >>>>>> osdmap e2602: 34 osds: 34 up, 0 in >>>>>> flags nobackfill,norecover,noscrub,nodeep-scrub >>>>>> >>>>>> >>>>>> There seems to be no activity generated by the OSD processes; >>>>>> occasionally they show 0.3%, which I believe is just some basic >>>>>> communication processing. No load on the network interfaces. >>>>>> >>>>>> Is there some other step needed to bring the OSDs in? >>>>>> >>>>>> Thank you. >>>>>> >>>>>> Lukas >>>>>> >>>>>> On Wed, Oct 29, 2014 at 3:58 PM, Michael J. Kidd < >>>>>> michael.k...@inktank.com> wrote: >>>>>> >>>>>>> Hello Lukas, >>>>>>> Please try the following process for getting all your OSDs up and >>>>>>> operational... >>>>>>> >>>>>>> * Set the following flags: noup, noin, noscrub, nodeep-scrub, >>>>>>> norecover, nobackfill >>>>>>> for i in noup noin noscrub nodeep-scrub norecover nobackfill; do >>>>>>> ceph osd set $i; done >>>>>>> >>>>>>> * Stop all OSDs (I know, this seems counterproductive) >>>>>>> * Set all OSDs down / out >>>>>>> for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd >>>>>>> down $i; ceph osd out $i; done >>>>>>> * Set recovery / backfill throttles as well as heartbeat and OSD map >>>>>>> processing tweaks in the /etc/ceph/ceph.conf file under the [osd] >>>>>>> section: >>>>>>> [osd] >>>>>>> osd_max_backfills = 1 >>>>>>> osd_recovery_max_active = 1 >>>>>>> osd_recovery_max_single_start = 1 >>>>>>> osd_backfill_scan_min = 8 >>>>>>> osd_heartbeat_interval = 36 >>>>>>> osd_heartbeat_grace = 240 >>>>>>> osd_map_message_max = 1000 >>>>>>> osd_map_cache_size = 3136 >>>>>>> >>>>>>> * Start all OSDs >>>>>>> * Monitor 'top' for 0% CPU on all OSD processes.. it may take a >>>>>>> while.. I usually issue 'top' and then the keys M c >>>>>>> - M = Sort by memory usage >>>>>>> - c = Show command arguments >>>>>>> - This makes it easy to monitor the OSD processes and see which OSDs >>>>>>> have settled, etc.. >>>>>>> * Once all OSDs have hit 0% CPU utilization, remove the 'noup' flag >>>>>>> - ceph osd unset noup >>>>>>> * Again, wait for 0% CPU utilization (may be immediate, may take a >>>>>>> while..
just gotta wait) >>>>>>> * Once all OSDs have hit 0% CPU again, remove the 'noin' flag >>>>>>> - ceph osd unset noin >>>>>>> - All OSDs should now appear up/in, and will go through peering.. >>>>>>> * Once ceph -s shows no further activity, and OSDs are back at 0% >>>>>>> CPU again, unset 'nobackfill' >>>>>>> - ceph osd unset nobackfill >>>>>>> * Once ceph -s shows no further activity, and OSDs are back at 0% >>>>>>> CPU again, unset 'norecover' >>>>>>> - ceph osd unset norecover >>>>>>> * Monitor OSD memory usage... some OSDs may get killed off again, >>>>>>> but their subsequent restart should consume less memory and allow more >>>>>>> recovery to occur between each step above.. and ultimately, hopefully... >>>>>>> your entire cluster will come back online and be usable. >>>>>>> >>>>>>> ## Clean-up: >>>>>>> * Remove all of the above set options from ceph.conf >>>>>>> * Reset the running OSDs to their defaults: >>>>>>> ceph tell osd.\* injectargs '--osd_max_backfills 10 >>>>>>> --osd_recovery_max_active 15 --osd_recovery_max_single_start 5 >>>>>>> --osd_backfill_scan_min 64 --osd_heartbeat_interval 6 >>>>>>> --osd_heartbeat_grace >>>>>>> 36 --osd_map_message_max 100 --osd_map_cache_size 500' >>>>>>> * Unset the noscrub and nodeep-scrub flags: >>>>>>> - ceph osd unset noscrub >>>>>>> - ceph osd unset nodeep-scrub >>>>>>> >>>>>>> ## For help identifying why memory usage was so high, please provide: >>>>>>> * ceph osd dump | grep pool >>>>>>> * ceph osd crush rule dump >>>>>>> >>>>>>> Let us know if this helps... I know it looks extreme, but it's >>>>>>> worked for me in the past.. >>>>>>> >>>>>>> Michael J. Kidd >>>>>>> Sr. Storage Consultant >>>>>>> Inktank Professional Services >>>>>>> - by Red Hat >>>>>>> >>>>>>> On Wed, Oct 29, 2014 at 8:51 AM, Lukáš Kubín <lukas.ku...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hello, >>>>>>>> I've found my ceph v0.80.3 cluster in a state with 5 of 34 OSDs >>>>>>>> being down through the night after months of running without change. From >>>>>>>> the Linux >>>>>>>> logs I found out that the OSD processes were killed because they consumed >>>>>>>> all >>>>>>>> available memory. >>>>>>>> >>>>>>>> Those 5 failed OSDs were from different hosts of my 4-node cluster >>>>>>>> (see below). Two hosts act as an SSD cache tier for some of my pools. The >>>>>>>> other >>>>>>>> two hosts are the default rotational-drive storage. >>>>>>>> >>>>>>>> After checking that Linux was not out of memory, I attempted to >>>>>>>> restart those failed OSDs. Most of those OSD daemons exhausted all memory
Most of those OSD daemon exhaust all memory >>>>>>>> in >>>>>>>> seconds and got killed by Linux again: >>>>>>>> >>>>>>>> Oct 28 22:16:34 q07 kernel: Out of memory: Kill process 24207 >>>>>>>> (ceph-osd) score 867 or sacrifice child >>>>>>>> Oct 28 22:16:34 q07 kernel: Killed process 24207, UID 0, (ceph-osd) >>>>>>>> total-vm:59974412kB, anon-rss:59076880kB, file-rss:512kB >>>>>>>> >>>>>>>> >>>>>>>> On the host I've found lots of similar "slow request" messages >>>>>>>> preceding the crash: >>>>>>>> >>>>>>>> 2014-10-28 22:11:20.885527 7f25f84d1700 0 log [WRN] : slow request >>>>>>>> 31.117125 seconds old, received at 2014-10-28 22:10:49.768291: >>>>>>>> osd_sub_op(client.168752.0:2197931 14.2c7 >>>>>>>> 888596c7/rbd_data.293272f8695e4.000000000000006f/head//14 [] v >>>>>>>> 1551'377417 >>>>>>>> snapset=0=[]:[] snapc=0=[]) v10 currently no flag points reached >>>>>>>> 2014-10-28 22:11:21.885668 7f25f84d1700 0 log [WRN] : 67 slow >>>>>>>> requests, 1 included below; oldest blocked for > 9879.304770 secs >>>>>>>> >>>>>>>> >>>>>>>> Apparently I can't get the cluster fixed by restarting the OSDs all >>>>>>>> over again. Is there any other option then? >>>>>>>> >>>>>>>> Thank you. >>>>>>>> >>>>>>>> Lukas Kubin >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> [root@q04 ~]# ceph -s >>>>>>>> cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99 >>>>>>>> health HEALTH_ERR 9 pgs backfill; 1 pgs backfilling; 521 pgs >>>>>>>> degraded; 425 pgs incomplete; 13 pgs inconsistent; 20 pgs recovering; >>>>>>>> 50 >>>>>>>> pgs recovery_wait; 151 pgs stale; 425 pgs stuck inactive; 151 pgs stuck >>>>>>>> stale; 1164 pgs stuck unclean; 12070270 requests are blocked > 32 sec; >>>>>>>> recovery 887322/35206223 objects degraded (2.520%); 119/17131232 >>>>>>>> unfound >>>>>>>> (0.001%); 13 scrub errors >>>>>>>> monmap e2: 3 mons at {q03= >>>>>>>> 10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0}, >>>>>>>> election epoch 90, quorum 0,1,2 q03,q04,q05 >>>>>>>> osdmap e2194: 34 osds: 31 up, 31 in >>>>>>>> pgmap v7429812: 5632 pgs, 7 pools, 1446 GB data, 16729 >>>>>>>> kobjects >>>>>>>> 2915 GB used, 12449 GB / 15365 GB avail >>>>>>>> 887322/35206223 objects degraded (2.520%); 119/17131232 >>>>>>>> unfound (0.001%) >>>>>>>> 38 active+recovery_wait+remapped >>>>>>>> 4455 active+clean >>>>>>>> 65 stale+incomplete >>>>>>>> 3 active+recovering+remapped >>>>>>>> 359 incomplete >>>>>>>> 12 active+recovery_wait >>>>>>>> 139 active+remapped >>>>>>>> 86 stale+active+degraded >>>>>>>> 16 active+recovering >>>>>>>> 1 active+remapped+backfilling >>>>>>>> 13 active+clean+inconsistent >>>>>>>> 9 active+remapped+wait_backfill >>>>>>>> 434 active+degraded >>>>>>>> 1 remapped+incomplete >>>>>>>> 1 active+recovering+degraded+remapped >>>>>>>> client io 0 B/s rd, 469 kB/s wr, 48 op/s >>>>>>>> >>>>>>>> [root@q04 ~]# ceph osd tree >>>>>>>> # id weight type name up/down reweight >>>>>>>> -5 3.24 root ssd >>>>>>>> -6 1.62 host q06 >>>>>>>> 16 0.18 osd.16 up 1 >>>>>>>> 17 0.18 osd.17 up 1 >>>>>>>> 18 0.18 osd.18 up 1 >>>>>>>> 19 0.18 osd.19 up 1 >>>>>>>> 20 0.18 osd.20 up 1 >>>>>>>> 21 0.18 osd.21 up 1 >>>>>>>> 22 0.18 osd.22 up 1 >>>>>>>> 23 0.18 osd.23 up 1 >>>>>>>> 24 0.18 osd.24 up 1 >>>>>>>> -7 1.62 host q07 >>>>>>>> 25 0.18 osd.25 up 1 >>>>>>>> 26 0.18 osd.26 up 1 >>>>>>>> 27 0.18 osd.27 up 1 >>>>>>>> 28 0.18 osd.28 up 1 >>>>>>>> 29 0.18 osd.29 up 1 >>>>>>>> 30 0.18 osd.30 up 1 >>>>>>>> 31 0.18 osd.31 up 1 >>>>>>>> 32 0.18 osd.32 up 1 >>>>>>>> 33 0.18 osd.33 up 1 >>>>>>>> -1 14.56 root default >>>>>>>> -4 14.56 root sata >>>>>>>> -2 7.28 
host q08 >>>>>>>> 0 0.91 osd.0 up 1 >>>>>>>> 1 0.91 osd.1 up 1 >>>>>>>> 2 0.91 osd.2 up 1 >>>>>>>> 3 0.91 osd.3 up 1 >>>>>>>> 11 0.91 osd.11 up 1 >>>>>>>> 12 0.91 osd.12 up 1 >>>>>>>> 13 0.91 osd.13 down 0 >>>>>>>> 14 0.91 osd.14 up 1 >>>>>>>> -3 7.28 host q09 >>>>>>>> 4 0.91 osd.4 up 1 >>>>>>>> 5 0.91 osd.5 up 1 >>>>>>>> 6 0.91 osd.6 up 1 >>>>>>>> 7 0.91 osd.7 up 1 >>>>>>>> 8 0.91 osd.8 down 0 >>>>>>>> 9 0.91 osd.9 up 1 >>>>>>>> 10 0.91 osd.10 down 0 >>>>>>>> 15 0.91 osd.15 up 1 >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> ceph-users mailing list >>>>>>>> ceph-users@lists.ceph.com >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
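A side note on the recurring "slow request ... currently reached pg" entries quoted above: blocked and recently completed ops can usually be inspected live through the OSD admin socket, which sometimes shows which step an op is stuck on. A rough sketch, run on the host carrying the OSD; osd.10 and the socket path are just the example id and the usual default location:

    # ops currently blocked inside the OSD
    ceph daemon osd.10 dump_ops_in_flight

    # recently completed slow ops, with per-event timestamps
    ceph daemon osd.10 dump_historic_ops

    # equivalent form via the admin socket path, if 'ceph daemon' is unavailable
    ceph --admin-daemon /var/run/ceph/ceph-osd.10.asok dump_historic_ops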
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com