Ah, sorry... since they were set out manually, they'll need to be set back in manually:
for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd in $i; done

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services - by Red Hat

On Wed, Oct 29, 2014 at 12:33 PM, Lukáš Kubín <lukas.ku...@gmail.com> wrote:

> I've ended up at step "ceph osd unset noin". My OSDs are up, but not in,
> even after an hour:
>
> [root@q04 ceph-recovery]# ceph osd stat
>      osdmap e2602: 34 osds: 34 up, 0 in
>             flags nobackfill,norecover,noscrub,nodeep-scrub
>
> There seems to be no activity generated by the OSD processes; occasionally
> they show 0.3% CPU, which I believe is just some basic communication
> processing. No load on the network interfaces either.
>
> Is there some other step needed to bring the OSDs in?
>
> Thank you.
>
> Lukas
>
> On Wed, Oct 29, 2014 at 3:58 PM, Michael J. Kidd <michael.k...@inktank.com> wrote:
>
>> Hello Lukas,
>>   Please try the following process for getting all your OSDs up and
>> operational...
>>
>> * Set the following flags: noup, noin, noscrub, nodeep-scrub, norecover,
>>   nobackfill:
>>     for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph osd set $i; done
>> * Stop all OSDs (I know, this seems counterproductive)
>> * Set all OSDs down / out:
>>     for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd down $i; ceph osd out $i; done
>> * Set recovery / backfill throttles, as well as heartbeat and OSD map
>>   processing tweaks, in the /etc/ceph/ceph.conf file under the [osd] section:
>>     [osd]
>>     osd_max_backfills = 1
>>     osd_recovery_max_active = 1
>>     osd_recovery_max_single_start = 1
>>     osd_backfill_scan_min = 8
>>     osd_heartbeat_interval = 36
>>     osd_heartbeat_grace = 240
>>     osd_map_message_max = 1000
>>     osd_map_cache_size = 3136
>> * Start all OSDs
>> * Monitor 'top' for 0% CPU on all OSD processes.. it may take a while.
>>   I usually run 'top', then press the keys M and c:
>>   - M = sort by memory usage
>>   - c = show command arguments
>>   This makes it easy to watch the OSD processes and see which OSDs have
>>   settled, etc. (a scripted version of this wait is sketched after the
>>   last step below)
>> * Once all OSDs have hit 0% CPU utilization, remove the 'noup' flag:
>>   - ceph osd unset noup
>> * Again, wait for 0% CPU utilization (may be immediate, may take a
>>   while.. just gotta wait)
>> * Once all OSDs have hit 0% CPU again, remove the 'noin' flag:
>>   - ceph osd unset noin
>>   All OSDs should now appear up/in, and will go through peering.
>> * Once ceph -s shows no further activity, and OSDs are back at 0% CPU
>>   again, unset 'nobackfill':
>>   - ceph osd unset nobackfill
>> * Once ceph -s shows no further activity, and OSDs are back at 0% CPU
>>   again, unset 'norecover':
>>   - ceph osd unset norecover
>> * Monitor OSD memory usage... some OSDs may get killed off again, but
>>   each subsequent restart should consume less memory and allow more
>>   recovery to occur between the steps above.. and ultimately, hopefully,
>>   your entire cluster will come back online and be usable.
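>>
>> If you'd rather script the wait than sit watching 'top', here's a rough,
>> untested sketch of the same idle check (my own assumption-laden example,
>> not a supported tool; Linux-only, relies on pgrep and on fields 14/15 of
>> /proc/<pid>/stat being utime/stime). It samples each ceph-osd's
>> accumulated CPU time twice, 5 seconds apart, and declares the set idle
>> when nothing moved:
>>
>>     # Snapshot "pid utime+stime" for every ceph-osd process.
>>     snapshot() {
>>         for pid in $(pgrep -x ceph-osd); do
>>             # Fields 14 and 15 of /proc/<pid>/stat are utime and stime (jiffies).
>>             awk -v p="$pid" '{print p, $14 + $15}' "/proc/$pid/stat" 2>/dev/null
>>         done
>>     }
>>     while :; do
>>         before=$(snapshot); sleep 5; after=$(snapshot)
>>         # Identical snapshots mean no OSD burned CPU during the interval.
>>         if [ "$before" = "$after" ]; then
>>             echo "All ceph-osd processes look idle."; break
>>         fi
>>         echo "OSDs still busy, waiting..."
>>     done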
>>
>> ## Clean-up:
>> * Remove all of the above options from ceph.conf
>> * Reset the running OSDs to their defaults:
>>     ceph tell osd.\* injectargs '--osd_max_backfills 10
>>     --osd_recovery_max_active 15 --osd_recovery_max_single_start 5
>>     --osd_backfill_scan_min 64 --osd_heartbeat_interval 6
>>     --osd_heartbeat_grace 36 --osd_map_message_max 100
>>     --osd_map_cache_size 500'
>> * Unset the noscrub and nodeep-scrub flags:
>>   - ceph osd unset noscrub
>>   - ceph osd unset nodeep-scrub
>>
>> ## For help identifying why memory usage was so high, please provide:
>> * ceph osd dump | grep pool
>> * ceph osd crush rule dump
>>
>> Let us know if this helps... I know it looks extreme, but it's worked for
>> me in the past..
>>
>> Michael J. Kidd
>> Sr. Storage Consultant
>> Inktank Professional Services - by Red Hat
>>
>> On Wed, Oct 29, 2014 at 8:51 AM, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
>>
>>> Hello,
>>> I've found my Ceph v0.80.3 cluster in a state with 5 of its 34 OSDs down
>>> overnight, after months of running without change. From the Linux logs I
>>> found out the OSD processes were killed because they consumed all
>>> available memory.
>>>
>>> The 5 failed OSDs were spread across different hosts of my 4-node
>>> cluster (see below). Two hosts act as an SSD cache tier in some of my
>>> pools; the other two hosts hold the default rotational-drive storage.
>>>
>>> After checking that Linux was not out of memory, I attempted to restart
>>> the failed OSDs. Most of the OSD daemons exhausted all memory within
>>> seconds and got killed by Linux again:
>>>
>>> Oct 28 22:16:34 q07 kernel: Out of memory: Kill process 24207 (ceph-osd)
>>> score 867 or sacrifice child
>>> Oct 28 22:16:34 q07 kernel: Killed process 24207, UID 0, (ceph-osd)
>>> total-vm:59974412kB, anon-rss:59076880kB, file-rss:512kB
>>>
>>> On the host I found lots of similar "slow request" messages preceding
>>> the crash:
>>>
>>> 2014-10-28 22:11:20.885527 7f25f84d1700 0 log [WRN] : slow request
>>> 31.117125 seconds old, received at 2014-10-28 22:10:49.768291:
>>> osd_sub_op(client.168752.0:2197931 14.2c7
>>> 888596c7/rbd_data.293272f8695e4.000000000000006f/head//14 [] v 1551'377417
>>> snapset=0=[]:[] snapc=0=[]) v10 currently no flag points reached
>>> 2014-10-28 22:11:21.885668 7f25f84d1700 0 log [WRN] : 67 slow requests,
>>> 1 included below; oldest blocked for > 9879.304770 secs
>>>
>>> Apparently I can't get the cluster fixed by restarting the OSDs over and
>>> over again. Is there any other option, then?
>>>
>>> Thank you.
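>>>
>>> (For reference, this is roughly the kind of ad-hoc loop one can run to
>>> watch each daemon's resident memory climb before the OOM killer fires —
>>> just a sketch, assuming procps 'ps'; rss is reported in kB:)
>>>
>>>     # Print each ceph-osd's resident memory, largest first, every 10 seconds.
>>>     while sleep 10; do
>>>         date
>>>         ps -C ceph-osd -o pid=,rss=,args= | sort -k2 -rn
>>>     done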
>>>
>>> Lukas Kubin
>>>
>>> [root@q04 ~]# ceph -s
>>>     cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99
>>>      health HEALTH_ERR 9 pgs backfill; 1 pgs backfilling; 521 pgs
>>> degraded; 425 pgs incomplete; 13 pgs inconsistent; 20 pgs recovering;
>>> 50 pgs recovery_wait; 151 pgs stale; 425 pgs stuck inactive; 151 pgs
>>> stuck stale; 1164 pgs stuck unclean; 12070270 requests are blocked > 32
>>> sec; recovery 887322/35206223 objects degraded (2.520%); 119/17131232
>>> unfound (0.001%); 13 scrub errors
>>>      monmap e2: 3 mons at {q03=10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0},
>>> election epoch 90, quorum 0,1,2 q03,q04,q05
>>>      osdmap e2194: 34 osds: 31 up, 31 in
>>>       pgmap v7429812: 5632 pgs, 7 pools, 1446 GB data, 16729 kobjects
>>>             2915 GB used, 12449 GB / 15365 GB avail
>>>             887322/35206223 objects degraded (2.520%); 119/17131232 unfound (0.001%)
>>>                   38 active+recovery_wait+remapped
>>>                 4455 active+clean
>>>                   65 stale+incomplete
>>>                    3 active+recovering+remapped
>>>                  359 incomplete
>>>                   12 active+recovery_wait
>>>                  139 active+remapped
>>>                   86 stale+active+degraded
>>>                   16 active+recovering
>>>                    1 active+remapped+backfilling
>>>                   13 active+clean+inconsistent
>>>                    9 active+remapped+wait_backfill
>>>                  434 active+degraded
>>>                    1 remapped+incomplete
>>>                    1 active+recovering+degraded+remapped
>>>   client io 0 B/s rd, 469 kB/s wr, 48 op/s
>>>
>>> [root@q04 ~]# ceph osd tree
>>> # id  weight  type name       up/down reweight
>>> -5    3.24    root ssd
>>> -6    1.62            host q06
>>> 16    0.18                    osd.16  up      1
>>> 17    0.18                    osd.17  up      1
>>> 18    0.18                    osd.18  up      1
>>> 19    0.18                    osd.19  up      1
>>> 20    0.18                    osd.20  up      1
>>> 21    0.18                    osd.21  up      1
>>> 22    0.18                    osd.22  up      1
>>> 23    0.18                    osd.23  up      1
>>> 24    0.18                    osd.24  up      1
>>> -7    1.62            host q07
>>> 25    0.18                    osd.25  up      1
>>> 26    0.18                    osd.26  up      1
>>> 27    0.18                    osd.27  up      1
>>> 28    0.18                    osd.28  up      1
>>> 29    0.18                    osd.29  up      1
>>> 30    0.18                    osd.30  up      1
>>> 31    0.18                    osd.31  up      1
>>> 32    0.18                    osd.32  up      1
>>> 33    0.18                    osd.33  up      1
>>> -1    14.56   root default
>>> -4    14.56   root sata
>>> -2    7.28            host q08
>>> 0     0.91                    osd.0   up      1
>>> 1     0.91                    osd.1   up      1
>>> 2     0.91                    osd.2   up      1
>>> 3     0.91                    osd.3   up      1
>>> 11    0.91                    osd.11  up      1
>>> 12    0.91                    osd.12  up      1
>>> 13    0.91                    osd.13  down    0
>>> 14    0.91                    osd.14  up      1
>>> -3    7.28            host q09
>>> 4     0.91                    osd.4   up      1
>>> 5     0.91                    osd.5   up      1
>>> 6     0.91                    osd.6   up      1
>>> 7     0.91                    osd.7   up      1
>>> 8     0.91                    osd.8   down    0
>>> 9     0.91                    osd.9   up      1
>>> 10    0.91                    osd.10  down    0
>>> 15    0.91                    osd.15  up      1
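P.S. A variant of the 'set in' loop that skips parsing the tree output, if
'ceph osd ls' is available on your release (it should be on firefly, but
please verify):

    # 'ceph osd ls' prints bare numeric OSD ids, one per line.
    for id in $(ceph osd ls); do ceph osd in osd.$id; done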
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com