Can anyone comment on this issue, please? I can't seem to bring my cluster back to a healthy state.
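For anyone picking this thread up, the workaround Caspar suggested further down boils down to the following sketch (his example values; I haven't confirmed yet that this lets the pending/inactive PGs activate):

  # ceph.conf, [global] section (temporary override while recovering)
  mon max pg per osd = 1000
  osd max pg per osd hard ratio = 5

  # inject into the running daemons, then restart the mgrs
  ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'
  ceph tell mon.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
  ceph tell osd.* injectargs '--mon_max_pg_per_osd 1000'
  ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'

  # verify the values are active
  ceph daemon mon.ceph-mon01 config show | grep pg_per_osd

The intent is only to lift the limit temporarily so PG creation isn't blocked; I still plan to bring the PGs-per-OSD count back down once the cluster is healthy.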
On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA <arun.poo...@nuagenetworks.net> wrote: > Hi Caspar, > > The number of IOPS is also quite low. It used to be around 1K-plus on one of > the pools (vms); now it's closer to 10-30. > > Thanks > Arun > > On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA <arun.poo...@nuagenetworks.net> > wrote: > >> Hi Caspar, >> >> Yes and no, the numbers are going up and down. If I run the ceph -s command I can >> see them decrease at one point and then increase again. I see there are so >> many blocked/slow requests. Almost all the OSDs have slow requests. Around >> 12% of the PGs are inactive; not sure how to activate them again. >> >> >> [root@fre101 ~]# ceph health detail >> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0) >> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to >> bind the UNIX domain socket to >> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2) >> No such file or directory >> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than >> average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on >> creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down, 86 >> pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 >> objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow >> requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec; >> too many PGs per OSD (2709 > max 200) >> OSD_DOWN 1 osds down >> osd.28 (root=default,host=fre119) is down >> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average >> pool glance-images objects per pg (10478) is more than 92.7257 times >> cluster average (113) >> pool vms objects per pg (4717) is more than 41.7434 times cluster >> average (113) >> pool volumes objects per pg (1220) is more than 10.7965 times cluster >> average (113) >> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%) >> PENDING_CREATING_PGS 3610 PGs pending on creation >> osds >> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9] >> have pending PGs. 
>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs >> down, 86 pgs peering, 850 pgs stale >> pg 10.900 is down, acting [18] >> pg 10.90e is stuck inactive for 60266.030164, current state >> activating, last acting [2,38] >> pg 10.913 is stuck stale for 1887.552862, current state stale+down, >> last acting [9] >> pg 10.915 is stuck inactive for 60266.215231, current state >> activating, last acting [30,38] >> pg 11.903 is stuck inactive for 59294.465961, current state >> activating, last acting [11,38] >> pg 11.910 is down, acting [21] >> pg 11.919 is down, acting [25] >> pg 12.902 is stuck inactive for 57118.544590, current state >> activating, last acting [36,14] >> pg 13.8f8 is stuck inactive for 60707.167787, current state >> activating, last acting [29,37] >> pg 13.901 is stuck stale for 60226.543289, current state >> stale+active+clean, last acting [1,31] >> pg 13.905 is stuck inactive for 60266.050940, current state >> activating, last acting [2,36] >> pg 13.909 is stuck inactive for 60707.160714, current state >> activating, last acting [34,36] >> pg 13.90e is stuck inactive for 60707.410749, current state >> activating, last acting [21,36] >> pg 13.911 is down, acting [25] >> pg 13.914 is stale+down, acting [29] >> pg 13.917 is stuck stale for 580.224688, current state stale+down, >> last acting [16] >> pg 14.901 is stuck inactive for 60266.037762, current state >> activating+degraded, last acting [22,37] >> pg 14.90f is stuck inactive for 60296.996447, current state >> activating, last acting [30,36] >> pg 14.910 is stuck inactive for 60266.077310, current state >> activating+degraded, last acting [17,37] >> pg 14.915 is stuck inactive for 60266.032445, current state >> activating, last acting [34,36] >> pg 15.8fa is stuck stale for 560.223249, current state stale+down, >> last acting [8] >> pg 15.90c is stuck inactive for 59294.402388, current state >> activating, last acting [29,38] >> pg 15.90d is stuck inactive for 60266.176492, current state >> activating, last acting [5,36] >> pg 15.915 is down, acting [0] >> pg 15.917 is stuck inactive for 56279.658951, current state >> activating, last acting [13,38] >> pg 15.91c is stuck stale for 374.590704, current state stale+down, >> last acting [12] >> pg 16.903 is stuck inactive for 56580.905961, current state >> activating, last acting [25,37] >> pg 16.90e is stuck inactive for 60266.271680, current state >> activating, last acting [14,37] >> pg 16.919 is stuck inactive for 59901.802184, current state >> activating, last acting [20,37] >> pg 16.91e is stuck inactive for 60297.038159, current state >> activating, last acting [22,37] >> pg 17.8e5 is stuck inactive for 60266.149061, current state >> activating, last acting [25,36] >> pg 17.910 is stuck inactive for 59901.850204, current state >> activating, last acting [26,37] >> pg 17.913 is stuck inactive for 60707.208364, current state >> activating, last acting [13,36] >> pg 17.91a is stuck inactive for 60266.187509, current state >> activating, last acting [4,37] >> pg 17.91f is down, acting [6] >> pg 18.908 is stuck inactive for 60707.216314, current state >> activating, last acting [10,36] >> pg 18.911 is stuck stale for 244.570413, current state stale+down, >> last acting [34] >> pg 18.919 is stuck inactive for 60265.980816, current state >> activating, last acting [28,36] >> pg 18.91a is stuck inactive for 59901.814714, current state >> activating, last acting [28,37] >> pg 18.91e is stuck inactive for 60707.179338, current state >> activating, last 
acting [0,36] >> pg 19.90a is stuck inactive for 60203.089988, current state >> activating, last acting [35,38] >> pg 20.8e0 is stuck inactive for 60296.839098, current state >> activating+degraded, last acting [18,37] >> pg 20.913 is stuck inactive for 60296.977401, current state >> activating+degraded, last acting [11,37] >> pg 20.91d is stuck inactive for 60296.891370, current state >> activating+degraded, last acting [10,38] >> pg 21.8e1 is stuck inactive for 60707.422330, current state >> activating, last acting [21,38] >> pg 21.907 is stuck inactive for 60296.855511, current state >> activating, last acting [20,36] >> pg 21.90e is stuck inactive for 60266.055557, current state >> activating, last acting [1,38] >> pg 21.917 is stuck inactive for 60296.940074, current state >> activating, last acting [15,36] >> pg 22.90b is stuck inactive for 60707.286070, current state >> activating, last acting [20,36] >> pg 22.90c is stuck inactive for 59901.788199, current state >> activating, last acting [20,37] >> pg 22.90f is stuck inactive for 60297.062020, current state >> activating, last acting [38,35] >> PG_DEGRADED Degraded data redundancy: 216694/12392654 objects degraded >> (1.749%), 866 pgs degraded, 16 pgs undersized >> pg 12.85a is active+undersized+degraded, acting [3] >> pg 14.843 is activating+degraded, acting [7,38] >> pg 14.85f is activating+degraded, acting [25,36] >> pg 14.865 is activating+degraded, acting [33,37] >> pg 14.87a is activating+degraded, acting [28,36] >> pg 14.87e is activating+degraded, acting [17,38] >> pg 14.882 is activating+degraded, acting [4,36] >> pg 14.88a is activating+degraded, acting [2,37] >> pg 14.893 is activating+degraded, acting [24,36] >> pg 14.897 is active+undersized+degraded, acting [34] >> pg 14.89c is activating+degraded, acting [14,38] >> pg 14.89e is activating+degraded, acting [15,38] >> pg 14.8a8 is active+undersized+degraded, acting [33] >> pg 14.8b1 is activating+degraded, acting [30,38] >> pg 14.8d4 is active+undersized+degraded, acting [13] >> pg 14.8d8 is active+undersized+degraded, acting [4] >> pg 14.8e6 is active+undersized+degraded, acting [10] >> pg 14.8e7 is active+undersized+degraded, acting [1] >> pg 14.8ef is activating+degraded, acting [9,36] >> pg 14.8f8 is active+undersized+degraded, acting [30] >> pg 14.901 is activating+degraded, acting [22,37] >> pg 14.910 is activating+degraded, acting [17,37] >> pg 14.913 is active+undersized+degraded, acting [18] >> pg 20.821 is activating+degraded, acting [37,33] >> pg 20.825 is activating+degraded, acting [25,36] >> pg 20.84f is active+undersized+degraded, acting [2] >> pg 20.85a is active+undersized+degraded, acting [11] >> pg 20.85f is activating+degraded, acting [1,38] >> pg 20.865 is activating+degraded, acting [8,38] >> pg 20.869 is activating+degraded, acting [27,37] >> pg 20.87b is active+undersized+degraded, acting [30] >> pg 20.88b is activating+degraded, acting [6,38] >> pg 20.895 is activating+degraded, acting [37,27] >> pg 20.89c is activating+degraded, acting [1,36] >> pg 20.8a3 is activating+degraded, acting [30,36] >> pg 20.8ad is activating+degraded, acting [1,38] >> pg 20.8af is activating+degraded, acting [33,37] >> pg 20.8b7 is activating+degraded, acting [0,38] >> pg 20.8b9 is activating+degraded, acting [20,38] >> pg 20.8d4 is activating+degraded, acting [28,37] >> pg 20.8d5 is activating+degraded, acting [24,37] >> pg 20.8e0 is activating+degraded, acting [18,37] >> pg 20.8e3 is activating+degraded, acting [21,38] >> pg 20.8ea is activating+degraded, acting 
[17,36] >> pg 20.8ee is active+undersized+degraded, acting [4] >> pg 20.8f2 is activating+degraded, acting [3,36] >> pg 20.8fb is activating+degraded, acting [10,38] >> pg 20.8fc is activating+degraded, acting [20,38] >> pg 20.913 is activating+degraded, acting [11,37] >> pg 20.916 is active+undersized+degraded, acting [21] >> pg 20.91d is activating+degraded, acting [10,38] >> REQUEST_SLOW 116082 slow requests are blocked > 32 sec >> 10619 ops are blocked > 2097.15 sec >> 74227 ops are blocked > 1048.58 sec >> 18561 ops are blocked > 524.288 sec >> 10862 ops are blocked > 262.144 sec >> 1037 ops are blocked > 131.072 sec >> 520 ops are blocked > 65.536 sec >> 256 ops are blocked > 32.768 sec >> osd.29 has blocked requests > 32.768 sec >> osd.15 has blocked requests > 262.144 sec >> osds 12,13,31 have blocked requests > 524.288 sec >> osds 1,8,16,19,23,25,26,33,37,38 have blocked requests > 1048.58 sec >> osds 3,4,5,6,10,14,17,22,27,30,32,35,36 have blocked requests > >> 2097.15 sec >> REQUEST_STUCK 551 stuck requests are blocked > 4096 sec >> 551 ops are blocked > 4194.3 sec >> osds 0,28 have stuck requests > 4194.3 sec >> TOO_MANY_PGS too many PGs per OSD (2709 > max 200) >> [root@fre101 ~]# >> [root@fre101 ~]# >> [root@fre101 ~]# >> [root@fre101 ~]# >> [root@fre101 ~]# >> [root@fre101 ~]# >> [root@fre101 ~]# ceph -s >> 2019-01-04 05:39:29.364100 7f0fb32f2700 -1 asok(0x7f0fac0017a0) >> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to >> bind the UNIX domain socket to >> '/var/run/ceph-guests/ceph-client.admin.1066635.139705286924624.asok': (2) >> No such file or directory >> cluster: >> id: adb9ad8e-f458-4124-bf58-7963a8d1391f >> health: HEALTH_ERR >> 3 pools have many more objects per pg than average >> 473825/12392654 objects misplaced (3.823%) >> 3723 PGs pending on creation >> Reduced data availability: 6677 pgs inactive, 1948 pgs down, >> 157 pgs peering, 850 pgs stale >> Degraded data redundancy: 306567/12392654 objects degraded >> (2.474%), 949 pgs degraded, 16 pgs undersized >> 98047 slow requests are blocked > 32 sec >> 33 stuck requests are blocked > 4096 sec >> too many PGs per OSD (2690 > max 200) >> >> services: >> mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 >> mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02 >> osd: 39 osds: 39 up, 39 in; 76 remapped pgs >> rgw: 1 daemon active >> >> data: >> pools: 18 pools, 54656 pgs >> objects: 6051k objects, 10944 GB >> usage: 21934 GB used, 50687 GB / 72622 GB avail >> pgs: 13.267% pgs not active >> 306567/12392654 objects degraded (2.474%) >> 473825/12392654 objects misplaced (3.823%) >> 44937 active+clean >> 3850 activating >> 1936 active+undersized >> 1078 down >> 864 stale+down >> 597 peering >> 591 activating+degraded >> 316 active+undersized+degraded >> 205 stale+active+clean >> 133 stale+activating >> 67 activating+remapped >> 32 stale+activating+degraded >> 21 stale+activating+remapped >> 9 stale+active+undersized >> 6 down+remapped >> 5 stale+activating+undersized+degraded+remapped >> 2 activating+degraded+remapped >> 1 stale+activating+degraded+remapped >> 1 stale+active+undersized+degraded >> 1 remapped+peering >> 1 active+clean+remapped >> 1 stale+remapped+peering >> 1 stale+peering >> 1 activating+undersized+degraded+remapped >> >> io: >> client: 0 B/s rd, 23566 B/s wr, 0 op/s rd, 3 op/s wr >> >> Thanks >> >> Arun >> >> On Fri, Jan 4, 2019 at 5:38 AM Caspar Smit <caspars...@supernas.eu> >> wrote: >> >>> Are the numbers still decreasing? 
>>> >>> This one for instance: >>> >>> "3883 PGs pending on creation" >>> >>> Caspar >>> >>> >>> Op vr 4 jan. 2019 om 14:23 schreef Arun POONIA < >>> arun.poo...@nuagenetworks.net>: >>> >>>> Hi Caspar, >>>> >>>> Yes, cluster was working fine with number of PGs per OSD warning up >>>> until now. I am not sure how to recover from stale down/inactive PGs. If >>>> you happen to know about this can you let me know? >>>> >>>> Current State: >>>> >>>> [root@fre101 ~]# ceph -s >>>> 2019-01-04 05:22:05.942349 7f314f613700 -1 asok(0x7f31480017a0) >>>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to >>>> bind the UNIX domain socket to >>>> '/var/run/ceph-guests/ceph-client.admin.1053724.139849638091088.asok': (2) >>>> No such file or directory >>>> cluster: >>>> id: adb9ad8e-f458-4124-bf58-7963a8d1391f >>>> health: HEALTH_ERR >>>> 3 pools have many more objects per pg than average >>>> 505714/12392650 objects misplaced (4.081%) >>>> 3883 PGs pending on creation >>>> Reduced data availability: 6519 pgs inactive, 1870 pgs >>>> down, 1 pg peering, 886 pgs stale >>>> Degraded data redundancy: 42987/12392650 objects degraded >>>> (0.347%), 634 pgs degraded, 16 pgs undersized >>>> 125827 slow requests are blocked > 32 sec >>>> 2 stuck requests are blocked > 4096 sec >>>> too many PGs per OSD (2758 > max 200) >>>> >>>> services: >>>> mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 >>>> mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02 >>>> osd: 39 osds: 39 up, 39 in; 76 remapped pgs >>>> rgw: 1 daemon active >>>> >>>> data: >>>> pools: 18 pools, 54656 pgs >>>> objects: 6051k objects, 10944 GB >>>> usage: 21933 GB used, 50688 GB / 72622 GB avail >>>> pgs: 11.927% pgs not active >>>> 42987/12392650 objects degraded (0.347%) >>>> 505714/12392650 objects misplaced (4.081%) >>>> 48080 active+clean >>>> 3885 activating >>>> 1111 down >>>> 759 stale+down >>>> 614 activating+degraded >>>> 74 activating+remapped >>>> 46 stale+active+clean >>>> 35 stale+activating >>>> 21 stale+activating+remapped >>>> 9 stale+active+undersized >>>> 9 stale+activating+degraded >>>> 5 stale+activating+undersized+degraded+remapped >>>> 3 activating+degraded+remapped >>>> 1 stale+activating+degraded+remapped >>>> 1 stale+active+undersized+degraded >>>> 1 remapped+peering >>>> 1 active+clean+remapped >>>> 1 activating+undersized+degraded+remapped >>>> >>>> io: >>>> client: 0 B/s rd, 25397 B/s wr, 4 op/s rd, 4 op/s wr >>>> >>>> I will update number of PGs per OSD once these inactive or stale PGs >>>> come online. I am not able to access VMs (VMs, Images) which are using >>>> Ceph. >>>> >>>> Thanks >>>> Arun >>>> >>>> On Fri, Jan 4, 2019 at 4:53 AM Caspar Smit <caspars...@supernas.eu> >>>> wrote: >>>> >>>>> Hi Arun, >>>>> >>>>> How did you end up with a 'working' cluster with so many pgs per OSD? >>>>> >>>>> "too many PGs per OSD (2968 > max 200)" >>>>> >>>>> To (temporarily) allow this kind of pgs per osd you could try this: >>>>> >>>>> Change these values in the global section in your ceph.conf: >>>>> >>>>> mon max pg per osd = 200 >>>>> osd max pg per osd hard ratio = 2 >>>>> >>>>> It allows 200*2 = 400 Pgs per OSD before disabling the creation of new >>>>> pgs. 
>>>>> >>>>> Above are the defaults (for Luminous, maybe other versions too) >>>>> You can check your current settings with: >>>>> >>>>> ceph daemon mon.ceph-mon01 config show |grep pg_per_osd >>>>> >>>>> Since your current pgs per osd ratio is way higher than the default >>>>> you could set them to, for instance: >>>>> >>>>> mon max pg per osd = 1000 >>>>> osd max pg per osd hard ratio = 5 >>>>> >>>>> Which allows for 5000 pgs per osd before disabling creation of new pgs. >>>>> >>>>> You'll need to inject the settings into the mons/osds and restart mgrs >>>>> to make them active. >>>>> >>>>> ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000' >>>>> ceph tell mon.* injectargs '--osd_max_pg_per_osd_hard_ratio 5' >>>>> ceph tell osd.* injectargs '--mon_max_pg_per_osd 1000' >>>>> ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5' >>>>> restart mgrs >>>>> >>>>> Kind regards, >>>>> Caspar >>>>> >>>>> >>>>> Op vr 4 jan. 2019 om 04:28 schreef Arun POONIA < >>>>> arun.poo...@nuagenetworks.net>: >>>>> >>>>>> Hi Chris, >>>>>> >>>>>> Indeed, that's what happened. I didn't set the noout flag either, and I >>>>>> zapped the disks on the new server every time. In my cluster status fre201 is the only >>>>>> new server. >>>>>> >>>>>> Current status after enabling 3 OSDs on the fre201 host: >>>>>> >>>>>> [root@fre201 ~]# ceph osd tree >>>>>> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF >>>>>> -1 70.92137 root default >>>>>> -2 5.45549 host fre101 >>>>>> 0 hdd 1.81850 osd.0 up 1.00000 1.00000 >>>>>> 1 hdd 1.81850 osd.1 up 1.00000 1.00000 >>>>>> 2 hdd 1.81850 osd.2 up 1.00000 1.00000 >>>>>> -9 5.45549 host fre103 >>>>>> 3 hdd 1.81850 osd.3 up 1.00000 1.00000 >>>>>> 4 hdd 1.81850 osd.4 up 1.00000 1.00000 >>>>>> 5 hdd 1.81850 osd.5 up 1.00000 1.00000 >>>>>> -3 5.45549 host fre105 >>>>>> 6 hdd 1.81850 osd.6 up 1.00000 1.00000 >>>>>> 7 hdd 1.81850 osd.7 up 1.00000 1.00000 >>>>>> 8 hdd 1.81850 osd.8 up 1.00000 1.00000 >>>>>> -4 5.45549 host fre107 >>>>>> 9 hdd 1.81850 osd.9 up 1.00000 1.00000 >>>>>> 10 hdd 1.81850 osd.10 up 1.00000 1.00000 >>>>>> 11 hdd 1.81850 osd.11 up 1.00000 1.00000 >>>>>> -5 5.45549 host fre109 >>>>>> 12 hdd 1.81850 osd.12 up 1.00000 1.00000 >>>>>> 13 hdd 1.81850 osd.13 up 1.00000 1.00000 >>>>>> 14 hdd 1.81850 osd.14 up 1.00000 1.00000 >>>>>> -6 5.45549 host fre111 >>>>>> 15 hdd 1.81850 osd.15 up 1.00000 1.00000 >>>>>> 16 hdd 1.81850 osd.16 up 1.00000 1.00000 >>>>>> 17 hdd 1.81850 osd.17 up 0.79999 1.00000 >>>>>> -7 5.45549 host fre113 >>>>>> 18 hdd 1.81850 osd.18 up 1.00000 1.00000 >>>>>> 19 hdd 1.81850 osd.19 up 1.00000 1.00000 >>>>>> 20 hdd 1.81850 osd.20 up 1.00000 1.00000 >>>>>> -8 5.45549 host fre115 >>>>>> 21 hdd 1.81850 osd.21 up 1.00000 1.00000 >>>>>> 22 hdd 1.81850 osd.22 up 1.00000 1.00000 >>>>>> 23 hdd 1.81850 osd.23 up 1.00000 1.00000 >>>>>> -10 5.45549 host fre117 >>>>>> 24 hdd 1.81850 osd.24 up 1.00000 1.00000 >>>>>> 25 hdd 1.81850 osd.25 up 1.00000 1.00000 >>>>>> 26 hdd 1.81850 osd.26 up 1.00000 1.00000 >>>>>> -11 5.45549 host fre119 >>>>>> 27 hdd 1.81850 osd.27 up 1.00000 1.00000 >>>>>> 28 hdd 1.81850 osd.28 up 1.00000 1.00000 >>>>>> 29 hdd 1.81850 osd.29 up 1.00000 1.00000 >>>>>> -12 5.45549 host fre121 >>>>>> 30 hdd 1.81850 osd.30 up 1.00000 1.00000 >>>>>> 31 hdd 1.81850 osd.31 up 1.00000 1.00000 >>>>>> 32 hdd 1.81850 osd.32 up 1.00000 1.00000 >>>>>> -13 5.45549 host fre123 >>>>>> 33 hdd 1.81850 osd.33 up 1.00000 1.00000 >>>>>> 34 hdd 1.81850 osd.34 up 1.00000 1.00000 >>>>>> 35 hdd 1.81850 osd.35 up 1.00000 1.00000 >>>>>> -27 5.45549 host fre201 >>>>>> 36 hdd 1.81850 osd.36 
up 1.00000 1.00000 >>>>>> 37 hdd 1.81850 osd.37 up 1.00000 1.00000 >>>>>> 38 hdd 1.81850 osd.38 up 1.00000 1.00000 >>>>>> [root@fre201 ~]# >>>>>> [root@fre201 ~]# >>>>>> [root@fre201 ~]# >>>>>> [root@fre201 ~]# >>>>>> [root@fre201 ~]# >>>>>> [root@fre201 ~]# ceph -s >>>>>> cluster: >>>>>> id: adb9ad8e-f458-4124-bf58-7963a8d1391f >>>>>> health: HEALTH_ERR >>>>>> 3 pools have many more objects per pg than average >>>>>> 585791/12391450 objects misplaced (4.727%) >>>>>> 2 scrub errors >>>>>> 2374 PGs pending on creation >>>>>> Reduced data availability: 6578 pgs inactive, 2025 pgs >>>>>> down, 74 pgs peering, 1234 pgs stale >>>>>> Possible data damage: 2 pgs inconsistent >>>>>> Degraded data redundancy: 64969/12391450 objects degraded >>>>>> (0.524%), 616 pgs degraded, 20 pgs undersized >>>>>> 96242 slow requests are blocked > 32 sec >>>>>> 228 stuck requests are blocked > 4096 sec >>>>>> too many PGs per OSD (2768 > max 200) >>>>>> >>>>>> services: >>>>>> mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 >>>>>> mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02 >>>>>> osd: 39 osds: 39 up, 39 in; 96 remapped pgs >>>>>> rgw: 1 daemon active >>>>>> >>>>>> data: >>>>>> pools: 18 pools, 54656 pgs >>>>>> objects: 6050k objects, 10942 GB >>>>>> usage: 21900 GB used, 50721 GB / 72622 GB avail >>>>>> pgs: 0.002% pgs unknown >>>>>> 12.050% pgs not active >>>>>> 64969/12391450 objects degraded (0.524%) >>>>>> 585791/12391450 objects misplaced (4.727%) >>>>>> 47489 active+clean >>>>>> 3670 activating >>>>>> 1098 stale+down >>>>>> 923 down >>>>>> 575 activating+degraded >>>>>> 563 stale+active+clean >>>>>> 105 stale+activating >>>>>> 78 activating+remapped >>>>>> 72 peering >>>>>> 25 stale+activating+degraded >>>>>> 23 stale+activating+remapped >>>>>> 9 stale+active+undersized >>>>>> 6 stale+activating+undersized+degraded+remapped >>>>>> 5 stale+active+undersized+degraded >>>>>> 4 down+remapped >>>>>> 4 activating+degraded+remapped >>>>>> 2 active+clean+inconsistent >>>>>> 1 stale+activating+degraded+remapped >>>>>> 1 stale+active+clean+remapped >>>>>> 1 stale+remapped+peering >>>>>> 1 remapped+peering >>>>>> 1 unknown >>>>>> >>>>>> io: >>>>>> client: 0 B/s rd, 208 kB/s wr, 22 op/s rd, 22 op/s wr >>>>>> >>>>>> >>>>>> >>>>>> Thanks >>>>>> Arun >>>>>> >>>>>> >>>>>> On Thu, Jan 3, 2019 at 7:19 PM Chris <bitskr...@bitskrieg.net> wrote: >>>>>> >>>>>>> If you added OSDs and then deleted them repeatedly without waiting >>>>>>> for replication to finish as the cluster attempted to re-balance across >>>>>>> them, its highly likely that you are permanently missing PGs >>>>>>> (especially if >>>>>>> the disks were zapped each time). >>>>>>> >>>>>>> If those 3 down OSDs can be revived there is a (small) chance that >>>>>>> you can right the ship, but 1400pg/OSD is pretty extreme. I'm >>>>>>> surprised the cluster even let you do that - this sounds like a data >>>>>>> loss >>>>>>> event. >>>>>>> >>>>>>> Bring back the 3 OSD and see what those 2 inconsistent pgs look like >>>>>>> with ceph pg query. >>>>>>> >>>>>>> On January 3, 2019 21:59:38 Arun POONIA < >>>>>>> arun.poo...@nuagenetworks.net> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Recently I tried adding a new node (OSD) to ceph cluster using >>>>>>>> ceph-deploy tool. Since I was experimenting with tool and ended up >>>>>>>> deleting >>>>>>>> OSD nodes on new server couple of times. >>>>>>>> >>>>>>>> Now since ceph OSDs are running on new server cluster PGs seems to >>>>>>>> be inactive (10-15%) and they are not recovering or rebalancing. 
Not >>>>>>>> sure >>>>>>>> what to do. I tried shutting down OSDs on new server. >>>>>>>> >>>>>>>> Status: >>>>>>>> [root@fre105 ~]# ceph -s >>>>>>>> 2019-01-03 18:56:42.867081 7fa0bf573700 -1 asok(0x7fa0b80017a0) >>>>>>>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: >>>>>>>> failed to >>>>>>>> bind the UNIX domain socket to >>>>>>>> '/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': >>>>>>>> (2) >>>>>>>> No such file or directory >>>>>>>> cluster: >>>>>>>> id: adb9ad8e-f458-4124-bf58-7963a8d1391f >>>>>>>> health: HEALTH_ERR >>>>>>>> 3 pools have many more objects per pg than average >>>>>>>> 373907/12391198 objects misplaced (3.018%) >>>>>>>> 2 scrub errors >>>>>>>> 9677 PGs pending on creation >>>>>>>> Reduced data availability: 7145 pgs inactive, 6228 pgs >>>>>>>> down, 1 pg peering, 2717 pgs stale >>>>>>>> Possible data damage: 2 pgs inconsistent >>>>>>>> Degraded data redundancy: 178350/12391198 objects >>>>>>>> degraded (1.439%), 346 pgs degraded, 1297 pgs undersized >>>>>>>> 52486 slow requests are blocked > 32 sec >>>>>>>> 9287 stuck requests are blocked > 4096 sec >>>>>>>> too many PGs per OSD (2968 > max 200) >>>>>>>> >>>>>>>> services: >>>>>>>> mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 >>>>>>>> mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02 >>>>>>>> osd: 39 osds: 36 up, 36 in; 51 remapped pgs >>>>>>>> rgw: 1 daemon active >>>>>>>> >>>>>>>> data: >>>>>>>> pools: 18 pools, 54656 pgs >>>>>>>> objects: 6050k objects, 10941 GB >>>>>>>> usage: 21727 GB used, 45308 GB / 67035 GB avail >>>>>>>> pgs: 13.073% pgs not active >>>>>>>> 178350/12391198 objects degraded (1.439%) >>>>>>>> 373907/12391198 objects misplaced (3.018%) >>>>>>>> 46177 active+clean >>>>>>>> 5054 down >>>>>>>> 1173 stale+down >>>>>>>> 1084 stale+active+undersized >>>>>>>> 547 activating >>>>>>>> 201 stale+active+undersized+degraded >>>>>>>> 158 stale+activating >>>>>>>> 96 activating+degraded >>>>>>>> 46 stale+active+clean >>>>>>>> 42 activating+remapped >>>>>>>> 34 stale+activating+degraded >>>>>>>> 23 stale+activating+remapped >>>>>>>> 6 stale+activating+undersized+degraded+remapped >>>>>>>> 6 activating+undersized+degraded+remapped >>>>>>>> 2 activating+degraded+remapped >>>>>>>> 2 active+clean+inconsistent >>>>>>>> 1 stale+activating+degraded+remapped >>>>>>>> 1 stale+active+clean+remapped >>>>>>>> 1 stale+remapped >>>>>>>> 1 down+remapped >>>>>>>> 1 remapped+peering >>>>>>>> >>>>>>>> io: >>>>>>>> client: 0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr >>>>>>>> >>>>>>>> Thanks >>>>>>>> -- >>>>>>>> Arun Poonia >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> ceph-users mailing list >>>>>>>> ceph-users@lists.ceph.com >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Arun Poonia >>>>>> >>>>>> _______________________________________________ >>>>>> ceph-users mailing list >>>>>> ceph-users@lists.ceph.com >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>> >>>>> _______________________________________________ >>>>> ceph-users mailing list >>>>> ceph-users@lists.ceph.com >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>> >>>> >>>> >>>> -- >>>> Arun Poonia >>>> >>>> _______________________________________________ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> >> >> -- >> Arun Poonia >> >> > > -- > Arun Poonia > > -- Arun Poonia
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com