Hi Kevin,

You are right. Raising the PG-per-OSD limit (mon_max_pg_per_osd) resolved the issue. I will probably add that setting to the /etc/ceph/ceph.conf file on the ceph mons and OSDs so it is applied again on host boot.
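For reference, this is roughly what I plan to put in the [global] section, based on the values Caspar suggested earlier in the thread (the exact numbers are my assumption until I retest on this cluster):

    [global]
    mon max pg per osd = 1000
    osd max pg per osd hard ratio = 5

After restarting the daemons (or injecting the args as before), the running values can be checked with:

    ceph daemon mon.ceph-mon01 config show | grep pg_per_osd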
Thanks Arun On Fri, Jan 4, 2019 at 3:46 PM Kevin Olbrich <k...@sv01.de> wrote: > Hi Arun, > > actually deleting was no good idea, thats why I wrote, that the OSDs > should be "out". > You have down PGs, that because the data is on OSDs that are > unavailable but known by the cluster. > This can be checked by using "ceph pg 0.5 query" (change PG name). > > Because your PG count is so much oversized, the overdose limits get > hit on every recovery on your cluster. > I had the same problem on a medium cluster when I added to many new > disks at once. > You already got this info from Caspar earlier in this thread. > > > https://ceph.com/planet/placement-groups-with-ceph-luminous-stay-in-activating-state/ > > https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/ > > The second link shows one of the config params you need to inject to > all your OSDs like this: > ceph tell osd.* injectargs --mon_max_pg_per_osd 10000 > > This might help you getting these PGs some sort of "active" > (+recovery/+degraded/+inconsistent/etc.). > > The down PGs will most likely never come back. It would bet, you will > find OSD IDs that are invalid in the acting set, meaning that > non-existent OSDs hold your data. > I had a similar problem on a test cluster with erasure code pools > where too many disks failed at the same time, you will then see > negative values as OSD IDs. > > Maybe this helps a little bit. > > Kevin > > Am Sa., 5. Jan. 2019 um 00:20 Uhr schrieb Arun POONIA > <arun.poo...@nuagenetworks.net>: > > > > Hi Kevin, > > > > I tried deleting newly added server from Ceph Cluster and looks like > Ceph is not recovering. I agree with unfound data but it doesn't say about > unfound data. It says inactive/down for PGs and I can't bring them up. > > > > > > [root@fre101 ~]# ceph health detail > > 2019-01-04 15:17:05.711641 7f27b0f31700 -1 asok(0x7f27ac0017a0) > AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to > bind the UNIX domain socket to > '/var/run/ceph-guests/ceph-client.admin.129552.139808366139728.asok': (2) > No such file or directory > > HEALTH_ERR 3 pools have many more objects per pg than average; > 523656/12393978 objects misplaced (4.225%); 6517 PGs pending on creation; > Reduced data availability: 6585 pgs inactive, 1267 pgs down, 2 pgs peering, > 2703 pgs stale; Degraded data redundancy: 86858/12393978 objects degraded > (0.701%), 717 pgs degraded, 21 pgs undersized; 99059 slow requests are > blocked > 32 sec; 4834 stuck requests are blocked > 4096 sec; too many PGs > per OSD (3003 > max 200) > > MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average > > pool glance-images objects per pg (10478) is more than 92.7257 times > cluster average (113) > > pool vms objects per pg (4722) is more than 41.7876 times cluster > average (113) > > pool volumes objects per pg (1220) is more than 10.7965 times > cluster average (113) > > OBJECT_MISPLACED 523656/12393978 objects misplaced (4.225%) > > PENDING_CREATING_PGS 6517 PGs pending on creation > > osds > [osd.0,osd.1,osd.10,osd.11,osd.12,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19,osd.2,osd.20,osd.21,osd.22,osd.23,osd.24,osd.25,osd.26,osd.27,osd.28,osd.29,osd.3,osd.30,osd.31,osd.32,osd.33,osd.34,osd.35,osd.4,osd.5,osd.6,osd.7,osd.8,osd.9] > have pending PGs. 
> > PG_AVAILABILITY Reduced data availability: 6585 pgs inactive, 1267 pgs > down, 2 pgs peering, 2703 pgs stale > > pg 10.90e is stuck inactive for 94928.999109, current state > activating, last acting [2,6] > > pg 10.913 is stuck inactive for 95094.175400, current state > activating, last acting [9,5] > > pg 10.915 is stuck inactive for 94929.184177, current state > activating, last acting [30,26] > > pg 11.907 is stuck stale for 9612.906582, current state > stale+active+clean, last acting [38,24] > > pg 11.910 is stuck stale for 11822.359237, current state stale+down, > last acting [21] > > pg 11.915 is stuck stale for 9612.906604, current state > stale+active+clean, last acting [38,31] > > pg 11.919 is stuck inactive for 95636.716568, current state > activating, last acting [25,12] > > pg 12.902 is stuck stale for 10810.497213, current state > stale+activating, last acting [36,14] > > pg 13.901 is stuck stale for 94889.512234, current state > stale+active+clean, last acting [1,31] > > pg 13.904 is stuck stale for 10745.279158, current state > stale+active+clean, last acting [37,8] > > pg 13.908 is stuck stale for 10745.279176, current state > stale+active+clean, last acting [37,19] > > pg 13.909 is stuck inactive for 95370.129659, current state > activating, last acting [34,19] > > pg 13.90e is stuck inactive for 95370.379694, current state > activating, last acting [21,20] > > pg 13.911 is stuck inactive for 98449.317873, current state > activating, last acting [25,22] > > pg 13.914 is stuck stale for 11827.503651, current state stale+down, > last acting [29] > > pg 13.917 is stuck inactive for 94564.811121, current state > activating, last acting [16,12] > > pg 14.901 is stuck inactive for 94929.006707, current state > activating+degraded, last acting [22,8] > > pg 14.910 is stuck inactive for 94929.046256, current state > activating+degraded, last acting [17,2] > > pg 14.912 is stuck inactive for 10831.758524, current state > activating, last acting [18,2] > > pg 14.915 is stuck inactive for 94929.001390, current state > activating, last acting [34,23] > > pg 15.90c is stuck inactive for 93957.371333, current state > activating, last acting [29,10] > > pg 15.90d is stuck inactive for 94929.145438, current state > activating, last acting [5,31] > > pg 15.913 is stuck stale for 10745.279197, current state > stale+active+clean, last acting [37,12] > > pg 15.915 is stuck stale for 12343.606595, current state stale+down, > last acting [0] > > pg 15.91c is stuck stale for 10650.058945, current state stale+down, > last acting [12] > > pg 16.90e is stuck inactive for 94929.240626, current state > activating, last acting [14,2] > > pg 16.919 is stuck inactive for 94564.771129, current state > activating, last acting [20,4] > > pg 16.91e is stuck inactive for 94960.007104, current state > activating, last acting [22,12] > > pg 17.908 is stuck inactive for 12250.346380, current state > activating, last acting [27,18] > > pg 17.90b is stuck inactive for 11714.951268, current state > activating, last acting [12,25] > > pg 17.910 is stuck inactive for 94564.819149, current state > activating, last acting [26,16] > > pg 17.913 is stuck inactive for 95370.177309, current state > activating, last acting [13,31] > > pg 17.91f is stuck inactive for 95147.032346, current state > activating, last acting [6,18] > > pg 18.908 is stuck inactive for 95370.185260, current state > activating, last acting [10,2] > > pg 18.911 is stuck inactive for 95379.637224, current state > activating, last acting [34,9] > 
> pg 18.91e is stuck inactive for 95370.148283, current state > activating, last acting [0,34] > > pg 19.90e is stuck inactive for 10229.611524, current state > activating, last acting [18,0] > > pg 19.90f is stuck stale for 9612.906611, current state > stale+active+clean, last acting [38,18] > > pg 19.912 is stuck stale for 10745.279169, current state > stale+active+clean, last acting [37,29] > > pg 19.915 is stuck stale for 10810.497226, current state > stale+active+clean, last acting [36,13] > > pg 20.90f is stuck stale for 10810.497234, current state > stale+active+clean, last acting [36,26] > > pg 20.913 is stuck inactive for 94959.946347, current state > activating+degraded, last acting [11,0] > > pg 20.91d is stuck inactive for 94959.860315, current state > activating+degraded, last acting [10,16] > > pg 21.907 is stuck inactive for 94959.824457, current state > activating, last acting [20,0] > > pg 21.90e is stuck inactive for 94929.024503, current state > activating, last acting [1,27] > > pg 21.917 is stuck inactive for 94959.909019, current state > activating, last acting [15,2] > > pg 21.918 is stuck inactive for 10655.096673, current state > activating, last acting [35,9] > > pg 22.90b is stuck inactive for 95370.255015, current state > activating, last acting [20,26] > > pg 22.90c is stuck inactive for 94564.757145, current state > activating, last acting [20,14] > > pg 22.90f is stuck stale for 9612.906599, current state > stale+activating, last acting [38,35] > > pg 22.912 is stuck inactive for 11370.195675, current state > activating, last acting [30,15] > > PG_DEGRADED Degraded data redundancy: 86858/12393978 objects degraded > (0.701%), 717 pgs degraded, 21 pgs undersized > > pg 14.804 is activating+degraded, acting [6,30] > > pg 14.834 is activating+degraded, acting [15,8] > > pg 14.843 is activating+degraded, acting [7,25] > > pg 14.85f is activating+degraded, acting [25,11] > > pg 14.865 is activating+degraded, acting [33,25] > > pg 14.87a is activating+degraded, acting [28,6] > > pg 14.882 is activating+degraded, acting [4,21] > > pg 14.893 is activating+degraded, acting [24,17] > > pg 14.89c is activating+degraded, acting [14,21] > > pg 14.89e is activating+degraded, acting [15,28] > > pg 14.8ad is activating+degraded, acting [30,3] > > pg 14.8b1 is activating+degraded, acting [30,2] > > pg 14.8b4 is activating+degraded, acting [11,18] > > pg 14.8b7 is activating+degraded, acting [7,16] > > pg 14.8e2 is activating+degraded, acting [20,30] > > pg 14.8ec is activating+degraded, acting [25,21] > > pg 14.8ef is activating+degraded, acting [9,31] > > pg 14.8f9 is activating+degraded, acting [27,21] > > pg 14.901 is activating+degraded, acting [22,8] > > pg 14.910 is activating+degraded, acting [17,2] > > pg 20.808 is activating+degraded, acting [20,12] > > pg 20.825 is activating+degraded, acting [25,35] > > pg 20.827 is activating+degraded, acting [23,16] > > pg 20.829 is activating+degraded, acting [20,31] > > pg 20.837 is activating+degraded, acting [31,6] > > pg 20.83c is activating+degraded, acting [26,17] > > pg 20.85e is activating+degraded, acting [4,27] > > pg 20.85f is activating+degraded, acting [1,25] > > pg 20.865 is activating+degraded, acting [8,33] > > pg 20.88b is activating+degraded, acting [6,32] > > pg 20.895 is stale+activating+degraded, acting [37,27] > > pg 20.89c is activating+degraded, acting [1,24] > > pg 20.8a3 is activating+degraded, acting [30,1] > > pg 20.8ad is activating+degraded, acting [1,20] > > pg 20.8af is activating+degraded, acting 
[33,31] > > pg 20.8b4 is activating+degraded, acting [9,1] > > pg 20.8b7 is activating+degraded, acting [0,33] > > pg 20.8b9 is activating+degraded, acting [20,24] > > pg 20.8c5 is activating+degraded, acting [27,14] > > pg 20.8d1 is activating+degraded, acting [10,7] > > pg 20.8d4 is activating+degraded, acting [28,21] > > pg 20.8d5 is activating+degraded, acting [24,15] > > pg 20.8e0 is activating+degraded, acting [18,0] > > pg 20.8e2 is activating+degraded, acting [25,7] > > pg 20.8ea is activating+degraded, acting [17,21] > > pg 20.8f1 is activating+degraded, acting [15,11] > > pg 20.8fb is activating+degraded, acting [10,24] > > pg 20.8fc is activating+degraded, acting [20,15] > > pg 20.8ff is activating+degraded, acting [18,25] > > pg 20.913 is activating+degraded, acting [11,0] > > pg 20.91d is activating+degraded, acting [10,16] > > REQUEST_SLOW 99059 slow requests are blocked > 32 sec > > 24235 ops are blocked > 2097.15 sec > > 17029 ops are blocked > 1048.58 sec > > 54122 ops are blocked > 524.288 sec > > 2311 ops are blocked > 262.144 sec > > 767 ops are blocked > 131.072 sec > > 396 ops are blocked > 65.536 sec > > 199 ops are blocked > 32.768 sec > > osd.32 has blocked requests > 262.144 sec > > osds 5,8,12,26,28 have blocked requests > 524.288 sec > > osds 1,3,9,10 have blocked requests > 1048.58 sec > > osds 2,14,18,19,20,23,24,25,27,29,30,31,33,34,35 have blocked > requests > 2097.15 sec > > REQUEST_STUCK 4834 stuck requests are blocked > 4096 sec > > 4834 ops are blocked > 4194.3 sec > > osds 0,4,11,13,17,21,22 have stuck requests > 4194.3 sec > > TOO_MANY_PGS too many PGs per OSD (3003 > max 200) > > [root@fre101 ~]# > > > > [root@fre101 ~]# ceph -s > > 2019-01-04 15:18:53.398950 7fc372c94700 -1 asok(0x7fc36c0017a0) > AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to > bind the UNIX domain socket to > '/var/run/ceph-guests/ceph-client.admin.130425.140477307296080.asok': (2) > No such file or directory > > cluster: > > id: adb9ad8e-f458-4124-bf58-7963a8d1391f > > health: HEALTH_ERR > > 3 pools have many more objects per pg than average > > 523656/12393978 objects misplaced (4.225%) > > 6523 PGs pending on creation > > Reduced data availability: 6584 pgs inactive, 1267 pgs down, > 2 pgs peering, 2696 pgs stale > > Degraded data redundancy: 86858/12393978 objects degraded > (0.701%), 717 pgs degraded, 21 pgs undersized > > 107622 slow requests are blocked > 32 sec > > 4957 stuck requests are blocked > 4096 sec > > too many PGs per OSD (3003 > max 200) > > > > services: > > mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 > > mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02 > > osd: 39 osds: 39 up, 36 in; 85 remapped pgs > > rgw: 1 daemon active > > > > data: > > pools: 18 pools, 54656 pgs > > objects: 6051k objects, 10947 GB > > usage: 21971 GB used, 50650 GB / 72622 GB avail > > pgs: 0.002% pgs unknown > > 12.046% pgs not active > > 86858/12393978 objects degraded (0.701%) > > 523656/12393978 objects misplaced (4.225%) > > 46743 active+clean > > 4342 activating > > 1317 stale+active+clean > > 1151 stale+down > > 667 activating+degraded > > 159 stale+activating > > 116 down > > 77 activating+remapped > > 34 stale+activating+degraded > > 21 stale+activating+remapped > > 9 stale+active+undersized > > 6 stale+activating+undersized+degraded+remapped > > 5 activating+undersized+degraded+remapped > > 3 activating+degraded+remapped > > 2 stale+remapped+peering > > 1 stale+activating+degraded+remapped > > 1 stale+active+undersized+degraded > 
> 1 stale+active+clean+remapped > > 1 unknown > > > > io: > > client: 0 B/s rd, 33213 B/s wr, 5 op/s rd, 5 op/s wr > > recovery: 437 kB/s, 0 objects/s > > > > > > Are there any other suggestion besides force deleting PGs (like 6000 + ) > > > > Thanks > > Arun > > > > On Fri, Jan 4, 2019 at 11:55 AM Kevin Olbrich <k...@sv01.de> wrote: > >> > >> I don't think this will help you. Unfound means, the cluster is unable > >> to find the data anywhere (it's lost). > >> It would be sufficient to shut down the new host - the OSDs will then > be out. > >> > >> You can also force-heal the cluster, something like "do your best > possible": > >> > >> ceph pg 2.5 mark_unfound_lost revert|delete > >> > >> Src: > http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/ > >> > >> Kevin > >> > >> Am Fr., 4. Jan. 2019 um 20:47 Uhr schrieb Arun POONIA > >> <arun.poo...@nuagenetworks.net>: > >> > > >> > Hi Kevin, > >> > > >> > Can I remove newly added server from Cluster and see if it heals > cluster ? > >> > > >> > When I check Hard Disk Iops on new server which are very low compared > to existing cluster server. > >> > > >> > Indeed this is a critical cluster but I don't have expertise to make > it flawless. > >> > > >> > Thanks > >> > Arun > >> > > >> > On Fri, Jan 4, 2019 at 11:35 AM Kevin Olbrich <k...@sv01.de> wrote: > >> >> > >> >> If you realy created and destroyed OSDs before the cluster healed > >> >> itself, this data will be permanently lost (not found / inactive). > >> >> Also your PG count is so much oversized, the calculation for peering > >> >> will most likely break because this was never tested. > >> >> > >> >> If this is a critical cluster, I would start a new one and bring back > >> >> the backups (using a better PG count). > >> >> > >> >> Kevin > >> >> > >> >> Am Fr., 4. Jan. 2019 um 20:25 Uhr schrieb Arun POONIA > >> >> <arun.poo...@nuagenetworks.net>: > >> >> > > >> >> > Can anyone comment on this issue please, I can't seem to bring my > cluster healthy. > >> >> > > >> >> > On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA < > arun.poo...@nuagenetworks.net> wrote: > >> >> >> > >> >> >> Hi Caspar, > >> >> >> > >> >> >> Number of IOPs are also quite low. It used be around 1K Plus on > one of Pool (VMs) now its like close to 10-30 . > >> >> >> > >> >> >> Thansk > >> >> >> Arun > >> >> >> > >> >> >> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA < > arun.poo...@nuagenetworks.net> wrote: > >> >> >>> > >> >> >>> Hi Caspar, > >> >> >>> > >> >> >>> Yes and No, numbers are going up and down. If I run ceph -s > command I can see it decreases one time and later it increases again. I see > there are so many blocked/slow requests. Almost all the OSDs have slow > requests. Around 12% PGs are inactive not sure how to activate them again. 
> >> >> >>> > >> >> >>> > >> >> >>> [root@fre101 ~]# ceph health detail > >> >> >>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0) > AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to > bind the UNIX domain socket to > '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2) > No such file or directory > >> >> >>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg > than average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending > on creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down, > 86 pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 > objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow > requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec; > too many PGs per OSD (2709 > max 200) > >> >> >>> OSD_DOWN 1 osds down > >> >> >>> osd.28 (root=default,host=fre119) is down > >> >> >>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than > average > >> >> >>> pool glance-images objects per pg (10478) is more than > 92.7257 times cluster average (113) > >> >> >>> pool vms objects per pg (4717) is more than 41.7434 times > cluster average (113) > >> >> >>> pool volumes objects per pg (1220) is more than 10.7965 > times cluster average (113) > >> >> >>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%) > >> >> >>> PENDING_CREATING_PGS 3610 PGs pending on creation > >> >> >>> osds > [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9] > have pending PGs. > >> >> >>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, > 1882 pgs down, 86 pgs peering, 850 pgs stale > >> >> >>> pg 10.900 is down, acting [18] > >> >> >>> pg 10.90e is stuck inactive for 60266.030164, current state > activating, last acting [2,38] > >> >> >>> pg 10.913 is stuck stale for 1887.552862, current state > stale+down, last acting [9] > >> >> >>> pg 10.915 is stuck inactive for 60266.215231, current state > activating, last acting [30,38] > >> >> >>> pg 11.903 is stuck inactive for 59294.465961, current state > activating, last acting [11,38] > >> >> >>> pg 11.910 is down, acting [21] > >> >> >>> pg 11.919 is down, acting [25] > >> >> >>> pg 12.902 is stuck inactive for 57118.544590, current state > activating, last acting [36,14] > >> >> >>> pg 13.8f8 is stuck inactive for 60707.167787, current state > activating, last acting [29,37] > >> >> >>> pg 13.901 is stuck stale for 60226.543289, current state > stale+active+clean, last acting [1,31] > >> >> >>> pg 13.905 is stuck inactive for 60266.050940, current state > activating, last acting [2,36] > >> >> >>> pg 13.909 is stuck inactive for 60707.160714, current state > activating, last acting [34,36] > >> >> >>> pg 13.90e is stuck inactive for 60707.410749, current state > activating, last acting [21,36] > >> >> >>> pg 13.911 is down, acting [25] > >> >> >>> pg 13.914 is stale+down, acting [29] > >> >> >>> pg 13.917 is stuck stale for 580.224688, current state > stale+down, last acting [16] > >> >> >>> pg 14.901 is stuck inactive for 60266.037762, current state > activating+degraded, last acting [22,37] > >> >> >>> pg 14.90f is stuck inactive for 60296.996447, current state > activating, last acting [30,36] > >> >> >>> pg 14.910 is stuck inactive for 60266.077310, current state > activating+degraded, last acting [17,37] > >> >> >>> pg 
14.915 is stuck inactive for 60266.032445, current state > activating, last acting [34,36] > >> >> >>> pg 15.8fa is stuck stale for 560.223249, current state > stale+down, last acting [8] > >> >> >>> pg 15.90c is stuck inactive for 59294.402388, current state > activating, last acting [29,38] > >> >> >>> pg 15.90d is stuck inactive for 60266.176492, current state > activating, last acting [5,36] > >> >> >>> pg 15.915 is down, acting [0] > >> >> >>> pg 15.917 is stuck inactive for 56279.658951, current state > activating, last acting [13,38] > >> >> >>> pg 15.91c is stuck stale for 374.590704, current state > stale+down, last acting [12] > >> >> >>> pg 16.903 is stuck inactive for 56580.905961, current state > activating, last acting [25,37] > >> >> >>> pg 16.90e is stuck inactive for 60266.271680, current state > activating, last acting [14,37] > >> >> >>> pg 16.919 is stuck inactive for 59901.802184, current state > activating, last acting [20,37] > >> >> >>> pg 16.91e is stuck inactive for 60297.038159, current state > activating, last acting [22,37] > >> >> >>> pg 17.8e5 is stuck inactive for 60266.149061, current state > activating, last acting [25,36] > >> >> >>> pg 17.910 is stuck inactive for 59901.850204, current state > activating, last acting [26,37] > >> >> >>> pg 17.913 is stuck inactive for 60707.208364, current state > activating, last acting [13,36] > >> >> >>> pg 17.91a is stuck inactive for 60266.187509, current state > activating, last acting [4,37] > >> >> >>> pg 17.91f is down, acting [6] > >> >> >>> pg 18.908 is stuck inactive for 60707.216314, current state > activating, last acting [10,36] > >> >> >>> pg 18.911 is stuck stale for 244.570413, current state > stale+down, last acting [34] > >> >> >>> pg 18.919 is stuck inactive for 60265.980816, current state > activating, last acting [28,36] > >> >> >>> pg 18.91a is stuck inactive for 59901.814714, current state > activating, last acting [28,37] > >> >> >>> pg 18.91e is stuck inactive for 60707.179338, current state > activating, last acting [0,36] > >> >> >>> pg 19.90a is stuck inactive for 60203.089988, current state > activating, last acting [35,38] > >> >> >>> pg 20.8e0 is stuck inactive for 60296.839098, current state > activating+degraded, last acting [18,37] > >> >> >>> pg 20.913 is stuck inactive for 60296.977401, current state > activating+degraded, last acting [11,37] > >> >> >>> pg 20.91d is stuck inactive for 60296.891370, current state > activating+degraded, last acting [10,38] > >> >> >>> pg 21.8e1 is stuck inactive for 60707.422330, current state > activating, last acting [21,38] > >> >> >>> pg 21.907 is stuck inactive for 60296.855511, current state > activating, last acting [20,36] > >> >> >>> pg 21.90e is stuck inactive for 60266.055557, current state > activating, last acting [1,38] > >> >> >>> pg 21.917 is stuck inactive for 60296.940074, current state > activating, last acting [15,36] > >> >> >>> pg 22.90b is stuck inactive for 60707.286070, current state > activating, last acting [20,36] > >> >> >>> pg 22.90c is stuck inactive for 59901.788199, current state > activating, last acting [20,37] > >> >> >>> pg 22.90f is stuck inactive for 60297.062020, current state > activating, last acting [38,35] > >> >> >>> PG_DEGRADED Degraded data redundancy: 216694/12392654 objects > degraded (1.749%), 866 pgs degraded, 16 pgs undersized > >> >> >>> pg 12.85a is active+undersized+degraded, acting [3] > >> >> >>> pg 14.843 is activating+degraded, acting [7,38] > >> >> >>> pg 14.85f is activating+degraded, acting 
[25,36] > >> >> >>> pg 14.865 is activating+degraded, acting [33,37] > >> >> >>> pg 14.87a is activating+degraded, acting [28,36] > >> >> >>> pg 14.87e is activating+degraded, acting [17,38] > >> >> >>> pg 14.882 is activating+degraded, acting [4,36] > >> >> >>> pg 14.88a is activating+degraded, acting [2,37] > >> >> >>> pg 14.893 is activating+degraded, acting [24,36] > >> >> >>> pg 14.897 is active+undersized+degraded, acting [34] > >> >> >>> pg 14.89c is activating+degraded, acting [14,38] > >> >> >>> pg 14.89e is activating+degraded, acting [15,38] > >> >> >>> pg 14.8a8 is active+undersized+degraded, acting [33] > >> >> >>> pg 14.8b1 is activating+degraded, acting [30,38] > >> >> >>> pg 14.8d4 is active+undersized+degraded, acting [13] > >> >> >>> pg 14.8d8 is active+undersized+degraded, acting [4] > >> >> >>> pg 14.8e6 is active+undersized+degraded, acting [10] > >> >> >>> pg 14.8e7 is active+undersized+degraded, acting [1] > >> >> >>> pg 14.8ef is activating+degraded, acting [9,36] > >> >> >>> pg 14.8f8 is active+undersized+degraded, acting [30] > >> >> >>> pg 14.901 is activating+degraded, acting [22,37] > >> >> >>> pg 14.910 is activating+degraded, acting [17,37] > >> >> >>> pg 14.913 is active+undersized+degraded, acting [18] > >> >> >>> pg 20.821 is activating+degraded, acting [37,33] > >> >> >>> pg 20.825 is activating+degraded, acting [25,36] > >> >> >>> pg 20.84f is active+undersized+degraded, acting [2] > >> >> >>> pg 20.85a is active+undersized+degraded, acting [11] > >> >> >>> pg 20.85f is activating+degraded, acting [1,38] > >> >> >>> pg 20.865 is activating+degraded, acting [8,38] > >> >> >>> pg 20.869 is activating+degraded, acting [27,37] > >> >> >>> pg 20.87b is active+undersized+degraded, acting [30] > >> >> >>> pg 20.88b is activating+degraded, acting [6,38] > >> >> >>> pg 20.895 is activating+degraded, acting [37,27] > >> >> >>> pg 20.89c is activating+degraded, acting [1,36] > >> >> >>> pg 20.8a3 is activating+degraded, acting [30,36] > >> >> >>> pg 20.8ad is activating+degraded, acting [1,38] > >> >> >>> pg 20.8af is activating+degraded, acting [33,37] > >> >> >>> pg 20.8b7 is activating+degraded, acting [0,38] > >> >> >>> pg 20.8b9 is activating+degraded, acting [20,38] > >> >> >>> pg 20.8d4 is activating+degraded, acting [28,37] > >> >> >>> pg 20.8d5 is activating+degraded, acting [24,37] > >> >> >>> pg 20.8e0 is activating+degraded, acting [18,37] > >> >> >>> pg 20.8e3 is activating+degraded, acting [21,38] > >> >> >>> pg 20.8ea is activating+degraded, acting [17,36] > >> >> >>> pg 20.8ee is active+undersized+degraded, acting [4] > >> >> >>> pg 20.8f2 is activating+degraded, acting [3,36] > >> >> >>> pg 20.8fb is activating+degraded, acting [10,38] > >> >> >>> pg 20.8fc is activating+degraded, acting [20,38] > >> >> >>> pg 20.913 is activating+degraded, acting [11,37] > >> >> >>> pg 20.916 is active+undersized+degraded, acting [21] > >> >> >>> pg 20.91d is activating+degraded, acting [10,38] > >> >> >>> REQUEST_SLOW 116082 slow requests are blocked > 32 sec > >> >> >>> 10619 ops are blocked > 2097.15 sec > >> >> >>> 74227 ops are blocked > 1048.58 sec > >> >> >>> 18561 ops are blocked > 524.288 sec > >> >> >>> 10862 ops are blocked > 262.144 sec > >> >> >>> 1037 ops are blocked > 131.072 sec > >> >> >>> 520 ops are blocked > 65.536 sec > >> >> >>> 256 ops are blocked > 32.768 sec > >> >> >>> osd.29 has blocked requests > 32.768 sec > >> >> >>> osd.15 has blocked requests > 262.144 sec > >> >> >>> osds 12,13,31 have blocked requests > 524.288 sec > >> >> >>> 
osds 1,8,16,19,23,25,26,33,37,38 have blocked requests > > 1048.58 sec > >> >> >>> osds 3,4,5,6,10,14,17,22,27,30,32,35,36 have blocked > requests > 2097.15 sec > >> >> >>> REQUEST_STUCK 551 stuck requests are blocked > 4096 sec > >> >> >>> 551 ops are blocked > 4194.3 sec > >> >> >>> osds 0,28 have stuck requests > 4194.3 sec > >> >> >>> TOO_MANY_PGS too many PGs per OSD (2709 > max 200) > >> >> >>> [root@fre101 ~]# > >> >> >>> [root@fre101 ~]# > >> >> >>> [root@fre101 ~]# > >> >> >>> [root@fre101 ~]# > >> >> >>> [root@fre101 ~]# > >> >> >>> [root@fre101 ~]# > >> >> >>> [root@fre101 ~]# ceph -s > >> >> >>> 2019-01-04 05:39:29.364100 7f0fb32f2700 -1 asok(0x7f0fac0017a0) > AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to > bind the UNIX domain socket to > '/var/run/ceph-guests/ceph-client.admin.1066635.139705286924624.asok': (2) > No such file or directory > >> >> >>> cluster: > >> >> >>> id: adb9ad8e-f458-4124-bf58-7963a8d1391f > >> >> >>> health: HEALTH_ERR > >> >> >>> 3 pools have many more objects per pg than average > >> >> >>> 473825/12392654 objects misplaced (3.823%) > >> >> >>> 3723 PGs pending on creation > >> >> >>> Reduced data availability: 6677 pgs inactive, 1948 > pgs down, 157 pgs peering, 850 pgs stale > >> >> >>> Degraded data redundancy: 306567/12392654 objects > degraded (2.474%), 949 pgs degraded, 16 pgs undersized > >> >> >>> 98047 slow requests are blocked > 32 sec > >> >> >>> 33 stuck requests are blocked > 4096 sec > >> >> >>> too many PGs per OSD (2690 > max 200) > >> >> >>> > >> >> >>> services: > >> >> >>> mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 > >> >> >>> mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02 > >> >> >>> osd: 39 osds: 39 up, 39 in; 76 remapped pgs > >> >> >>> rgw: 1 daemon active > >> >> >>> > >> >> >>> data: > >> >> >>> pools: 18 pools, 54656 pgs > >> >> >>> objects: 6051k objects, 10944 GB > >> >> >>> usage: 21934 GB used, 50687 GB / 72622 GB avail > >> >> >>> pgs: 13.267% pgs not active > >> >> >>> 306567/12392654 objects degraded (2.474%) > >> >> >>> 473825/12392654 objects misplaced (3.823%) > >> >> >>> 44937 active+clean > >> >> >>> 3850 activating > >> >> >>> 1936 active+undersized > >> >> >>> 1078 down > >> >> >>> 864 stale+down > >> >> >>> 597 peering > >> >> >>> 591 activating+degraded > >> >> >>> 316 active+undersized+degraded > >> >> >>> 205 stale+active+clean > >> >> >>> 133 stale+activating > >> >> >>> 67 activating+remapped > >> >> >>> 32 stale+activating+degraded > >> >> >>> 21 stale+activating+remapped > >> >> >>> 9 stale+active+undersized > >> >> >>> 6 down+remapped > >> >> >>> 5 stale+activating+undersized+degraded+remapped > >> >> >>> 2 activating+degraded+remapped > >> >> >>> 1 stale+activating+degraded+remapped > >> >> >>> 1 stale+active+undersized+degraded > >> >> >>> 1 remapped+peering > >> >> >>> 1 active+clean+remapped > >> >> >>> 1 stale+remapped+peering > >> >> >>> 1 stale+peering > >> >> >>> 1 activating+undersized+degraded+remapped > >> >> >>> > >> >> >>> io: > >> >> >>> client: 0 B/s rd, 23566 B/s wr, 0 op/s rd, 3 op/s wr > >> >> >>> > >> >> >>> Thanks > >> >> >>> > >> >> >>> Arun > >> >> >>> > >> >> >>> On Fri, Jan 4, 2019 at 5:38 AM Caspar Smit < > caspars...@supernas.eu> wrote: > >> >> >>>> > >> >> >>>> Are the numbers still decreasing? > >> >> >>>> > >> >> >>>> This one for instance: > >> >> >>>> > >> >> >>>> "3883 PGs pending on creation" > >> >> >>>> > >> >> >>>> Caspar > >> >> >>>> > >> >> >>>> > >> >> >>>> Op vr 4 jan. 
2019 om 14:23 schreef Arun POONIA < > arun.poo...@nuagenetworks.net>: > >> >> >>>>> > >> >> >>>>> Hi Caspar, > >> >> >>>>> > >> >> >>>>> Yes, cluster was working fine with number of PGs per OSD > warning up until now. I am not sure how to recover from stale down/inactive > PGs. If you happen to know about this can you let me know? > >> >> >>>>> > >> >> >>>>> Current State: > >> >> >>>>> > >> >> >>>>> [root@fre101 ~]# ceph -s > >> >> >>>>> 2019-01-04 05:22:05.942349 7f314f613700 -1 > asok(0x7f31480017a0) AdminSocketConfigObs::init: failed: > AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to > '/var/run/ceph-guests/ceph-client.admin.1053724.139849638091088.asok': (2) > No such file or directory > >> >> >>>>> cluster: > >> >> >>>>> id: adb9ad8e-f458-4124-bf58-7963a8d1391f > >> >> >>>>> health: HEALTH_ERR > >> >> >>>>> 3 pools have many more objects per pg than average > >> >> >>>>> 505714/12392650 objects misplaced (4.081%) > >> >> >>>>> 3883 PGs pending on creation > >> >> >>>>> Reduced data availability: 6519 pgs inactive, 1870 > pgs down, 1 pg peering, 886 pgs stale > >> >> >>>>> Degraded data redundancy: 42987/12392650 objects > degraded (0.347%), 634 pgs degraded, 16 pgs undersized > >> >> >>>>> 125827 slow requests are blocked > 32 sec > >> >> >>>>> 2 stuck requests are blocked > 4096 sec > >> >> >>>>> too many PGs per OSD (2758 > max 200) > >> >> >>>>> > >> >> >>>>> services: > >> >> >>>>> mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 > >> >> >>>>> mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02 > >> >> >>>>> osd: 39 osds: 39 up, 39 in; 76 remapped pgs > >> >> >>>>> rgw: 1 daemon active > >> >> >>>>> > >> >> >>>>> data: > >> >> >>>>> pools: 18 pools, 54656 pgs > >> >> >>>>> objects: 6051k objects, 10944 GB > >> >> >>>>> usage: 21933 GB used, 50688 GB / 72622 GB avail > >> >> >>>>> pgs: 11.927% pgs not active > >> >> >>>>> 42987/12392650 objects degraded (0.347%) > >> >> >>>>> 505714/12392650 objects misplaced (4.081%) > >> >> >>>>> 48080 active+clean > >> >> >>>>> 3885 activating > >> >> >>>>> 1111 down > >> >> >>>>> 759 stale+down > >> >> >>>>> 614 activating+degraded > >> >> >>>>> 74 activating+remapped > >> >> >>>>> 46 stale+active+clean > >> >> >>>>> 35 stale+activating > >> >> >>>>> 21 stale+activating+remapped > >> >> >>>>> 9 stale+active+undersized > >> >> >>>>> 9 stale+activating+degraded > >> >> >>>>> 5 > stale+activating+undersized+degraded+remapped > >> >> >>>>> 3 activating+degraded+remapped > >> >> >>>>> 1 stale+activating+degraded+remapped > >> >> >>>>> 1 stale+active+undersized+degraded > >> >> >>>>> 1 remapped+peering > >> >> >>>>> 1 active+clean+remapped > >> >> >>>>> 1 activating+undersized+degraded+remapped > >> >> >>>>> > >> >> >>>>> io: > >> >> >>>>> client: 0 B/s rd, 25397 B/s wr, 4 op/s rd, 4 op/s wr > >> >> >>>>> > >> >> >>>>> I will update number of PGs per OSD once these inactive or > stale PGs come online. I am not able to access VMs (VMs, Images) which are > using Ceph. > >> >> >>>>> > >> >> >>>>> Thanks > >> >> >>>>> Arun > >> >> >>>>> > >> >> >>>>> On Fri, Jan 4, 2019 at 4:53 AM Caspar Smit < > caspars...@supernas.eu> wrote: > >> >> >>>>>> > >> >> >>>>>> Hi Arun, > >> >> >>>>>> > >> >> >>>>>> How did you end up with a 'working' cluster with so many pgs > per OSD? 
> >> >> >>>>>> > >> >> >>>>>> "too many PGs per OSD (2968 > max 200)" > >> >> >>>>>> > >> >> >>>>>> To (temporarily) allow this kind of pgs per osd you could try > this: > >> >> >>>>>> > >> >> >>>>>> Change these values in the global section in your ceph.conf: > >> >> >>>>>> > >> >> >>>>>> mon max pg per osd = 200 > >> >> >>>>>> osd max pg per osd hard ratio = 2 > >> >> >>>>>> > >> >> >>>>>> It allows 200*2 = 400 Pgs per OSD before disabling the > creation of new pgs. > >> >> >>>>>> > >> >> >>>>>> Above are the defaults (for Luminous, maybe other versions > too) > >> >> >>>>>> You can check your current settings with: > >> >> >>>>>> > >> >> >>>>>> ceph daemon mon.ceph-mon01 config show |grep pg_per_osd > >> >> >>>>>> > >> >> >>>>>> Since your current pgs per osd ratio is way higher then the > default you could set them to for instance: > >> >> >>>>>> > >> >> >>>>>> mon max pg per osd = 1000 > >> >> >>>>>> osd max pg per osd hard ratio = 5 > >> >> >>>>>> > >> >> >>>>>> Which allow for 5000 pgs per osd before disabling creation of > new pgs. > >> >> >>>>>> > >> >> >>>>>> You'll need to inject the setting into the mons/osds and > restart mgrs to make them active. > >> >> >>>>>> > >> >> >>>>>> ceph tell mon.* injectargs ‘--mon_max_pg_per_osd 1000’ > >> >> >>>>>> ceph tell mon.* injectargs ‘--osd_max_pg_per_osd_hard_ratio 5’ > >> >> >>>>>> ceph tell osd.* injectargs ‘--mon_max_pg_per_osd 1000’ > >> >> >>>>>> ceph tell osd.* injectargs ‘--osd_max_pg_per_osd_hard_ratio 5’ > >> >> >>>>>> restart mgrs > >> >> >>>>>> > >> >> >>>>>> Kind regards, > >> >> >>>>>> Caspar > >> >> >>>>>> > >> >> >>>>>> > >> >> >>>>>> Op vr 4 jan. 2019 om 04:28 schreef Arun POONIA < > arun.poo...@nuagenetworks.net>: > >> >> >>>>>>> > >> >> >>>>>>> Hi Chris, > >> >> >>>>>>> > >> >> >>>>>>> Indeed that's what happened. I didn't set noout flag either > and I did zapped disk on new server every time. In my cluster status fre201 > is only new server. > >> >> >>>>>>> > >> >> >>>>>>> Current Status after enabling 3 OSDs on fre201 host. 
> >> >> >>>>>>> > >> >> >>>>>>> [root@fre201 ~]# ceph osd tree > >> >> >>>>>>> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF > >> >> >>>>>>> -1 70.92137 root default > >> >> >>>>>>> -2 5.45549 host fre101 > >> >> >>>>>>> 0 hdd 1.81850 osd.0 up 1.00000 1.00000 > >> >> >>>>>>> 1 hdd 1.81850 osd.1 up 1.00000 1.00000 > >> >> >>>>>>> 2 hdd 1.81850 osd.2 up 1.00000 1.00000 > >> >> >>>>>>> -9 5.45549 host fre103 > >> >> >>>>>>> 3 hdd 1.81850 osd.3 up 1.00000 1.00000 > >> >> >>>>>>> 4 hdd 1.81850 osd.4 up 1.00000 1.00000 > >> >> >>>>>>> 5 hdd 1.81850 osd.5 up 1.00000 1.00000 > >> >> >>>>>>> -3 5.45549 host fre105 > >> >> >>>>>>> 6 hdd 1.81850 osd.6 up 1.00000 1.00000 > >> >> >>>>>>> 7 hdd 1.81850 osd.7 up 1.00000 1.00000 > >> >> >>>>>>> 8 hdd 1.81850 osd.8 up 1.00000 1.00000 > >> >> >>>>>>> -4 5.45549 host fre107 > >> >> >>>>>>> 9 hdd 1.81850 osd.9 up 1.00000 1.00000 > >> >> >>>>>>> 10 hdd 1.81850 osd.10 up 1.00000 1.00000 > >> >> >>>>>>> 11 hdd 1.81850 osd.11 up 1.00000 1.00000 > >> >> >>>>>>> -5 5.45549 host fre109 > >> >> >>>>>>> 12 hdd 1.81850 osd.12 up 1.00000 1.00000 > >> >> >>>>>>> 13 hdd 1.81850 osd.13 up 1.00000 1.00000 > >> >> >>>>>>> 14 hdd 1.81850 osd.14 up 1.00000 1.00000 > >> >> >>>>>>> -6 5.45549 host fre111 > >> >> >>>>>>> 15 hdd 1.81850 osd.15 up 1.00000 1.00000 > >> >> >>>>>>> 16 hdd 1.81850 osd.16 up 1.00000 1.00000 > >> >> >>>>>>> 17 hdd 1.81850 osd.17 up 0.79999 1.00000 > >> >> >>>>>>> -7 5.45549 host fre113 > >> >> >>>>>>> 18 hdd 1.81850 osd.18 up 1.00000 1.00000 > >> >> >>>>>>> 19 hdd 1.81850 osd.19 up 1.00000 1.00000 > >> >> >>>>>>> 20 hdd 1.81850 osd.20 up 1.00000 1.00000 > >> >> >>>>>>> -8 5.45549 host fre115 > >> >> >>>>>>> 21 hdd 1.81850 osd.21 up 1.00000 1.00000 > >> >> >>>>>>> 22 hdd 1.81850 osd.22 up 1.00000 1.00000 > >> >> >>>>>>> 23 hdd 1.81850 osd.23 up 1.00000 1.00000 > >> >> >>>>>>> -10 5.45549 host fre117 > >> >> >>>>>>> 24 hdd 1.81850 osd.24 up 1.00000 1.00000 > >> >> >>>>>>> 25 hdd 1.81850 osd.25 up 1.00000 1.00000 > >> >> >>>>>>> 26 hdd 1.81850 osd.26 up 1.00000 1.00000 > >> >> >>>>>>> -11 5.45549 host fre119 > >> >> >>>>>>> 27 hdd 1.81850 osd.27 up 1.00000 1.00000 > >> >> >>>>>>> 28 hdd 1.81850 osd.28 up 1.00000 1.00000 > >> >> >>>>>>> 29 hdd 1.81850 osd.29 up 1.00000 1.00000 > >> >> >>>>>>> -12 5.45549 host fre121 > >> >> >>>>>>> 30 hdd 1.81850 osd.30 up 1.00000 1.00000 > >> >> >>>>>>> 31 hdd 1.81850 osd.31 up 1.00000 1.00000 > >> >> >>>>>>> 32 hdd 1.81850 osd.32 up 1.00000 1.00000 > >> >> >>>>>>> -13 5.45549 host fre123 > >> >> >>>>>>> 33 hdd 1.81850 osd.33 up 1.00000 1.00000 > >> >> >>>>>>> 34 hdd 1.81850 osd.34 up 1.00000 1.00000 > >> >> >>>>>>> 35 hdd 1.81850 osd.35 up 1.00000 1.00000 > >> >> >>>>>>> -27 5.45549 host fre201 > >> >> >>>>>>> 36 hdd 1.81850 osd.36 up 1.00000 1.00000 > >> >> >>>>>>> 37 hdd 1.81850 osd.37 up 1.00000 1.00000 > >> >> >>>>>>> 38 hdd 1.81850 osd.38 up 1.00000 1.00000 > >> >> >>>>>>> [root@fre201 ~]# > >> >> >>>>>>> [root@fre201 ~]# > >> >> >>>>>>> [root@fre201 ~]# > >> >> >>>>>>> [root@fre201 ~]# > >> >> >>>>>>> [root@fre201 ~]# > >> >> >>>>>>> [root@fre201 ~]# ceph -s > >> >> >>>>>>> cluster: > >> >> >>>>>>> id: adb9ad8e-f458-4124-bf58-7963a8d1391f > >> >> >>>>>>> health: HEALTH_ERR > >> >> >>>>>>> 3 pools have many more objects per pg than > average > >> >> >>>>>>> 585791/12391450 objects misplaced (4.727%) > >> >> >>>>>>> 2 scrub errors > >> >> >>>>>>> 2374 PGs pending on creation > >> >> >>>>>>> Reduced data availability: 6578 pgs inactive, > 2025 pgs down, 74 pgs peering, 1234 pgs stale > >> >> >>>>>>> Possible data 
damage: 2 pgs inconsistent > >> >> >>>>>>> Degraded data redundancy: 64969/12391450 objects > degraded (0.524%), 616 pgs degraded, 20 pgs undersized > >> >> >>>>>>> 96242 slow requests are blocked > 32 sec > >> >> >>>>>>> 228 stuck requests are blocked > 4096 sec > >> >> >>>>>>> too many PGs per OSD (2768 > max 200) > >> >> >>>>>>> > >> >> >>>>>>> services: > >> >> >>>>>>> mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 > >> >> >>>>>>> mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02 > >> >> >>>>>>> osd: 39 osds: 39 up, 39 in; 96 remapped pgs > >> >> >>>>>>> rgw: 1 daemon active > >> >> >>>>>>> > >> >> >>>>>>> data: > >> >> >>>>>>> pools: 18 pools, 54656 pgs > >> >> >>>>>>> objects: 6050k objects, 10942 GB > >> >> >>>>>>> usage: 21900 GB used, 50721 GB / 72622 GB avail > >> >> >>>>>>> pgs: 0.002% pgs unknown > >> >> >>>>>>> 12.050% pgs not active > >> >> >>>>>>> 64969/12391450 objects degraded (0.524%) > >> >> >>>>>>> 585791/12391450 objects misplaced (4.727%) > >> >> >>>>>>> 47489 active+clean > >> >> >>>>>>> 3670 activating > >> >> >>>>>>> 1098 stale+down > >> >> >>>>>>> 923 down > >> >> >>>>>>> 575 activating+degraded > >> >> >>>>>>> 563 stale+active+clean > >> >> >>>>>>> 105 stale+activating > >> >> >>>>>>> 78 activating+remapped > >> >> >>>>>>> 72 peering > >> >> >>>>>>> 25 stale+activating+degraded > >> >> >>>>>>> 23 stale+activating+remapped > >> >> >>>>>>> 9 stale+active+undersized > >> >> >>>>>>> 6 > stale+activating+undersized+degraded+remapped > >> >> >>>>>>> 5 stale+active+undersized+degraded > >> >> >>>>>>> 4 down+remapped > >> >> >>>>>>> 4 activating+degraded+remapped > >> >> >>>>>>> 2 active+clean+inconsistent > >> >> >>>>>>> 1 stale+activating+degraded+remapped > >> >> >>>>>>> 1 stale+active+clean+remapped > >> >> >>>>>>> 1 stale+remapped+peering > >> >> >>>>>>> 1 remapped+peering > >> >> >>>>>>> 1 unknown > >> >> >>>>>>> > >> >> >>>>>>> io: > >> >> >>>>>>> client: 0 B/s rd, 208 kB/s wr, 22 op/s rd, 22 op/s wr > >> >> >>>>>>> > >> >> >>>>>>> > >> >> >>>>>>> > >> >> >>>>>>> Thanks > >> >> >>>>>>> Arun > >> >> >>>>>>> > >> >> >>>>>>> > >> >> >>>>>>> On Thu, Jan 3, 2019 at 7:19 PM Chris < > bitskr...@bitskrieg.net> wrote: > >> >> >>>>>>>> > >> >> >>>>>>>> If you added OSDs and then deleted them repeatedly without > waiting for replication to finish as the cluster attempted to re-balance > across them, its highly likely that you are permanently missing PGs > (especially if the disks were zapped each time). > >> >> >>>>>>>> > >> >> >>>>>>>> If those 3 down OSDs can be revived there is a (small) > chance that you can right the ship, but 1400pg/OSD is pretty extreme. I'm > surprised the cluster even let you do that - this sounds like a data loss > event. > >> >> >>>>>>>> > >> >> >>>>>>>> Bring back the 3 OSD and see what those 2 inconsistent pgs > look like with ceph pg query. > >> >> >>>>>>>> > >> >> >>>>>>>> On January 3, 2019 21:59:38 Arun POONIA < > arun.poo...@nuagenetworks.net> wrote: > >> >> >>>>>>>>> > >> >> >>>>>>>>> Hi, > >> >> >>>>>>>>> > >> >> >>>>>>>>> Recently I tried adding a new node (OSD) to ceph cluster > using ceph-deploy tool. Since I was experimenting with tool and ended up > deleting OSD nodes on new server couple of times. > >> >> >>>>>>>>> > >> >> >>>>>>>>> Now since ceph OSDs are running on new server cluster PGs > seems to be inactive (10-15%) and they are not recovering or rebalancing. > Not sure what to do. I tried shutting down OSDs on new server. 
> >> >> >>>>>>>>> > >> >> >>>>>>>>> Status: > >> >> >>>>>>>>> [root@fre105 ~]# ceph -s > >> >> >>>>>>>>> 2019-01-03 18:56:42.867081 7fa0bf573700 -1 > asok(0x7fa0b80017a0) AdminSocketConfigObs::init: failed: > AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to > '/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2) > No such file or directory > >> >> >>>>>>>>> cluster: > >> >> >>>>>>>>> id: adb9ad8e-f458-4124-bf58-7963a8d1391f > >> >> >>>>>>>>> health: HEALTH_ERR > >> >> >>>>>>>>> 3 pools have many more objects per pg than > average > >> >> >>>>>>>>> 373907/12391198 objects misplaced (3.018%) > >> >> >>>>>>>>> 2 scrub errors > >> >> >>>>>>>>> 9677 PGs pending on creation > >> >> >>>>>>>>> Reduced data availability: 7145 pgs inactive, > 6228 pgs down, 1 pg peering, 2717 pgs stale > >> >> >>>>>>>>> Possible data damage: 2 pgs inconsistent > >> >> >>>>>>>>> Degraded data redundancy: 178350/12391198 > objects degraded (1.439%), 346 pgs degraded, 1297 pgs undersized > >> >> >>>>>>>>> 52486 slow requests are blocked > 32 sec > >> >> >>>>>>>>> 9287 stuck requests are blocked > 4096 sec > >> >> >>>>>>>>> too many PGs per OSD (2968 > max 200) > >> >> >>>>>>>>> > >> >> >>>>>>>>> services: > >> >> >>>>>>>>> mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 > >> >> >>>>>>>>> mgr: ceph-mon03(active), standbys: ceph-mon01, > ceph-mon02 > >> >> >>>>>>>>> osd: 39 osds: 36 up, 36 in; 51 remapped pgs > >> >> >>>>>>>>> rgw: 1 daemon active > >> >> >>>>>>>>> > >> >> >>>>>>>>> data: > >> >> >>>>>>>>> pools: 18 pools, 54656 pgs > >> >> >>>>>>>>> objects: 6050k objects, 10941 GB > >> >> >>>>>>>>> usage: 21727 GB used, 45308 GB / 67035 GB avail > >> >> >>>>>>>>> pgs: 13.073% pgs not active > >> >> >>>>>>>>> 178350/12391198 objects degraded (1.439%) > >> >> >>>>>>>>> 373907/12391198 objects misplaced (3.018%) > >> >> >>>>>>>>> 46177 active+clean > >> >> >>>>>>>>> 5054 down > >> >> >>>>>>>>> 1173 stale+down > >> >> >>>>>>>>> 1084 stale+active+undersized > >> >> >>>>>>>>> 547 activating > >> >> >>>>>>>>> 201 stale+active+undersized+degraded > >> >> >>>>>>>>> 158 stale+activating > >> >> >>>>>>>>> 96 activating+degraded > >> >> >>>>>>>>> 46 stale+active+clean > >> >> >>>>>>>>> 42 activating+remapped > >> >> >>>>>>>>> 34 stale+activating+degraded > >> >> >>>>>>>>> 23 stale+activating+remapped > >> >> >>>>>>>>> 6 > stale+activating+undersized+degraded+remapped > >> >> >>>>>>>>> 6 activating+undersized+degraded+remapped > >> >> >>>>>>>>> 2 activating+degraded+remapped > >> >> >>>>>>>>> 2 active+clean+inconsistent > >> >> >>>>>>>>> 1 stale+activating+degraded+remapped > >> >> >>>>>>>>> 1 stale+active+clean+remapped > >> >> >>>>>>>>> 1 stale+remapped > >> >> >>>>>>>>> 1 down+remapped > >> >> >>>>>>>>> 1 remapped+peering > >> >> >>>>>>>>> > >> >> >>>>>>>>> io: > >> >> >>>>>>>>> client: 0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr > >> >> >>>>>>>>> > >> >> >>>>>>>>> Thanks > >> >> >>>>>>>>> -- > >> >> >>>>>>>>> Arun Poonia > >> >> >>>>>>>>> > >> >> >>>>>>>>> _______________________________________________ > >> >> >>>>>>>>> ceph-users mailing list > >> >> >>>>>>>>> ceph-users@lists.ceph.com > >> >> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> >>>>>>>>> > >> >> >>>>>>>> > >> >> >>>>>>> > >> >> >>>>>>> > >> >> >>>>>>> -- > >> >> >>>>>>> Arun Poonia > >> >> >>>>>>> > >> >> >>>>>>> _______________________________________________ > >> >> >>>>>>> ceph-users mailing list > >> >> >>>>>>> ceph-users@lists.ceph.com > >> >> >>>>>>> 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> >>>>>> > >> >> >>>>>> _______________________________________________ > >> >> >>>>>> ceph-users mailing list > >> >> >>>>>> ceph-users@lists.ceph.com > >> >> >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> >>>>> > >> >> >>>>> > >> >> >>>>> > >> >> >>>>> -- > >> >> >>>>> Arun Poonia > >> >> >>>>> > >> >> >>>> _______________________________________________ > >> >> >>>> ceph-users mailing list > >> >> >>>> ceph-users@lists.ceph.com > >> >> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> >>> > >> >> >>> > >> >> >>> > >> >> >>> -- > >> >> >>> Arun Poonia > >> >> >>> > >> >> >> > >> >> >> > >> >> >> -- > >> >> >> Arun Poonia > >> >> >> > >> >> > > >> >> > > >> >> > -- > >> >> > Arun Poonia > >> >> > > >> >> > _______________________________________________ > >> >> > ceph-users mailing list > >> >> > ceph-users@lists.ceph.com > >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> > > >> > > >> > > >> > -- > >> > Arun Poonia > >> > > > > > > > > > -- > > Arun Poonia > > > -- Arun Poonia
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com