This is a (harmless) bug that has existed since Mimic and will be fixed in 14.2.5 (I think?). The health error will clear up without any intervention.
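For context on what the flag means (a sketch of the intended logic, not Ceph's actual code): before starting a backfill, the OSD estimates whether the target would cross the backfillfull ratio (0.90 by default); the buggy releases could base that projection on incorrect accounting, flagging PGs toofull even with ample space. A minimal Python sketch, with the function name and simplified accounting being my own assumptions:

```python
# Hypothetical sketch of the backfillfull check (NOT Ceph's actual code).
# A backfill target is "toofull" if its projected usage after receiving
# the PG would exceed the backfillfull ratio (Ceph's default is 0.90).

GIB = 1024 ** 3

def backfill_would_be_toofull(used_bytes, total_bytes, incoming_bytes,
                              backfillfull_ratio=0.90):
    """Return True if backfilling incoming_bytes onto this OSD would
    push its usage past the backfillfull ratio."""
    return (used_bytes + incoming_bytes) / total_bytes > backfillfull_ratio

# Numbers from the status below: even the fullest OSD (osd.3, 2680G used
# of ~3725G) receiving PG 6.212 (~38 GB, its BYTES column) stays far
# below 0.90 -- so the HEALTH_ERR is spurious.
used = 2680 * GIB
total = (2680 + 1045) * GIB
incoming = 38145321727
print(backfill_would_be_toofull(used, total, incoming))  # False
```

With a correct projection none of the OSDs below comes anywhere near the threshold, which is why the error clears on its own once backfill completes.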
Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Dec 9, 2019 at 12:03 PM Eugen Block <ebl...@nde.ag> wrote:
> Hi,
>
> since we upgraded our cluster to Nautilus, we have also seen those
> messages sometimes when the cluster is rebalancing. There are several
> reports about this [1] [2]; we didn't see it in Luminous. But
> eventually the rebalancing finished and the error message cleared, so
> I'd say there's (probably) nothing to worry about if there aren't any
> other issues.
>
> Regards,
> Eugen
>
> [1] https://tracker.ceph.com/issues/39555
> [2] https://tracker.ceph.com/issues/41255
>
> Quoting Simone Lazzaris <simone.lazza...@qcom.it>:
>
> > Hi all,
> > Long story short: I have a cluster of 26 OSDs on 3 nodes (8+9+9).
> > One of the disks is showing some read errors, so I've added an OSD
> > to the faulty node (osd.26) and set the (re)weight of the faulty
> > OSD (osd.12) to zero.
> >
> > The cluster is now rebalancing, which is fine, but I now have 2 PGs
> > in "backfill_toofull" state, so the cluster health is "ERR":
> >
> >   cluster:
> >     id:     9ec27b0f-acfd-40a3-b35d-db301ac5ce8c
> >     health: HEALTH_ERR
> >             Degraded data redundancy (low space): 2 pgs backfill_toofull
> >
> >   services:
> >     mon: 3 daemons, quorum s1,s2,s3 (age 7d)
> >     mgr: s1(active, since 7d), standbys: s2, s3
> >     osd: 27 osds: 27 up (since 2h), 26 in (since 2h); 262 remapped pgs
> >     rgw: 3 daemons active (s1, s2, s3)
> >
> >   data:
> >     pools:   10 pools, 1200 pgs
> >     objects: 11.72M objects, 37 TiB
> >     usage:   57 TiB used, 42 TiB / 98 TiB avail
> >     pgs:     2618510/35167194 objects misplaced (7.446%)
> >              938 active+clean
> >              216 active+remapped+backfill_wait
> >              44  active+remapped+backfilling
> >              2   active+remapped+backfill_wait+backfill_toofull
> >
> >   io:
> >     recovery: 163 MiB/s, 50 objects/s
> >
> >   progress:
> >     Rebalancing after osd.12 marked out
> >       [=====.........................]
> >
> > As you can see, there is plenty of space, and none of my OSDs is in
> > a full or nearfull state:
> >
> > +----+------+-------+-------+--------+---------+--------+---------+-----------+
> > | id | host | used  | avail | wr ops | wr data | rd ops | rd data | state     |
> > +----+------+-------+-------+--------+---------+--------+---------+-----------+
> > |  0 | s1   | 2415G | 1310G |      0 |       0 |      0 |       0 | exists,up |
> > |  1 | s2   | 2009G | 1716G |      0 |       0 |      0 |       0 | exists,up |
> > |  2 | s3   | 2183G | 1542G |      0 |       0 |      0 |       0 | exists,up |
> > |  3 | s1   | 2680G | 1045G |      0 |       0 |      0 |       0 | exists,up |
> > |  4 | s2   | 2063G | 1662G |      0 |       0 |      0 |       0 | exists,up |
> > |  5 | s3   | 2269G | 1456G |      0 |       0 |      0 |       0 | exists,up |
> > |  6 | s1   | 2523G | 1202G |      0 |       0 |      0 |       0 | exists,up |
> > |  7 | s2   | 1973G | 1752G |      0 |       0 |      0 |       0 | exists,up |
> > |  8 | s3   | 2007G | 1718G |      0 |       0 |      1 |       0 | exists,up |
> > |  9 | s1   | 2485G | 1240G |      0 |       0 |      0 |       0 | exists,up |
> > | 10 | s2   | 2385G | 1340G |      0 |       0 |      0 |       0 | exists,up |
> > | 11 | s3   | 2079G | 1646G |      0 |       0 |      0 |       0 | exists,up |
> > | 12 | s1   | 2272G | 1453G |      0 |       0 |      0 |       0 | exists,up |
> > | 13 | s2   | 2381G | 1344G |      0 |       0 |      0 |       0 | exists,up |
> > | 14 | s3   | 1923G | 1802G |      0 |       0 |      0 |       0 | exists,up |
> > | 15 | s1   | 2617G | 1108G |      0 |       0 |      0 |       0 | exists,up |
> > | 16 | s2   | 2099G | 1626G |      0 |       0 |      0 |       0 | exists,up |
> > | 17 | s3   | 2336G | 1389G |      0 |       0 |      0 |       0 | exists,up |
> > | 18 | s1   | 2435G | 1290G |      0 |       0 |      0 |       0 | exists,up |
> > | 19 | s2   | 2198G | 1527G |      0 |       0 |      0 |       0 | exists,up |
> > | 20 | s3   | 2159G | 1566G |      0 |       0 |      0 |       0 | exists,up |
> > | 21 | s1   | 2128G | 1597G |      0 |       0 |      0 |       0 | exists,up |
> > | 22 | s3   | 2064G | 1661G |      0 |       0 |      0 |       0 | exists,up |
> > | 23 | s2   | 1943G | 1782G |      0 |       0 |      0 |       0 | exists,up |
> > | 24 | s3   | 2168G | 1557G |      0 |       0 |      0 |       0 | exists,up |
> > | 25 | s2   | 2113G | 1612G |      0 |       0 |      0 |       0 | exists,up |
> > | 26 | s1   | 68.9G | 3657G |      0 |       0 |      0 |       0 | exists,up |
> > +----+------+-------+-------+--------+---------+--------+---------+-----------+
> >
> > root@s1:~# ceph pg dump | egrep 'toofull|PG_STAT'
> > PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES
> > OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION
> > REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP
> > LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
> > 6.212 11110 0 0 22220 0 38145321727 0 0 3023 3023
> > active+remapped+backfill_wait+backfill_toofull
> > 2019-12-09 11:11:39.093042 13598'212053 13713:1179718
> > [6,19,24] 6 [13,0,24] 13
> > 13549'211985 2019-12-08 19:46:10.461113
> > 11644'211779 2019-12-06 07:37:42.864325 0
> > 6.bc 11057 0 0 22114 0 37733931136 0 0 3032 3032
> > active+remapped+backfill_wait+backfill_toofull
> > 2019-12-09 10:42:25.534277 13549'212110 13713:1229839
> > [15,25,17] 15 [19,18,17] 19
> > 13549'211983 2019-12-08 11:02:45.846031
> > 11644'211854 2019-12-06 06:22:43.565313 0
> >
> > Any hints? I'm not worried, since I expect the cluster to heal
> > itself, but this behaviour is neither clear nor logical.
> >
> > --
> > *Simone Lazzaris*
> > *Qcom S.p.A.*
> > simone.lazza...@qcom.it[1] | www.qcom.it[2]
> > *LinkedIn*[3] | *Facebook*[4]
> >
> > --------
> > [1] mailto:simone.lazza...@qcom.it
> > [2] https://www.qcom.it
> > [3] https://www.linkedin.com/company/qcom-spa
> > [4] http://www.facebook.com/qcomspa
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
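As a sanity check on the status output above: the misplaced percentage is just the misplaced count over total object instances (11.72M objects × 3 replicas ≈ 35.17M), which matches the reported 7.446%. A quick verification:

```python
# Verify the "2618510/35167194 objects misplaced (7.446%)" line from
# the cluster status: ~11.72M objects x 3 replicas ~= 35.17M instances.
misplaced = 2_618_510
total_instances = 35_167_194
pct = 100 * misplaced / total_instances
print(f"{pct:.3f}%")  # 7.446%
```

So the numbers are internally consistent; only the toofull flag itself is bogus.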