This is a (harmless) bug that has existed since Mimic and will be fixed in 14.2.5 (I think?). The health error will clear up without any intervention.
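For context on what the flag means (a sketch of the intended logic, not Ceph's actual code): before starting a backfill, the OSD estimates whether the target would cross the backfillfull ratio (0.90 by default); the buggy releases could base that projection on incorrect accounting, flagging PGs toofull even with ample space. A minimal Python sketch, with the function name and simplified accounting being my own assumptions:

```python
# Hypothetical sketch of the backfillfull check (NOT Ceph's actual code).
# A backfill target is "toofull" if its projected usage after receiving
# the PG would exceed the backfillfull ratio (Ceph's default is 0.90).

GIB = 1024 ** 3

def backfill_would_be_toofull(used_bytes, total_bytes, incoming_bytes,
                              backfillfull_ratio=0.90):
    """Return True if backfilling incoming_bytes onto this OSD would
    push its usage past the backfillfull ratio."""
    return (used_bytes + incoming_bytes) / total_bytes > backfillfull_ratio

# Numbers from the status below: even the fullest OSD (osd.3, 2680G used
# of ~3725G) receiving PG 6.212 (~38 GB, its BYTES column) stays far
# below 0.90 -- so the HEALTH_ERR is spurious.
used = 2680 * GIB
total = (2680 + 1045) * GIB
incoming = 38145321727
print(backfill_would_be_toofull(used, total, incoming))  # False
```

With a correct projection none of the OSDs below comes anywhere near the threshold, which is why the error clears on its own once backfill completes.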
Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Dec 9, 2019 at 12:03 PM Eugen Block <ebl...@nde.ag> wrote:
> Hi,
>
> since we upgraded our cluster to Nautilus, we have also seen those
> messages sometimes when the cluster is rebalancing. There are several
> reports about this [1] [2]; we didn't see it in Luminous. But
> eventually the rebalancing finished and the error message cleared, so
> I'd say there's (probably) nothing to worry about if there aren't any
> other issues.
>
> Regards,
> Eugen
>
> [1] https://tracker.ceph.com/issues/39555
> [2] https://tracker.ceph.com/issues/41255
>
> Quoting Simone Lazzaris <simone.lazza...@qcom.it>:
>
> > Hi all,
> > Long story short: I have a cluster of 26 OSDs on 3 nodes (8+9+9).
> > One of the disks is showing some read errors, so I've added an OSD
> > to the faulty node (osd.26) and set the (re)weight of the faulty
> > OSD (osd.12) to zero.
> >
> > The cluster is now rebalancing, which is fine, but I now have 2 PGs
> > in "backfill_toofull" state, so the cluster health is "ERR":
> >
> >   cluster:
> >     id:     9ec27b0f-acfd-40a3-b35d-db301ac5ce8c
> >     health: HEALTH_ERR
> >             Degraded data redundancy (low space): 2 pgs backfill_toofull
> >
> >   services:
> >     mon: 3 daemons, quorum s1,s2,s3 (age 7d)
> >     mgr: s1(active, since 7d), standbys: s2, s3
> >     osd: 27 osds: 27 up (since 2h), 26 in (since 2h); 262 remapped pgs
> >     rgw: 3 daemons active (s1, s2, s3)
> >
> >   data:
> >     pools:   10 pools, 1200 pgs
> >     objects: 11.72M objects, 37 TiB
> >     usage:   57 TiB used, 42 TiB / 98 TiB avail
> >     pgs:     2618510/35167194 objects misplaced (7.446%)
> >              938 active+clean
> >              216 active+remapped+backfill_wait
> >              44  active+remapped+backfilling
> >              2   active+remapped+backfill_wait+backfill_toofull
> >
> >   io:
> >     recovery: 163 MiB/s, 50 objects/s
> >
> >   progress:
> >     Rebalancing after osd.12 marked out
> >       [=====.........................]
> >
> > As you can see, there is plenty of space, and none of my OSDs is in
> > a full or nearfull state:
> >
> > +----+------+-------+-------+--------+---------+--------+---------+-----------+
> > | id | host | used  | avail | wr ops | wr data | rd ops | rd data | state     |
> > +----+------+-------+-------+--------+---------+--------+---------+-----------+
> > |  0 | s1   | 2415G | 1310G |      0 |       0 |      0 |       0 | exists,up |
> > |  1 | s2   | 2009G | 1716G |      0 |       0 |      0 |       0 | exists,up |
> > |  2 | s3   | 2183G | 1542G |      0 |       0 |      0 |       0 | exists,up |
> > |  3 | s1   | 2680G | 1045G |      0 |       0 |      0 |       0 | exists,up |
> > |  4 | s2   | 2063G | 1662G |      0 |       0 |      0 |       0 | exists,up |
> > |  5 | s3   | 2269G | 1456G |      0 |       0 |      0 |       0 | exists,up |
> > |  6 | s1   | 2523G | 1202G |      0 |       0 |      0 |       0 | exists,up |
> > |  7 | s2   | 1973G | 1752G |      0 |       0 |      0 |       0 | exists,up |
> > |  8 | s3   | 2007G | 1718G |      0 |       0 |      1 |       0 | exists,up |
> > |  9 | s1   | 2485G | 1240G |      0 |       0 |      0 |       0 | exists,up |
> > | 10 | s2   | 2385G | 1340G |      0 |       0 |      0 |       0 | exists,up |
> > | 11 | s3   | 2079G | 1646G |      0 |       0 |      0 |       0 | exists,up |
> > | 12 | s1   | 2272G | 1453G |      0 |       0 |      0 |       0 | exists,up |
> > | 13 | s2   | 2381G | 1344G |      0 |       0 |      0 |       0 | exists,up |
> > | 14 | s3   | 1923G | 1802G |      0 |       0 |      0 |       0 | exists,up |
> > | 15 | s1   | 2617G | 1108G |      0 |       0 |      0 |       0 | exists,up |
> > | 16 | s2   | 2099G | 1626G |      0 |       0 |      0 |       0 | exists,up |
> > | 17 | s3   | 2336G | 1389G |      0 |       0 |      0 |       0 | exists,up |
> > | 18 | s1   | 2435G | 1290G |      0 |       0 |      0 |       0 | exists,up |
> > | 19 | s2   | 2198G | 1527G |      0 |       0 |      0 |       0 | exists,up |
> > | 20 | s3   | 2159G | 1566G |      0 |       0 |      0 |       0 | exists,up |
> > | 21 | s1   | 2128G | 1597G |      0 |       0 |      0 |       0 | exists,up |
> > | 22 | s3   | 2064G | 1661G |      0 |       0 |      0 |       0 | exists,up |
> > | 23 | s2   | 1943G | 1782G |      0 |       0 |      0 |       0 | exists,up |
> > | 24 | s3   | 2168G | 1557G |      0 |       0 |      0 |       0 | exists,up |
> > | 25 | s2   | 2113G | 1612G |      0 |       0 |      0 |       0 | exists,up |
> > | 26 | s1   | 68.9G | 3657G |      0 |       0 |      0 |       0 | exists,up |
> > +----+------+-------+-------+--------+---------+--------+---------+-----------+
> >
> > root@s1:~# ceph pg dump | egrep 'toofull|PG_STAT'
> > PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES
> > OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION
> > REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP
> > LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
> > 6.212 11110 0 0 22220 0 38145321727 0 0 3023 3023
> > active+remapped+backfill_wait+backfill_toofull
> > 2019-12-09 11:11:39.093042 13598'212053 13713:1179718
> > [6,19,24] 6 [13,0,24] 13
> > 13549'211985 2019-12-08 19:46:10.461113
> > 11644'211779 2019-12-06 07:37:42.864325 0
> > 6.bc 11057 0 0 22114 0 37733931136 0 0 3032 3032
> > active+remapped+backfill_wait+backfill_toofull
> > 2019-12-09 10:42:25.534277 13549'212110 13713:1229839
> > [15,25,17] 15 [19,18,17] 19
> > 13549'211983 2019-12-08 11:02:45.846031
> > 11644'211854 2019-12-06 06:22:43.565313 0
> >
> > Any hints? I'm not worried, since I expect the cluster to heal
> > itself, but this behaviour is neither clear nor logical.
> >
> > --
> > *Simone Lazzaris*
> > *Qcom S.p.A.*
> > simone.lazza...@qcom.it[1] | www.qcom.it[2]
> > *LinkedIn*[3] | *Facebook*[4]
> >
> > --------
> > [1] mailto:simone.lazza...@qcom.it
> > [2] https://www.qcom.it
> > [3] https://www.linkedin.com/company/qcom-spa
> > [4] http://www.facebook.com/qcomspa
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
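As a sanity check on the status output above: the misplaced percentage is just the misplaced count over total object instances (11.72M objects × 3 replicas ≈ 35.17M), which matches the reported 7.446%. A quick verification:

```python
# Verify the "2618510/35167194 objects misplaced (7.446%)" line from
# the cluster status: ~11.72M objects x 3 replicas ~= 35.17M instances.
misplaced = 2_618_510
total_instances = 35_167_194
pct = 100 * misplaced / total_instances
print(f"{pct:.3f}%")  # 7.446%
```

So the numbers are internally consistent; only the toofull flag itself is bogus.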