Hi all;
Long story short, I have a cluster of 26 OSDs across 3 nodes (8+9+9). One of 
the disks is showing some read errors, so I've added an OSD on the faulty node 
(osd.26) and set the (re)weight of the faulty OSD (osd.12) to zero.
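
For reference, the replacement was done roughly along these lines (the device 
path is just an example):

  # on the faulty node: create the replacement OSD, which came up as osd.26
  ceph-volume lvm create --data /dev/sdX
  # drain the faulty OSD by setting its reweight to zero
  ceph osd reweight 12 0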

The cluster is now rebalancing, which is fine, but I now have 2 PGs in the 
"backfill_toofull" state, so the cluster health is HEALTH_ERR:

  cluster:
    id:     9ec27b0f-acfd-40a3-b35d-db301ac5ce8c
    health: HEALTH_ERR
            Degraded data redundancy (low space): 2 pgs backfill_toofull
 
  services:
    mon: 3 daemons, quorum s1,s2,s3 (age 7d)
    mgr: s1(active, since 7d), standbys: s2, s3
    osd: 27 osds: 27 up (since 2h), 26 in (since 2h); 262 remapped pgs
    rgw: 3 daemons active (s1, s2, s3)
 
  data:
    pools:   10 pools, 1200 pgs
    objects: 11.72M objects, 37 TiB
    usage:   57 TiB used, 42 TiB / 98 TiB avail
    pgs:     2618510/35167194 objects misplaced (7.446%)
             938 active+clean
             216 active+remapped+backfill_wait
             44  active+remapped+backfilling
             2   active+remapped+backfill_wait+backfill_toofull
 
  io:
    recovery: 163 MiB/s, 50 objects/s
 
  progress:
    Rebalancing after osd.12 marked out
      [=====.........................]
 
As you can see, there is plenty of space and none of my OSDs is in a full or 
nearfull state:

+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0  |  s1  | 2415G | 1310G |    0   |     0   |    0   |     0   | exists,up |
| 1  |  s2  | 2009G | 1716G |    0   |     0   |    0   |     0   | exists,up |
| 2  |  s3  | 2183G | 1542G |    0   |     0   |    0   |     0   | exists,up |
| 3  |  s1  | 2680G | 1045G |    0   |     0   |    0   |     0   | exists,up |
| 4  |  s2  | 2063G | 1662G |    0   |     0   |    0   |     0   | exists,up |
| 5  |  s3  | 2269G | 1456G |    0   |     0   |    0   |     0   | exists,up |
| 6  |  s1  | 2523G | 1202G |    0   |     0   |    0   |     0   | exists,up |
| 7  |  s2  | 1973G | 1752G |    0   |     0   |    0   |     0   | exists,up |
| 8  |  s3  | 2007G | 1718G |    0   |     0   |    1   |     0   | exists,up |
| 9  |  s1  | 2485G | 1240G |    0   |     0   |    0   |     0   | exists,up |
| 10 |  s2  | 2385G | 1340G |    0   |     0   |    0   |     0   | exists,up |
| 11 |  s3  | 2079G | 1646G |    0   |     0   |    0   |     0   | exists,up |
| 12 |  s1  | 2272G | 1453G |    0   |     0   |    0   |     0   | exists,up |
| 13 |  s2  | 2381G | 1344G |    0   |     0   |    0   |     0   | exists,up |
| 14 |  s3  | 1923G | 1802G |    0   |     0   |    0   |     0   | exists,up |
| 15 |  s1  | 2617G | 1108G |    0   |     0   |    0   |     0   | exists,up |
| 16 |  s2  | 2099G | 1626G |    0   |     0   |    0   |     0   | exists,up |
| 17 |  s3  | 2336G | 1389G |    0   |     0   |    0   |     0   | exists,up |
| 18 |  s1  | 2435G | 1290G |    0   |     0   |    0   |     0   | exists,up |
| 19 |  s2  | 2198G | 1527G |    0   |     0   |    0   |     0   | exists,up |
| 20 |  s3  | 2159G | 1566G |    0   |     0   |    0   |     0   | exists,up |
| 21 |  s1  | 2128G | 1597G |    0   |     0   |    0   |     0   | exists,up |
| 22 |  s3  | 2064G | 1661G |    0   |     0   |    0   |     0   | exists,up |
| 23 |  s2  | 1943G | 1782G |    0   |     0   |    0   |     0   | exists,up |
| 24 |  s3  | 2168G | 1557G |    0   |     0   |    0   |     0   | exists,up |
| 25 |  s2  | 2113G | 1612G |    0   |     0   |    0   |     0   | exists,up |
| 26 |  s1  | 68.9G | 3657G |    0   |     0   |    0   |     0   | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
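
Just to rule out odd thresholds, I believe the full/backfillfull/nearfull 
ratios can be double-checked with:

  ceph osd dump | grep -i ratio

which, with stock settings, should report full_ratio 0.95, backfillfull_ratio 
0.9 and nearfull_ratio 0.85.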



root@s1:~# ceph pg dump|egrep 'toofull|PG_STAT'
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
6.212 11110 0 0 22220 0 38145321727 0 0 3023 3023 active+remapped+backfill_wait+backfill_toofull 2019-12-09 11:11:39.093042 13598'212053 13713:1179718 [6,19,24] 6 [13,0,24] 13 13549'211985 2019-12-08 19:46:10.461113 11644'211779 2019-12-06 07:37:42.864325 0
6.bc 11057 0 0 22114 0 37733931136 0 0 3032 3032 active+remapped+backfill_wait+backfill_toofull 2019-12-09 10:42:25.534277 13549'212110 13713:1229839 [15,25,17] 15 [19,18,17] 19 13549'211983 2019-12-08 11:02:45.846031 11644'211854 2019-12-06 06:22:43.565313 0
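
If I read the UP/ACTING sets right, the backfill targets for these two PGs 
(the OSDs in UP but not in ACTING) are osd.6/osd.19 and osd.15/osd.25, and 
none of them looks anywhere near full in the table above. Their utilisation 
can also be eyeballed with:

  ceph osd df tree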

Any hints? I'm not worried, because I think the cluster will heal itself, but 
this behaviour is neither clear nor logical to me.
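
If they don't clear up on their own as the other backfills complete, I suppose 
the backfillfull threshold could be bumped temporarily to let those two PGs 
through, e.g.:

  # temporarily raise the backfill threshold a little (default 0.90)...
  ceph osd set-backfillfull-ratio 0.92
  # ...and put it back afterwards
  ceph osd set-backfillfull-ratio 0.90

but I'd rather understand why they are flagged toofull in the first place.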

-- 
*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazza...@qcom.it[1] | www.qcom.it[2]
LinkedIn[3] | Facebook[4]



--------
[1] mailto:simone.lazza...@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa