After moving the newly added OSDs out of the CRUSH tree and back in again, I
get exactly the state I want to see:
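
For reference, what I did boils down to something like the following; the OSD ID,
weight and host name here are placeholders, and the weight used for the re-add has
to match the OSD's original CRUSH weight:

  # pause data movement while shuffling the CRUSH map
  # (these are the norebalance/norecover flags visible in the status below)
  ceph osd set norebalance
  ceph osd set norecover

  # take one of the newly added OSDs out of the CRUSH tree and put it back
  # at its original location (osd.290, weight 7.28 and host ceph-21 are made up)
  ceph osd crush remove osd.290
  ceph osd crush add osd.290 7.28 host=ceph-21

  # repeat for the other new OSDs; the flags stay set until everything has re-peered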

  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_WARN
            norebalance,norecover flag(s) set
            53030026/1492404361 objects misplaced (3.553%)
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs
         flags norebalance,norecover

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     53030026/1492404361 objects misplaced (3.553%)
             2902 active+clean
             299  active+remapped+backfill_wait
             8    active+remapped+backfilling
             5    active+clean+scrubbing+deep
             1    active+clean+snaptrim

  io:
    client:   69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr

Why does a cluster with remapped PGs not survive OSD restarts without losing
track of objects?
Why does it not find the objects by itself?

A power outage affecting 3 hosts would halt everything for no reason until manual
intervention. How can I avoid this problem?
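
For context on the 3-host concern: with a 6+2 EC pool and host as the failure
domain, a PG can stay active with at most 2 shards unavailable, and only if
min_size allows it. These are the settings I would look at (the pool name is just
the one from the nearfull warning below and may not be the affected EC pool):

  # pool width and the minimum number of shards required for I/O
  ceph osd pool get sr-rbd-data-one-hdd size
  ceph osd pool get sr-rbd-data-one-hdd min_size

  # erasure-code profile (k, m, failure domain) and CRUSH rule behind the pool
  ceph osd pool get sr-rbd-data-one-hdd erasure_code_profile
  ceph osd pool get sr-rbd-data-one-hdd crush_rule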

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <fr...@dtu.dk>
Sent: 03 August 2020 15:03:05
To: ceph-users
Subject: [ceph-users] Ceph does not recover from OSD restart

Dear cephers,

I have a serious issue with degraded objects after an OSD restart. The cluster
was in a state of rebalancing after adding disks to each host. Before the restart
I had "X/Y objects misplaced"; apart from that, health was OK. I then restarted
all OSDs of one host, and the cluster does not recover from that:

  cluster:
    id:     xxx
    health: HEALTH_ERR
            45813194/1492348700 objects misplaced (3.070%)
            Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs degraded, 86 pgs undersized
            Degraded data redundancy (low space): 17 pgs backfill_toofull
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     6798138/1492348700 objects degraded (0.456%)
             45813194/1492348700 objects misplaced (3.070%)
             2903 active+clean
             209  active+remapped+backfill_wait
             73   active+undersized+degraded+remapped+backfill_wait
             9    active+remapped+backfill_wait+backfill_toofull
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             4    active+undersized+degraded+remapped+backfilling
             3    active+remapped+backfilling
             3    active+clean+scrubbing+deep
             1    active+clean+scrubbing
             1    active+undersized+remapped+backfilling
             1    active+clean+snaptrim

  io:
    client:   47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
    recovery: 195 MiB/s, 48 objects/s

After restarting, there should only be a small number of degraded objects: the
ones that received writes while the OSDs were down. What I see, however, is that
the cluster seems to have lost track of a huge number of objects; the 0.456%
degraded correspond to 1-2 days' worth of I/O. I have done reboots before and saw
at most a few thousand degraded objects. The output of ceph health detail shows a
lot of lines like these:

[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized
    pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
8...9
    pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting [159,60,180,263,237,3,2147483647,72]
    pg 11.4a is stuck undersized for 851.162862, current state active+undersized+degraded+remapped+backfill_wait, last acting [182,233,87,228,2,180,63,2147483647]
[...]
    pg 11.22e is stuck undersized for 851.162402, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
    pg 11.24 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [230,259,2147483647,1,144,159,233,146]
[...]
    pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting [84,259,183,170,85,234,233,2]
    pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [236,183,1,2147483647,2147483647,169,229,230]
    pg 11.22e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
    pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

It looks like a lot of PGs are not receiving their complete CRUSH placement, as if
peering is incomplete. This is a serious issue: it looks like the cluster will see
a total storage loss if just 2 more hosts reboot, without actually having lost any
storage. The pool in question is a 6+2 EC pool.

What is going on here? Why are the PG maps not restored to their values from
before the OSD reboot? The degraded PGs should receive the missing OSD IDs;
everything is up exactly as it was before the reboot.
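
For reference, this is a quick way to compare the up/in counts and spot any OSD
that is still marked down after the restart:

  # summary of up/in OSDs and any OSDs still flagged down in the CRUSH tree
  ceph osd stat
  ceph osd tree | grep -w down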

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io