Hello!

We are in the process of expanding our Ceph cluster (both adding OSD hosts and replacing smaller HDDs on our existing hosts). So far we have gone host by host, removing the old OSDs, swapping the physical HDDs, and re-adding them. This process has gone smoothly, aside from one issue: upon any action taken on the cluster (adding new OSDs, replacing old ones, etc.), we have PGs get stuck "activating", which causes around 3.5% of PGs to go inactive and stops client IO.
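
(For reference, a rough sketch of how the stuck PGs can be listed and inspected; <pgid> below is just a placeholder, not one of our actual PGs:)

    # List the PGs that are stuck inactive (the "activating" ones show up here):
    ceph health detail
    ceph pg dump_stuck inactive

    # Query an individual stuck PG to see where peering/activation is blocked:
    ceph pg <pgid> query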

Here is the current output of ceph -s:

cluster:
    id:     e8ffe2eb-f8fc-4110-a4bc-1715e878fb7b
    health: HEALTH_WARN
            Reduced data availability: 166 pgs inactive
            Degraded data redundancy: 137153907/3658405707 objects degraded (3.749%), 930 pgs degraded, 928 pgs undersized
            10 pgs not deep-scrubbed in time
            33709 slow ops, oldest one blocked for 35956 sec, daemons [osd.103,osd.104,osd.105,osd.106,osd.107,osd.109,osd.111,osd.112,osd.113,osd.114]... have slow ops.

  services:
    mon: 3 daemons, quorum lb3,lb2,lb1 (age 8w)
    mgr: lb1(active, since 6w), standbys: lb3, lb2
    osd: 117 osds: 117 up (since 15m), 117 in (since 10h); 2033 remapped pgs
    rgw: 3 daemons active (lb1.rgw0, lb2.rgw0, lb3.rgw0)

  task status:

  data:
    pools:   8 pools, 5793 pgs
    objects: 609.74M objects, 169 TiB
    usage:   308 TiB used, 430 TiB / 738 TiB avail
    pgs:     2.866% pgs not active
             137153907/3658405707 objects degraded (3.749%)
             262215404/3658405707 objects misplaced (7.167%)
             3754 active+clean
             963  active+remapped+backfill_wait
             892  active+undersized+degraded+remapped+backfill_wait
             136  activating+remapped
             27   activating+undersized+degraded+remapped
             8    active+undersized+degraded+remapped+backfilling
             6    active+clean+scrubbing+deep
             3    activating+degraded+remapped
             3    active+remapped+backfilling
             1    active+undersized+remapped+backfill_wait

  io:
    client:   94 KiB/s rd, 94 op/s rd, 0 op/s wr
    recovery: 112 MiB/s, 372 objects/s

  progress:
    Rebalancing after osd.20 marked in (10h)
      [............................] (remaining: 11d)
    Rebalancing after osd.41 marked in (10h)
      [=...........................] (remaining: 8d)
    Rebalancing after osd.30 marked in (10h)
      [=...........................] (remaining: 9d)
    Rebalancing after osd.1 marked in (10h)
      [=======.....................] (remaining: 2h)
    Rebalancing after osd.10 marked in (10h)
      [............................] (remaining: 12d)
    Rebalancing after osd.50 marked in (10h)
      [............................] (remaining: 2w)
    Rebalancing after osd.71 marked out (10h)
      [==..........................] (remaining: 5d)

What you may find interesting are the "slow ops" warnings; these OSDs are where our inactive PGs end up stuck. Once the cluster gets into this state, I'm usually able to recover IO by restarting the OSDs reporting slow ops. However, what's extremely strange is that this workaround only works about 12 hours after the last OSD addition; restarting the slow-ops OSDs any earlier just results in the slow ops returning immediately.
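
(The workaround itself is nothing fancy; assuming the standard ceph-osd@<id> systemd units from a ceph-ansible deployment, it is roughly:)

    # Restart the OSDs currently reporting slow ops (ids taken from the health warning):
    systemctl restart ceph-osd@103
    systemctl restart ceph-osd@104

    # ...then watch whether the slow ops and inactive PGs clear:
    ceph -s
    ceph health detail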

Our first thought was hardware issues; however, we ruled this out after the slow ops warnings appeared on brand new HDDs and OSD hosts. Monitoring the IO saturation of the OSDs reporting slow ops shows actual usage nowhere near saturation, and no hardware issues are present on the drives themselves.
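
(By "monitoring the IO saturation" I just mean plain device-level statistics on the OSD hosts, e.g. sysstat's iostat, nothing Ceph-specific:)

    # %util and await stay low on the devices backing the slow-ops OSDs:
    iostat -x 5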

Looking at the journalctl logs of one of the affected OSDs above, we see the following repeated multiple times:

osd.103 56934 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.467952.0:1520304537 8.6fbs0 8.1e6826fb (undecoded) ondisk+retry+write+known_if_redirected e56923
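
(To get more detail than the journal gives, the blocked requests can also be dumped from the OSD's admin socket on the host carrying it; a sketch:)

    # Ops currently stuck in flight on osd.103, including the flag_point/events
    # each request is waiting on:
    ceph daemon osd.103 dump_ops_in_flight

    # Recently completed slow ops, handy for comparison after a restart:
    ceph daemon osd.103 dump_historic_ops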

So far my procedure for the disk swaps has been as follows (the flag commands for steps 1 and 4 are sketched after the list):

1. Set noout, norebalance, and norecover on the cluster.
2. Use ceph-ansible to remove the OSD IDs of the old disks.
3. Swap the physical HDDs and re-add them with ceph-ansible.
4. Unset noout, norebalance, and norecover.
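
(For steps 1 and 4 these are just the standard flag toggles:)

    # Step 1: hold off data movement while the disks are being swapped
    ceph osd set noout
    ceph osd set norebalance
    ceph osd set norecover

    # Step 4: let backfill/recovery resume once the new OSDs are back up and in
    ceph osd unset noout
    ceph osd unset norebalance
    ceph osd unset norecover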

I should note this issue appears even with simple OSD additions (no removals involved): we added 2 brand new hosts to the cluster and saw the same issue.

I've been trying to think of any possible cause of this issue. I should mention our cluster is messy at the moment hardware-wise: we have a mix of 7 TB, 4 TB, and 10 TB HDDs (we are moving to all 10 TB HDDs, but the swap process has been taking a while). One warning I've noticed during the old disk removals is "too many PGs per OSD"; however, this warning clears once the new OSDs are added, which I assume is to be expected.
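
(In case it's relevant, the per-OSD PG counts can be cross-checked against the configured limits roughly like this; the option names are from memory, so please correct me if they're off:)

    # Per-OSD PG counts are in the PGS column:
    ceph osd df tree

    # The PG-per-OSD limits the cluster enforces (defaults vary by release):
    ceph config get osd mon_max_pg_per_osd
    ceph config get osd osd_max_pg_per_osd_hard_ratio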

If anyone would be willing to provide any hints on where to look, it would be much appreciated!

Thanks for your time.
--

Justin Goetz
Systems Engineer, TeraSwitch Inc.
jgo...@teraswitch.com
412-945-7045 (NOC) | 412-459-7945 (Direct)
