I'm running a CephFS with an 8+2 EC data pool. The disks are on 10 hosts and the failure domain is host. The version is mimic 13.2.2. Today I added a few OSDs to one of the hosts and observed that a lot of PGs became inactive, even though 9 out of 10 hosts were up the whole time. After getting the 10th host and all its disks back up, I still ended up with a large number of undersized PGs and degraded objects, which I don't understand, as no OSD was removed.
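
For reference, this is roughly how the pool settings can be verified (pool and profile names below are placeholders, not my actual names):

    ceph osd pool get <data pool> min_size        # returns min_size: 9 in my case
    ceph osd erasure-code-profile get <profile>   # shows k=8, m=2, crush-failure-domain=host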
Here are some details about the steps taken on the host with the new disks; the main questions are summarized at the end.

- Shut down the OSDs (systemctl stop docker).
- Reboot the host (this is necessary due to OS deployment via warewulf).

After the reboot, devices got renamed and not all disks came back up (4 OSDs remained down). This is expected; I need to re-deploy the containers to adjust for the device name changes. Around this point, PGs started peering and some got stuck waiting for one of the down OSDs. I don't understand why they didn't simply remain active with 9 out of 10 disks. Up to the moment when some of the OSDs came back up, all PGs were active. With min_size=9 and 9 of the 10 hosts untouched, I would expect all PGs to remain active.

- Redeploy the docker containers.
- All disks/OSDs come up, including the 4 OSDs from above.
- The inactive PGs complete peering and become active.
- Now I have a lot of degraded objects and undersized PGs, even though not a single OSD was removed.

I don't understand why I have degraded objects; I should only have misplaced objects:

    HEALTH_ERR 22995992/145698909 objects misplaced (15.783%)
    Degraded data redundancy: 5213734/145698909 objects degraded (3.578%), 208 pgs degraded, 208 pgs undersized
    Degraded data redundancy (low space): 169 pgs backfill_toofull

Note: The backfill_toofull at low utilization (usage: 38 TiB used, 1.5 PiB / 1.5 PiB avail) is a known issue in ceph (https://tracker.ceph.com/issues/39555).

Also, I should be able to do whatever I want with 1 out of 10 hosts without losing data access. What could be the problem here?

Questions summary:

- Why does peering not succeed in keeping all PGs active with 9 out of 10 OSDs up and in?
- Why do undersized PGs arise even though all OSDs are up?
- Why do degraded objects arise even though no OSD was removed?

Thanks!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
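
P.S.: In case it helps with the diagnosis, this is roughly how I'm inspecting the affected PGs (the PG id below is a placeholder, not one of my actual PGs):

    ceph health detail              # lists the degraded/undersized/backfill_toofull PGs
    ceph pg dump_stuck undersized   # shows PGs stuck in the undersized state
    ceph pg 11.2f query             # placeholder PG id; the "up"/"acting" sets and
                                    # "recovery_state" show why a PG is not clean
    ceph osd df tree                # per-host/per-OSD utilization, relevant for backfill_toofull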