On Mon, Oct 13, 2025 at 4:18 AM kefu chai <[email protected]> wrote:

> hi Eugen,
>
> On Mon, Oct 13, 2025 at 2:03 AM Eugen Block <[email protected]> wrote:
>
> > Hi,
> >
> > just a couple of days ago someone had the same issue:
> >
> >
> >
> https://lists.ceph.io/hyperkitty/list/[email protected]/thread/4D5QQGOKJNUITFVTZERGJXC5K3WY6FM4/#LAJSJRMXSFXFP4LUDYF5BHB4Y3MAZDSQ
> >
> > Apparently, the cluster was upgraded while a pool deletion was in
> > progress. Is that the same case here? OP of the other thread patched
> >
>
> Actually, that was my initial suspicion as well. I specifically asked them
> about this possibility, but they confirmed that no pools were deleted
> during the upgrade. Additionally, they mentioned that the system was
> experiencing relatively low load at the time since the upgrade occurred
> over the weekend.
>
> However, the puzzling aspect is that several 'ghost' PGs appeared after the
> upgrade. These weren't created due to misplacement—they seemingly
> materialized out of nowhere. And some PGs disappeared. The only plausible
> explanation is a corrupted objectstore. This scares me.
>
>
> > his OSD code to skip the check, not sure how risky that is. But I'm
> > also not sure how to get out of this situation, one idea was to delete
> > the PGs from the affected OSDs, but that can be risky as well.
> >
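(Not a recommendation, just for context: if deleting those PGs from the
affected OSDs ever becomes the chosen path, ceph-objectstore-tool can export
a copy of each orphan PG first, roughly like this with the OSD stopped; the
pgid and file path below are only examples taken from later in this thread:

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 \
      --pgid 6.1f --op export --file /root/pg-6.1f.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 \
      --pgid 6.1f --op remove --force

That at least leaves something to bring back with --op import if removing
the PG turns out to be the wrong call.)
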
> > Btw., skipping a major release is supported and has been for a long
> > time, so upgrading from O to Q is in general totally okay. But one
> > should only upgrade if the cluster is healthy (all PGs active+clean).
> >


From this story, it sounds like every single rocksdb instance in the cluster
got corrupted. And not just corrupted, but parts of them were seemingly
sent ages back in time?
1) The monitors didn’t peer, so they brought down all but one, and when that
one had a rocksdb failure they rebuilt it from the OSDs? Why didn’t they just
use the other monitors? What was preventing them from peering, anyway?
2) All the OSD rocksdb instances that failed to start.

I’d ask a lot of questions about the technology stack that is supporting
this: are they running Ceph on top of another storage technology that
might have done that? I’ve seen people running in VMware (with Rook) hit
somewhat similar issues when something goes wrong on the VMware
administration side.
Some other questions that might point to something useful:
- Are the referenced “deleted pool” PGs really present?
- Do the running OSDs actually make sense at a human level, or does their
  PG state look strange in a way that isn’t triggering crashes?
- Are the other 4 monitors still available to turn on, and what do they say
  about things? (If not, why not? The missing bits about how the OSDs were
  crashing on the last ten upgrades, and how the monitors went wrong, are
  pretty crucial to a story like this.)
- Did you look to see if the CentOS upgrade could have done something weird
  to the disk arrangement? (A quick sanity check is sketched below.)
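
A quick, read-only sanity check for that last one could look something like
this on an OSD host (generic commands, nothing specific to this cluster):

  lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
  ls -l /var/lib/ceph/osd/ceph-*/block   # do the block symlinks still point at the expected LVs?
  ceph-volume lvm list                   # does ceph-volume still report the same OSD-to-device mapping?

If the OS upgrade renamed or re-ordered devices, a mismatch usually shows up
there.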
-Greg


>
> Thanks for pointing this out. I also realized that we do have a test for a
> +2 upgrade; see
> https://github.com/ceph/ceph/tree/main/qa/suites/upgrade/reef-x.
>
>
> > Regards,
> > Eugen
> >
> > Zitat von kefu chai <[email protected]>:
> >
> > > Hello Ceph community,
> > >
> > > I'm writing on behalf of a friend who is experiencing a critical
> cluster
> > > issue after upgrading and would appreciate any assistance.
> > >
> > > Environment:
> > >
> > >    - 5 MON nodes, 2 MGR nodes, 40 OSD servers (306 OSDs total)
> > >    - OS: CentOS 8.2 upgraded to 8.4
> > >    - Ceph: 15.2.17 upgraded to 17.2.7
> > >    - Upgrade method: yum update in rolling batches
> > >
> > > Timeline: The upgrade started on October 8th at 1:00 PM. We upgraded
> > > MON/MGR servers first, and then upgraded OSD nodes in batches of 5
> nodes.
> > > The process appeared normal initially, but when approximately 10 OSD
> > > servers remained, OSDs began going down.
> > >
> > > MON Quorum Issue: When the OSDs began failing, the monitors failed to
> > > form a quorum. In an attempt to recover, we stopped 4 out of 5 monitors.
> > > However, the remaining monitor (mbjson20010) then failed to start due to
> > > a missing .ldb file. We eventually recovered this single monitor from the
> > > OSDs using the instructions at
> > > https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds
> > > so we now have only 1 MON in the cluster instead of the original 5.
> > >
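(For reference, the procedure behind that link boils down to roughly the
following; paths are placeholders and the keyring needs to contain the mon.
and client.admin keys:

  ms=/tmp/mon-store
  mkdir -p $ms
  # run against every (stopped) OSD on every host, accumulating into $ms
  for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" --no-mon-config \
        --op update-mon-db --mon-store-path "$ms"
  done
  # rebuild the store, then copy it into place on the surviving mon
  ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring

The rebuilt store only contains what could be scraped back out of the OSDs,
so it is worth keeping a copy of the old, corrupted store as well.)
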
> > > However, rebuilding the MON store did not help, and restarting the OSD
> > > servers also failed to resolve the issue. The cluster status remains
> > > problematic.
> > >
> > > Current Cluster Status:
> > >
> > >    - Only 1 MON daemon active (quorum: mbjson20010) - down from 5 MONs
> > >    - OSDs: 91 up / 229 in (out of 306 total)
> > >    - 88.872% of PGs are not active
> > >    - 4.779% of PGs are unknown
> > >    - 3,918 PGs down
> > >    - 1,311 PGs stale+down
> > >    - Only 12 PGs active+clean
> > >
> > > Critical Error: When examining OSD logs, we discovered that some OSDs
> > > are failing to start with the following error:
> > >
> > > osd.43 39677784 init missing pg_pool_t for deleted pool 9 for pg 9.3ds3;
> > > please downgrade to luminous and allow pg deletion to complete before
> > > upgrading
> > >
> > > Full error context from one of the failing OSDs:
> > >
> > > # tail  /var/log/ceph/ceph-osd.43.log
> > >
> > >     -7> 2025-10-12T13:40:05.987+0800 7fdd13259540  1
> > > bluestore(/var/lib/ceph/osd/ceph-43) _upgrade_super from 4, latest 4
> > >
> > >     -6> 2025-10-12T13:40:05.987+0800 7fdd13259540  1
> > > bluestore(/var/lib/ceph/osd/ceph-43) _upgrade_super done
> > >
> > >     -5> 2025-10-12T13:40:05.987+0800 7fdd13259540  2 osd.43 0 journal
> > looks
> > > like ssd
> > >
> > >     -4> 2025-10-12T13:40:05.987+0800 7fdd13259540  2 osd.43 0 boot
> > >
> > >     -3> 2025-10-12T13:40:05.987+0800 7fdceb2cc700  5
> > > bluestore.MempoolThread(0x55c7b0c66b40) _resize_shards cache_size:
> > > 8589934592 kv_alloc: 1717986918 kv_used: 91136 kv_onode_alloc:
> 343597383
> > > kv_onode_used: 23328 meta_alloc: 6871947673 meta_used: 2984
> data_alloc: 0
> > > data_used: 0
> > >
> > >     -2> 2025-10-12T13:40:05.989+0800 7fdd13259540 -1 osd.43 39677784
> init
> > > missing pg_pool_t for deleted pool 9 for pg 9.3ds3; please downgrade to
> > > luminous and allow pg deletion to complete before upgrading
> > >
> > >     -1> 2025-10-12T13:40:05.991+0800 7fdd13259540 -1
> > >
> >
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
> > > In function 'int OSD::init()' thread 7fdd13259540 time
> > > 2025-10-12T13:40:05.990845+0800
> > >
> > >
> >
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
> > > 3735: ceph_abort_msg("abort() called")
> > >
> > > # tail  /var/log/ceph/ceph-osd.51.log
> > >   -7> 2025-10-12T13:39:36.739+0800 7f603e5f7540  1
> > > bluestore(/var/lib/ceph/osd/ceph-51) _upgrade_super from 4, latest 4
> > >     -6> 2025-10-12T13:39:36.739+0800 7f603e5f7540  1
> > > bluestore(/var/lib/ceph/osd/ceph-51) _upgrade_super done
> > >     -5> 2025-10-12T13:39:36.739+0800 7f603e5f7540  2 osd.51 0 journal
> > looks
> > > like ssd
> > >     -4> 2025-10-12T13:39:36.739+0800 7f603e5f7540  2 osd.51 0 boot
> > >     -3> 2025-10-12T13:39:36.739+0800 7f6016669700  5
> > > bluestore.MempoolThread(0x55e839d4cb40) _resize_shards cache_size:
> > > 8589934592 kv_alloc: 1717986918 kv_used: 31232 kv_onode_alloc:
> 343597383
> > > kv_onode_used: 21584 meta_alloc: 6871947673 meta_used: 1168
> data_alloc: 0
> > > data_used: 0
> > >     -2> 2025-10-12T13:39:36.741+0800 7f603e5f7540 -1 osd.51 39677784
> init
> > > missing pg_pool_t for deleted pool 6 for pg 6.1f; please downgrade to
> > > luminous and allow pg deletion to complete before upgrading
> > >     -1> 2025-10-12T13:39:36.742+0800 7f603e5f7540 -1
> > >
> >
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
> > > In function 'int OSD::init()' thread 7f603e5f7540 time
> > > 2025-10-12T13:39:36.742527+0800
> > >
> >
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
> > > 3735: ceph_abort_msg("abort() called")
> > >
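(One thing that might be worth capturing from an affected OSD while it is
down: the osdmap epoch its superblock is sitting at, and whether pools 6/9
are really absent from that map. Something along these lines; the exact
flags may differ slightly between releases:

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 \
      --op get-osdmap --file /tmp/osdmap.51
  osdmaptool --print /tmp/osdmap.51 | head -40   # shows the epoch and the pool list

That at least pins down which map the objectstore is actually working from.)
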
> > > Investigation Findings: We examined all OSD instances that failed to
> > > start. All of them exhibit the same error pattern in their logs and all
> > > contain PG references to non-existent pools. For example, running
> > > "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 --op
> > > list-pgs" shows PG references to pools that no longer exist (e.g., pool
> > > 9, pool 10, pool 4, pool 6, pool 8), while the current pools are numbered
> > > 101, 140, 141, 149, 212, 213, 216, 217, 218, 219. Notably, each affected
> > > OSD contains only 2-3 PGs referencing these non-existent pools, which is
> > > significantly fewer than the hundreds of PGs a regular OSD typically
> > > contains.
> > >
> > > 2 PGs referencing non-existent pools were found in osd.51:
> > >
> > > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 --op list-pgs
> > > 1.0
> > > 6.1f
> > >
> > > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 --op list
> > > Error getting attr on : 1.0_head,#1:00000000::::head#, (61) No data available
> > > Error getting attr on : 6.1f_head,#6:f8000000::::head#, (61) No data available
> > > ["1.0",{"oid":"","key":"","snapid":-2,"hash":0,"max":0,"pool":1,"namespace":"","max":0}]
> > > ["1.0",{"oid":"main.db-journal.0000000000000000","key":"","snapid":-2,"hash":1969844440,"max":0,"pool":1,"namespace":"devicehealth","max":0}]
> > > ["1.0",{"oid":"main.db.0000000000000000","key":"","snapid":-2,"hash":1315310604,"max":0,"pool":1,"namespace":"devicehealth","max":0}]
> > > ["6.1f",{"oid":"","key":"","snapid":-2,"hash":31,"max":0,"pool":6,"namespace":"","max":0}]
> > >
> > > We also performed a comprehensive check by listing all PGs from all OSD
> > > nodes using "ceph-objectstore-tool --op list-pgs" and comparing the
> > > results with the output of "ceph pg dump". This comparison revealed that
> > > quite a few PGs are missing from the OSD listings. We suspect that some
> > > OSDs that previously held these missing PGs may now be corrupted, which
> > > would explain both the missing PGs and the widespread cluster
> > > degradation.
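
(For what it’s worth, that comparison can be scripted; something like the
following, where the awk filter may need adjusting for this release’s exact
"ceph pg dump" output, and where EC shard suffixes such as "s3" in the
list-pgs output are stripped before comparing:

  # what the cluster map believes exists
  ceph pg dump pgs_brief 2>/dev/null | awk '/^[0-9]+\./ {print $1}' | sort -u > /tmp/pgs.cluster
  # what each (stopped) OSD actually holds on disk
  for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" --op list-pgs
  done | sed 's/s[0-9]*$//' | sort -u > /tmp/pgs.osd
  comm -23 /tmp/pgs.cluster /tmp/pgs.osd   # PGs the cluster knows about but no OSD has
  comm -13 /tmp/pgs.cluster /tmp/pgs.osd   # PGs on disk the cluster no longer knows about

The second list is where the ghost PGs from pools 4/6/8/9/10 should show up.)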
> > >
> > > It appears the OSD objectstore's metadata has been corrupted or
> > > overwritten with stale references to deleted pools from previous
> > > operations, preventing these OSDs from starting and causing widespread
> > > PG state abnormalities across the cluster.
> > >
> > > Questions:
> > >
> > >    1. How can we safely restore the missing PGs from the OSD without
> > >    data loss?
> > >    2. Has anyone encountered similar issues when upgrading from Octopus
> > >    (15.2.x) to Quincy (17.2.x)?
> > >
> > > We understand that skipping major versions may not be officially
> > > supported, but we urgently need guidance on the safest recovery path at
> > > this point.
> > >
> > > Any help would be greatly appreciated. Thank you in advance.
> > >
> > > --
> > > Regards
> > > Kefu Chai
> >
> >
> >
>
>
> --
> Regards
> Kefu Chai
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
