Thank you, Greg! Replies inlined.

On Mon, Oct 13, 2025 at 11:49 PM Gregory Farnum <[email protected]> wrote:

> On Mon, Oct 13, 2025 at 4:18 AM kefu chai <[email protected]> wrote:
>
>> hi Eugen,
>>
>> On Mon, Oct 13, 2025 at 2:03 AM Eugen Block <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > just a couple of days ago someone had the same issue:
>> >
>> >
>> >
>> https://lists.ceph.io/hyperkitty/list/[email protected]/thread/4D5QQGOKJNUITFVTZERGJXC5K3WY6FM4/#LAJSJRMXSFXFP4LUDYF5BHB4Y3MAZDSQ
>> >
>> > Apparently, the cluster was upgraded while a pool deletion was in
>> > progress. Is that the same case here? OP of the other thread patched
>> >
>>
>> Actually, that was my initial suspicion as well. I specifically asked them
>> about this possibility, but they confirmed that no pools were deleted
>> during the upgrade. Additionally, they mentioned that the system was
>> experiencing relatively low load at the time since the upgrade occurred
>> over the weekend.
>>
>> However, the puzzling aspect is that several 'ghost' PGs appeared after
>> the
>> upgrade. These weren't created due to misplacement—they seemingly
>> materialized out of nowhere. And some PGs disappeared. The only plausible
>> explanation is a corrupted objectstore. This scares me.
>>
>>
>> > his OSD code to skip the check, not sure how risky that is. But I'm
>> > also not sure how to get out of this situation, one idea was to delete
>> > the PGs from the affected OSDs, but that can be risky as well.
>> >
>> > Btw., skipping a major release is supported and has been for a long
>> > time, so upgrading from O to Q is in general totally okay. But one
>> > should only upgrade if the cluster is healthy (all PGs active+clean).
>> >
>
>
> By this story, it sounds like every single rocksdb instance in the cluster
> got corrupted. And not just corrupted, but seemingly parts of them were
> sent ages back in time?
> 1) The monitors didn’t peer, so they brought down all but one, and when it
> had a rocksdb failure they rebuilt it from the OSDs? Why didn’t they just
> use the other monitors? What was preventing them peering, anyway?
> 2) all the OSD rocksdb instances that failed to start.
>
> I’d ask a lot of questions about the technology stack that is supporting
> this — are they running Ceph on top of another storage technology that
> might have done that? I’ve seen people running in VMware (Rook) have
> somewhat similar issues when something goes wrong with the VMware
> administration.
> Some other questions that might point to something useful:
> Are the referenced “deleted pool” PGs really present?
>

No, they don't exist. The cluster currently has only 10 pools (listed
below), and none of the pools referenced by those PGs appear among them:

# ceph osd pool ls detail

pool 101 'spice1' replicated size 4 min_size 2 crush_rule 3 object_hash
rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 39679863 lfor
0/962169/37086416 flags hashpspool,nodelete,selfmanaged_snaps stripe_width
0 application rbd

pool 140 'spice2' replicated size 4 min_size 2 crush_rule 2 object_hash
rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 39679863 lfor
0/962209/37086425 flags hashpspool,nodelete,selfmanaged_snaps stripe_width
0 application rbd

pool 141 'spice2_ec42' erasure profile ec42_8k size 6 min_size 4 crush_rule
9 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn
last_change 39679863 lfor 0/0/3102042 flags
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 32768
fast_read 1 application rbd

pool 149 'spice1_ec42' erasure profile ec42_8k size 6 min_size 4 crush_rule
9 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn
last_change 39679863 flags
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 32768
fast_read 1 application rbd

pool 212 'cephfs_data' erasure profile ec42_8k size 6 min_size 4 crush_rule
9 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change
38744544 lfor 0/25551924/25551921 flags hashpspool,ec_overwrites max_bytes
28587302322176 stripe_width 32768 application cephfs

pool 213 'cephfs_metadata' replicated size 3 min_size 1 crush_rule 2
object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change
38744544 lfor 0/962515/962513 flags hashpspool stripe_width 0
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs

pool 216 'spice3' replicated size 4 min_size 2 crush_rule 3 object_hash
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 39679863 lfor
958149/961657/10634461 flags
hashpspool,incomplete_clones,nodelete,selfmanaged_snaps stripe_width 0
application rbd

pool 217 'spice3_ec42' erasure profile ec42 size 6 min_size 4 crush_rule 9
object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn
last_change 39679863 lfor 958149/32139909/37087589 flags
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 16384
fast_read 1 application rbd

pool 218 '.mgr' replicated size 3 min_size 2 crush_rule 2 object_hash
rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 38744544 flags
hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth

pool 219 'device_health_metrics' replicated size 4 min_size 2 crush_rule 2
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change
39679854 flags hashpspool stripe_width 0 pg_num_min 1 application
mgr_devicehealth

What concerns me is the output of "--op list" from ceph-objectstore-tool:
not only should these PGs not exist at all, but their metadata collections
are also missing:

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 --op list
Error getting attr on : 1.0_head,#1:00000000::::head#, (61) No data
available
Error getting attr on : 6.1f_head,#6:f8000000::::head#, (61) No data
available
["1.0",{"oid":"","key":"","snapid":-2,"hash":0,"max":0,"pool":1,"namespace":"","max":0}]
["1.0",{"oid":"main.db-journal.0000000000000000","key":"","snapid":-2,"hash":1969844440,"max":0,"pool":1,"namespace":"devicehealth","max":0}]
["1.0",{"oid":"main.db.0000000000000000","key":"","snapid":-2,"hash":1315310604,"max":0,"pool":1,"namespace":"devicehealth","max":0}]
["6.1f",{"oid":"","key":"","snapid":-2,"hash":31,"max":0,"pool":6,"namespace":"","max":0}]
Only two PGs are listed: one belongs to pool 1 and the other to pool 6.
Neither of these pools appears in the output of "ceph osd pool ls detail".

> Do the running OSDs actually make sense from a human level, or does their
> PG state look strange in a way that isn’t triggering crashes?
>

No, the OSD state does not make sense. During the upgrade we observed OSD
IDs appearing under the wrong hosts in the "ceph osd tree" output. This
wasn't just a display issue: it reflected real OSD ID conflicts, with the
same OSD IDs claimed on different hosts.
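
For anyone who wants to reproduce the check, here is a quick way to
enumerate the duplicates (a sketch, assuming jq is available and that
"ceph osd tree -f json" lists host buckets with a children array of OSD
IDs):

ceph osd tree -f json \
  | jq -r '.nodes[] | select(.type=="host") | .name as $h | .children[]? | "\(.) \($h)"' \
  | sort -n \
  | awk '{ seen[$1]++; hosts[$1] = hosts[$1] " " $2 }
         END { for (id in seen) if (seen[id] > 1) print "osd." id " listed under:" hosts[id] }'

With a single CRUSH root each OSD should appear under exactly one host, so
anything this prints is suspect; with multiple roots the filter would need
adjusting.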


> Are the other 4 monitors still available to turn on, and what do they say
> about things? (If not, why not? The missing bits about how the OSDs were
> crashing on the last ten upgrades, and how the monitors went wrong, is
> pretty crucial to a story like this.)
>

The monitors are out of quorum. We attempted multiple restarts, but they
failed to recover and re-establish quorum.
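
To dig further, I plan to query each monitor through its admin socket
(assuming the daemons start far enough to create the socket; adjust the mon
id if it isn't the short hostname) and compare the monmaps they believe in:

# On each monitor host, while its ceph-mon process is running:
ceph daemon mon.$(hostname -s) mon_status      # state, rank, election epoch, monmap epoch
ceph daemon mon.$(hostname -s) quorum_status   # who it thinks is in or out of quorum

# For a monitor that will not start at all (it must be stopped for this),
# the monmap it holds can still be extracted and inspected:
ceph-mon -i $(hostname -s) --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap

I will follow up once we have collected this from the monitor hosts.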


> Did you look to see if the centos upgrade could have done something weird
> to the disk arrangement?
>

We haven't investigated this yet. The machines had been running for 500+
days without reboot before the upgrade.
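
To rule it out, I intend to verify on a few OSD hosts that the block
symlinks still resolve to the devices ceph-volume originally recorded. A
rough sketch; ceph-bluestore-tool show-label should print the osd_uuid and
whoami fields from the BlueStore label, but please treat the exact field
names as an assumption:

for osd in /var/lib/ceph/osd/ceph-*; do
    echo "== $osd"
    readlink -f "$osd/block"
    # The BlueStore label records the osd_uuid and the OSD id (whoami) the device was created for:
    ceph-bluestore-tool show-label --dev "$osd/block" | grep -E '"osd_uuid"|"whoami"'
done

# Cross-check against what ceph-volume recorded at deploy time:
ceph-volume lvm list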

This cluster has a complex history that's relevant:

- Pre-existing issue: Earlier (before this upgrade), OSDs from a different
Ceph cluster were mistakenly added to this cluster, causing OSD ID
conflicts across hosts. We cleaned this up by removing the incorrectly
added OSDs.
- Previous recovery: After that incident, we recovered the cluster by
running mon/mgr services via containers on version 17.2.3, while keeping
OSDs on version 15.2. The cluster ran stably in this mixed-version state
for 500+ days until this weekend's upgrade.
- Upgrade attempt: During this weekend's upgrade, we attempted to upgrade
from this mixed state (mon/mgr 17.2.3, OSD 15.2). We tried multiple version
paths (15.2, 17.2.6, 18.3), but mon/mgr/osd services failed to recover
properly. At one point during the upgrade, ceph osd versions showed version
15, indicating version inconsistencies.
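
For the record, these are the commands we are using to keep track of the
version skew while we decide on the next step (the require_osd_release
check is just something we want to rule out, not a confirmed culprit):

ceph versions                              # per-daemon-type version breakdown (mon/mgr/osd/mds)
ceph osd versions                          # the same, restricted to OSDs
ceph osd dump | grep require_osd_release   # the minimum release the OSDs are being held to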

> -Greg
>
>
>>
>> Thanks for pointing this out. I also realized that we do have a test for
>> a +2 upgrade; see
>> https://github.com/ceph/ceph/tree/main/qa/suites/upgrade/reef-x.
>>
>>
>> > Regards,
>> > Eugen
>> >
>> > Zitat von kefu chai <[email protected]>:
>> >
>> > > Hello Ceph community,
>> > >
>> > > I'm writing on behalf of a friend who is experiencing a critical
>> cluster
>> > > issue after upgrading and would appreciate any assistance.
>> > >
>> > > Environment:
>> > >
>> > >    - 5 MON nodes, 2 MGR nodes, 40 OSD servers (306 OSDs total)
>> > >    - OS: CentOS 8.2 upgraded to 8.4
>> > >    - Ceph: 15.2.17 upgraded to 17.2.7
>> > >    - Upgrade method: yum update in rolling batches
>> > >
>> > > Timeline: The upgrade started on October 8th at 1:00 PM. We upgraded
>> > > MON/MGR servers first, and then upgraded OSD nodes in batches of 5
>> nodes.
>> > > The process appeared normal initially, but when approximately 10 OSD
>> > > servers remained, OSDs began going down.
>> > >
>> > > MON Quorum Issue: When the OSDs began failing, the monitors failed to
>> > form
>> > > a quorum. In an attempt to recover, we stopped 4 out of 5 monitors.
>> > > However, the remaining monitor (mbjson20010) then failed to start due
>> to
>> > a
>> > > missing .ldb file. We eventually recovered this single monitor from
>> OSD
>> > > using the instructions at
>> > >
>> >
>> https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds
>> > ,
>> > > so
>> > > we now have only 1 MON in the cluster instead of the original 5.
>> > >
>> > > However, rebuilding the MON store did not help, and restarting the OSD
>> > > servers also failed to resolve the issue. The cluster status remains
>> > > problematic.
>> > >
>> > > Current Cluster Status:
>> > >
>> > >    - Only 1 MON daemon active (quorum: mbjson20010) - down from 5 MONs
>> > >    - OSDs: 91 up / 229 in (out of 306 total)
>> > >    - 88.872% of PGs are not active
>> > >    - 4.779% of PGs are unknown
>> > >    - 3,918 PGs down
>> > >    - 1,311 PGs stale+down
>> > >    - Only 12 PGs active+clean
>> > >
>> > > Critical Error: When examining OSD logs, we discovered that some OSDs
>> are
>> > > failing to start with the following error:
>> > >
>> > > osd.43 39677784 init missing pg_pool_t for deleted pool 9 for pg
>> 9.3ds3;
>> > > please downgrade to luminous and allow pg deletion to complete before
>> > > upgrading
>> > >
>> > > Full error context from one of the failing OSDs:
>> > >
>> > > # tail  /var/log/ceph/ceph-osd.43.log
>> > >
>> > >     -7> 2025-10-12T13:40:05.987+0800 7fdd13259540  1
>> > > bluestore(/var/lib/ceph/osd/ceph-43) _upgrade_super from 4, latest 4
>> > >
>> > >     -6> 2025-10-12T13:40:05.987+0800 7fdd13259540  1
>> > > bluestore(/var/lib/ceph/osd/ceph-43) _upgrade_super done
>> > >
>> > >     -5> 2025-10-12T13:40:05.987+0800 7fdd13259540  2 osd.43 0 journal
>> > looks
>> > > like ssd
>> > >
>> > >     -4> 2025-10-12T13:40:05.987+0800 7fdd13259540  2 osd.43 0 boot
>> > >
>> > >     -3> 2025-10-12T13:40:05.987+0800 7fdceb2cc700  5
>> > > bluestore.MempoolThread(0x55c7b0c66b40) _resize_shards cache_size:
>> > > 8589934592 kv_alloc: 1717986918 kv_used: 91136 kv_onode_alloc:
>> 343597383
>> > > kv_onode_used: 23328 meta_alloc: 6871947673 meta_used: 2984
>> data_alloc: 0
>> > > data_used: 0
>> > >
>> > >     -2> 2025-10-12T13:40:05.989+0800 7fdd13259540 -1 osd.43 39677784
>> init
>> > > missing pg_pool_t for deleted pool 9 for pg 9.3ds3; please downgrade
>> to
>> > > luminous and allow pg deletion to complete before upgrading
>> > >
>> > >     -1> 2025-10-12T13:40:05.991+0800 7fdd13259540 -1
>> > >
>> >
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
>> > > In function 'int OSD::init()' thread 7fdd13259540 time
>> > > 2025-10-12T13:40:05.990845+0800
>> > >
>> > >
>> >
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
>> > > 3735: ceph_abort_msg("abort() called")
>> > >
>> > > # tail  /var/log/ceph/ceph-osd.51.log
>> > >   -7> 2025-10-12T13:39:36.739+0800 7f603e5f7540  1
>> > > bluestore(/var/lib/ceph/osd/ceph-51) _upgrade_super from 4, latest 4
>> > >     -6> 2025-10-12T13:39:36.739+0800 7f603e5f7540  1
>> > > bluestore(/var/lib/ceph/osd/ceph-51) _upgrade_super done
>> > >     -5> 2025-10-12T13:39:36.739+0800 7f603e5f7540  2 osd.51 0 journal
>> > looks
>> > > like ssd
>> > >     -4> 2025-10-12T13:39:36.739+0800 7f603e5f7540  2 osd.51 0 boot
>> > >     -3> 2025-10-12T13:39:36.739+0800 7f6016669700  5
>> > > bluestore.MempoolThread(0x55e839d4cb40) _resize_shards cache_size:
>> > > 8589934592 kv_alloc: 1717986918 kv_used: 31232 kv_onode_alloc:
>> 343597383
>> > > kv_onode_used: 21584 meta_alloc: 6871947673 meta_used: 1168
>> data_alloc: 0
>> > > data_used: 0
>> > >     -2> 2025-10-12T13:39:36.741+0800 7f603e5f7540 -1 osd.51 39677784
>> init
>> > > missing pg_pool_t for deleted pool 6 for pg 6.1f; please downgrade to
>> > > luminous and allow pg deletion to complete before upgrading
>> > >     -1> 2025-10-12T13:39:36.742+0800 7f603e5f7540 -1
>> > >
>> >
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
>> > > In function 'int OSD::init()' thread 7f603e5f7540 time
>> > > 2025-10-12T13:39:36.742527+0800
>> > >
>> >
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
>> > > 3735: ceph_abort_msg("abort() called")
>> > >
>> > > Investigation Findings: We examined all OSD instances that failed to
>> > start.
>> > > All of them exhibit the same error pattern in their logs and all
>> contain
>> > PG
>> > > references to non-existent pools. For example, running
>> > > "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 --op
>> > list-pgs"
>> > > shows PG references to pools that no longer exist (e.g., pool 9, pool
>> 10,
>> > > pool 4, pool 6, pool 8), while the current pools are numbered 101,
>> 140,
>> > > 141, 149, 212, 213, 216, 217, 218, 219. Notably, each affected OSD
>> > contains
>> > > only 2-3 PGs referencing these non-existent pools, which is
>> significantly
>> > > fewer than the hundreds of PGs a regular OSD typically contains. It
>> > appears
>> > > the OSD metadata has been corrupted or overwritten with stale
>> references
>> > to
>> > > deleted pools from previous operations, preventing these OSDs from
>> > starting
>> > > and causing widespread PG state abnormalities across the cluster.
>> > >
>> > > 2 PGs referencing non-existent pools were found in osd.51:
>> > >
>> > > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 --op
>> > list-pgs
>> > > 1.0
>> > > 6.1f
>> > >
>> > > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 --op
>> list
>> > > Error getting attr on : 1.0_head,#1:00000000::::head#, (61) No data
>> > > available
>> > > Error getting attr on : 6.1f_head,#6:f8000000::::head#, (61) No data
>> > > available
>> > >
>> >
>> ["1.0",{"oid":"","key":"","snapid":-2,"hash":0,"max":0,"pool":1,"namespace":"","max":0}]
>> > >
>> >
>> ["1.0",{"oid":"main.db-journal.0000000000000000","key":"","snapid":-2,"hash":1969844440,"max":0,"pool":1,"namespace":"devicehealth","max":0}]
>> > >
>> >
>> ["1.0",{"oid":"main.db.0000000000000000","key":"","snapid":-2,"hash":1315310604,"max":0,"pool":1,"namespace":"devicehealth","max":0}]
>> > >
>> >
>> ["6.1f",{"oid":"","key":"","snapid":-2,"hash":31,"max":0,"pool":6,"namespace":"","max":0}]
>> > >
>> > > We also performed a comprehensive check by listing all PGs from all
>> OSD
>> > > nodes using "ceph-objectstore-tool --op list-pgs" and comparing the
>> > results
>> > > with the output of "ceph pg dump". This comparison revealed that
>> quite a
>> > > few PGs are missing from the OSD listings. We suspect that some OSDs
>> that
>> > > previously held these missing PGs may now be corrupted, which would
>> > explain
>> > > both the missing PGs and the widespread cluster degradation. It
>> appears
>> > the
>> > > OSD metadata has been corrupted or overwritten with stale references
>> to
>> > > deleted pools from previous operations, preventing these OSDs from
>> > starting
>> > > and causing widespread PG state abnormalities across the cluster.
>> > >
>> > > It appears the OSD objectstore's metadata has been corrupted or
>> > overwritten
>> > > with stale references to deleted pools from previous operations,
>> > preventing
>> > > these OSDs from starting and causing widespread PG state abnormalities
>> > > across the cluster.
>> > >
>> > > Questions:
>> > >
>> > >    1. How can we safely restore the missing PGs from the OSD without
>> data
>> > >    loss?
>> > >    2. Has anyone encountered similar issues when upgrading from
>> Octopus
>> > >    (15.2.x) to Quincy (17.2.x)?
>> > >
>> > > We understand that skipping major versions may not be officially
>> > supported,
>> > > but we urgently need guidance on the safest recovery path at this
>> point.
>> > >
>> > > Any help would be greatly appreciated. Thank you in advance.
>> > >
>> > > --
>> > > Regards
>> > > Kefu Chai
>> >
>> >
>> >
>>
>>
>> --
>> Regards
>> Kefu Chai
>>
>

-- 
Regards
Kefu Chai
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
