Hello. Today we experienced a complete Ceph cluster outage: a total loss of power across the whole infrastructure. 6 OSD nodes and 3 monitors went down at the same time. Ceph version is 14.2.10.
This resulted in unfound objects, which were "reverted" in a hurry with

    ceph pg <pg_id> mark_unfound_lost revert

In retrospect that was probably a mistake, as the "have" part stated 0'0.

But then deep scrubs started and they found inconsistent PGs. We tried repairing them, but they just switched to failed_repair. Here's a log example:

    2021-06-25 00:08:07.693645 osd.0 [ERR] 3.c shard 6 3:3163e703:::rbd_data.be08c566ef438d.0000000000002445:head : missing
    2021-06-25 00:08:07.693710 osd.0 [ERR] repair 3.c 3:3163e2ee:::rbd_data.efa86358d15f4a.000000000000004b:6ab1 : is an unexpected clone
    2021-06-25 00:11:55.128951 osd.0 [ERR] 3.c repair 1 missing, 0 inconsistent objects
    2021-06-25 00:11:55.128969 osd.0 [ERR] 3.c repair 2 errors, 1 fixed

I tried manually deleting the conflicting objects from secondary OSDs with ceph-objectstore-tool, like this:

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 3.c rbd_data.efa86358d15f4a.000000000000004b:6ab1 remove

It does remove the object, but without any positive impact. I'm pretty sure I don't understand the concept.

So currently I have the following thoughts:

- Is there any documentation on object placement specifics and what all of the numbers in an object's name mean? I've seen objects with the same prefix and middle part but different suffixes, and I have no idea what that means.
- I'm actually not sure what the production impact is at this point, because everything seems to work so far. So I'm wondering whether it's possible to kill the replicas on the secondary OSDs with ceph-objectstore-tool and just let Ceph re-create them from the primary PG.

I have 8 scrub errors and 4 inconsistent+failed_repair PGs, and I'm afraid that further deep scrubs will reveal more errors.

Any thoughts appreciated.
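P.S. For completeness, here is roughly the full sequence around that remove on a secondary OSD (a sketch from memory; the systemd unit name assumes a plain non-containerized deployment, ceph-objectstore-tool only works while the OSD is stopped, and the --op list step is just me double-checking I'm hitting the right object):

    # ceph-objectstore-tool needs the OSD offline
    systemctl stop ceph-osd@22

    # confirm the exact object first; the trailing 6ab1 from the log is the
    # snap id in hex and shows up as a decimal "snapid" in the JSON output
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 \
        --op list rbd_data.efa86358d15f4a.000000000000004b

    # remove the clone object and bring the OSD back
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 3.c \
        rbd_data.efa86358d15f4a.000000000000004b:6ab1 remove
    systemctl start ceph-osd@22

    # then ask the PG to repair again
    ceph pg repair 3.c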
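P.P.S. On the production-impact question, this is roughly how I'm trying to see what is actually affected (again just a sketch; "rbd" below is a placeholder for whatever pool has pool id 3 in our cluster):

    # which PGs are currently flagged
    ceph health detail | grep -i inconsistent

    # what exactly the last deep scrub complained about, per PG
    rados list-inconsistent-obj 3.c --format=json-pretty

    # map an rbd_data prefix from the log back to an image name
    for img in $(rbd ls rbd); do
        rbd info rbd/"$img" | grep -q 'block_name_prefix: rbd_data.efa86358d15f4a' \
            && echo "$img"
    done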