Hello everyone, and sorry in advance; maybe someone has already faced this problem.
A day ago we restored our OpenShift cluster, but since then the PVCs cannot be
attached to their pods. We looked at the Ceph status and found that our MDS
daemons were in standby mode, and then discovered that the metadata was corrupted.
After some manipulation we were able to bring the MDS daemons back up, but the
cluster still does not accept writes (the MDS is read-only), and ceph status
shows the following.

sh-4.4$ ceph -s
  cluster:
    id:     9213604e-b0b6-49d5-bcb3-f55ab3d79119
    health: HEALTH_ERR
            1 MDSs report damaged metadata
            1 MDSs are read only
            6 daemons have recently crashed
  services:
    mon: 5 daemons, quorum bd,bj,bm,bn,bo (age 26h)
    mgr: a(active, since 25h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 9 osds: 9 up (since 41h), 9 in (since 42h)
    rgw: 1 daemon active (1 hosts, 1 zones)
  data:
    volumes: 1/1 healthy
    pools:   10 pools, 225 pgs
    objects: 1.60M objects, 234 GiB
    usage:   606 GiB used, 594 GiB / 1.2 TiB avail
    pgs:     225 active+clean
  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
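
If it helps, we can also share the output of the following commands (assuming
our file system is named gml--cephfs; corrections welcome):

ceph health detail            # full text of the damaged-metadata / read-only warnings
ceph fs status gml--cephfs    # rank and standby layout
ceph crash ls                 # the six recently crashed daemons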

Now we are trying to follow these instructions:
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#recovery-from-missing-metadata-objects
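
As far as we understand, the offline tools in that guide assume the file system
is taken down first, so our plan is to do something like the following before
any further recovery steps (file system name is assumed, not confirmed):

ceph fs fail gml--cephfs                 # stop the MDS ranks / mark the fs not joinable
# ... run the offline recovery tools ...
ceph fs set gml--cephfs joinable true    # allow an MDS to take the rank again afterwards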

What else we have tried:

cephfs-journal-tool --rank=1:0 event recover_dentries summary
cephfs-journal-tool --rank=1:0 journal reset
cephfs-table-tool all reset session
ceph tell mds.gml--cephfs-a scrub start / recursive repair force
ceph tell mds.gml--cephfs-b scrub start / recursive repair force
ceph mds repaired 0

ceph tell mds.gml--cephfs-a damage ls

[
    {
        "damage_type": "dir_frag",
        "id": 26851730,
        "ino": 1100162409473,
        "frag": "*",
        "path": 
"/volumes/csi/csi-vol-5ad18c03-3205-11ed-9ba7-0a580a810206/e5664004-51e0-4bff-85c8-029944b431d8/store/096/096a1497-78ab-4802-a5a7-d09e011fd3a5/202301_1027796_1027796_0"
    },
………

    {
        "damage_type": "dir_frag",
        "id": 118336643,
        "ino": 1100162424469,
        "frag": "*",
        "path": 
"/volumes/csi/csi-vol-5ad18c03-3205-11ed-9ba7-0a580a810206/e5664004-51e0-4bff-85c8-029944b431d8/store/096/096a1497-78ab-4802-a5a7-d09e011fd3a5/202301_1027832_1027832_0"
    },
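
Before going further with a full rebuild, would it make sense to first try a
targeted repair of the damaged directory fragments and then clear the
corresponding damage entries? Something like the following is our reading of
the docs (the damage ID is taken from the output above, the path placeholder is
ours, and the exact scrub option syntax is our guess):

ceph tell mds.gml--cephfs-a scrub start <damaged directory path> recursive,repair,force
ceph tell mds.gml--cephfs-a damage rm 26851730    # clear one damage entry once repaired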

Now we are trying:

# Session table
cephfs-table-tool 0 reset session
# SnapServer
cephfs-table-tool 0 reset snap
# InoTable
cephfs-table-tool 0 reset inode
# Journal
cephfs-journal-tool --rank=0 journal reset
# Root inodes ("/" and MDS directory)
cephfs-data-scan init

cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_links
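
To speed this up, we are thinking of running the scans with several workers,
and afterwards marking rank 0 as repaired and doing a forward scrub. Roughly
like this (the worker count is just an example, each worker is a separate
invocation, and the pool name stays a placeholder):

# run 4 parallel workers, numbered 0..3 (same pattern for scan_inodes)
cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
# ...

# after the scans: mark rank 0 repaired, then forward scrub the tree
ceph mds repaired 0
ceph tell mds.gml--cephfs-a scrub start / recursive,repair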

Is this the right way to go, and could it be our salvation?
Thank you!
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
