Hi,

Without any outage or disaster, CephFS (17.2.5, deployed with cephadm) reports damaged metadata:

[root@ceph106 ~]# zcat /var/log/ceph/3cacfa58-55cf-11ed-abaf-5cba2c03dec0/ceph-mds.disklib.ceph106.kbzjbg.log-20221211.gz
2022-12-10T10:12:35.161+0000 7fa46779d700  1 mds.disklib.ceph106.kbzjbg Updating MDS map to version 958 from mon.1
2022-12-10T10:12:50.974+0000 7fa46779d700  1 mds.disklib.ceph106.kbzjbg Updating MDS map to version 959 from mon.1
2022-12-10T15:18:36.609+0000 7fa461791700  0 mds.0.cache.dir(0x100001516b1) _fetched missing object for [dir 0x100001516b1 /volumes/_nogroup/ec-pool4p2/aa36abb9-a22e-405f-921c-76152599c6ba/LQ1WYG_10.28.2022_04.50/CV_MAGNETIC/V_7770505/ [2,head] auth v=0 cv=0/0 ap=1+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x56541d3c5a80]
2022-12-10T15:18:36.615+0000 7fa461791700 -1 log_channel(cluster) log [ERR] : dir 0x100001516b1 object missing on disk; some files may be lost (/volumes/_nogroup/ec-pool4p2/aa36abb9-a22e-405f-921c-76152599c6ba/LQ1WYG_10.28.2022_04.50/CV_MAGNETIC/V_7770505)
2022-12-10T15:18:40.010+0000 7fa46779d700  1 mds.disklib.ceph106.kbzjbg Updating MDS map to version 960 from mon.1
2022-12-11T02:32:01.474+0000 7fa468fa0700 -1 received  signal: Hangup from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0

[root@ceph101 ~]# ceph tell mds.disklib:0 damage ls
2022-12-12T10:20:42.484+0100 7fa9e37fe700  0 client.165258 ms_handle_reset on v2:xxx.xxx.xxx.xxx:6800/519677707
2022-12-12T10:20:42.504+0100 7fa9e37fe700  0 client.165264 ms_handle_reset on v2:xxx.xxx.xxx.xxx:6800/519677707
[
    {
        "damage_type": "dir_frag",
        "id": 2085830739,
        "ino": 1099513009841,
        "frag": "*",
        "path": 
"/volumes/_nogroup/ec-pool4p2/aa36abb9-a22e-405f-921c-76152599c6ba/LQ1WYG_10.28.2022_04.50/CV_MAGNETIC/V_7770505"
    }
]
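
The ino 1099513009841 in the damage entry is the same directory as the dir 0x100001516b1 from the MDS log:

[root@ceph101 ~]# printf '%x\n' 1099513009841
100001516b1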

The mentioned path .../CV_MAGNETIC/V_7770505 is no longer visible, but I can't tell whether that is because it was lost or because it was removed by the application using the CephFS.

Data is on an EC 4+2 pool; the root (default) data pool and the metadata pool are replica=3.
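
In case it helps with debugging: am I right that the dirfrag of 0x100001516b1 should exist as an object in the metadata pool, so that a check like the one below would show whether it is really gone? (The <hex-inode>.<frag> object name is only my assumption of how dirfrags are stored; <metadata pool> is a placeholder for my cluster's metadata pool.)

[root@ceph101 ~]# rados -p <metadata pool> stat 100001516b1.00000000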

My questions are: what happened, and how do I fix the problem?

Is running "ceph tell mds.disklib:0 scrub start /what/path? recursive,repair" the right thing to do? Is that command safe? What is the impact on production?
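
Concretely, if a scrub is the right approach, I assume I would start it on the damaged path (or should that be its parent directory?) and then poll the status, along these lines (the syntax is my best reading of the docs, so please correct me if it is wrong):

[root@ceph101 ~]# ceph tell mds.disklib:0 scrub start /volumes/_nogroup/ec-pool4p2/aa36abb9-a22e-405f-921c-76152599c6ba/LQ1WYG_10.28.2022_04.50/CV_MAGNETIC/V_7770505 recursive,repair
[root@ceph101 ~]# ceph tell mds.disklib:0 scrub status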

Can the file system stay mounted and in use by clients during the repair? How long will it take for 340 TB? And what exactly is a dir_frag damage?

TIA, Sascha.