Hello, I have a Ceph deployment using CephFS. Recently the MDS for this filesystem failed and will not start: every attempt to start it ends in a segfault almost immediately. Logs are below.
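Before running any recovery tooling I plan to take a backup of the MDS journal first. A minimal sketch of what I intend to run is below; it assumes "galaxy" is the filesystem name with a single active rank (0), and the backup path is just a placeholder, so please correct me if this is not the right first step.

  # stop MDS activity on the filesystem so nothing else touches the journal
  ceph fs set galaxy down true
  # export a raw copy of the rank 0 journal before attempting any repair
  cephfs-journal-tool --rank=galaxy:0 journal export /root/galaxy-journal-backup.bin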
cephfs-journal-tool reports the overall journal integrity as OK:

root@proxmox-2:/var/log/ceph# cephfs-journal-tool --rank=galaxy:all journal inspect
Overall journal integrity: OK

Stack dump / log from the MDS:

-14> 2023-05-26T15:01:09.204-0500 7f27c24b2700 1 mds.0.journaler.mdlog(ro) probing for end of the log
-13> 2023-05-26T15:01:09.208-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 4194304, expire 4194607, write 4194607, stream_format 1). probing for end of log (from 4194607)...
-12> 2023-05-26T15:01:09.208-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) probing for end of the log
-11> 2023-05-26T15:01:09.412-0500 7f27c24b2700 1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 2388235687 (header had 2388213543). recovered.
-10> 2023-05-26T15:01:09.412-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 4194607 (header had 4194607). recovered.
-9> 2023-05-26T15:01:09.412-0500 7f27c34b4700 4 mds.0.purge_queue operator(): open complete
-8> 2023-05-26T15:01:09.412-0500 7f27c34b4700 1 mds.0.journaler.pq(ro) set_writeable
-7> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 4 mds.0.log Journal 0x200 recovered.
-6> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 4 mds.0.log Recovered journal 0x200 in format 1
-5> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 2 mds.0.6403 Booting: 1: loading/discovering base inodes
-4> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 0 mds.0.cache creating system inode with ino:0x100
-3> 2023-05-26T15:01:09.412-0500 7f27c1cb1700 0 mds.0.cache creating system inode with ino:0x1
-2> 2023-05-26T15:01:09.416-0500 7f27c24b2700 2 mds.0.6403 Booting: 2: replaying mds log
-1> 2023-05-26T15:01:09.416-0500 7f27c24b2700 2 mds.0.6403 Booting: 2: waiting for purge queue recovered
0> 2023-05-26T15:01:09.428-0500 7f27c0caf700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f27c0caf700 thread_name:md_log_replay

 ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f27cd70c140]
 2: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x66c2) [0x563540fc7372]
 3: (EUpdate::replay(MDSRank*)+0x3c) [0x563540fc8abc]
 4: (MDLog::_replay_thread()+0x7cb) [0x563540f4d0fb]
 5: (MDLog::ReplayThread::entry()+0xd) [0x563540c1fbfd]
 6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f27cd700ea7]
 7: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

What are the safest steps to recovery at this point?

Thanks,
Al

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io