I followed the docs from here: http://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts
I exported the journals as a backup for both ranks (I was running 2 active MDS daemons at the time):

cephfs-journal-tool --rank=combined:0 journal export cephfs-journal-0-201905161412.bin
cephfs-journal-tool --rank=combined:1 journal export cephfs-journal-1-201905161412.bin

I recovered the dentries on both ranks:

cephfs-journal-tool --rank=combined:0 event recover_dentries summary
cephfs-journal-tool --rank=combined:1 event recover_dentries summary

I reset the journals of both ranks:

cephfs-journal-tool --rank=combined:1 journal reset
cephfs-journal-tool --rank=combined:0 journal reset

Then I reset the session table:

cephfs-table-tool all reset session

Once that was done, I rebooted all machines that were talking to CephFS (or at least unmounted/remounted).

On Fri, May 17, 2019 at 2:30 AM <wangzhig...@uniview.com> wrote:
>
> Hi,
> Can you tell me the detailed recovery commands?
>
> I just started learning CephFS; I would be grateful.
>
> From: Adam Tygart <mo...@ksu.edu>
> To: Ceph Users <ceph-users@lists.ceph.com>
> Date: 2019/05/17 09:04
> Subject: Re: [ceph-users] MDS Crashing 14.2.1
> Sender: "ceph-users" <ceph-users-boun...@lists.ceph.com>
> ________________________________
>
> I ended up backing up the journals of the MDS ranks, running recover_dentries
> for both of them, and resetting the journals and session table. It is back up.
> The recover_dentries stage didn't show any errors, so I'm not even sure why
> the MDS was asserting about duplicate inodes.
>
> --
> Adam
>
> On Thu, May 16, 2019, 13:52 Adam Tygart <mo...@ksu.edu> wrote:
> Hello all,
>
> The rank 0 MDS is still asserting. Is this duplicate inode situation one
> where I should consider using cephfs-journal-tool to export, recover
> dentries, and reset?
>
> Thanks,
> Adam
>
> On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mo...@ksu.edu> wrote:
> >
> > Hello all,
> >
> > I've got a 30 node cluster serving up lots of CephFS data.
> >
> > We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
> > this week.
> >
> > We've been running 2 MDS daemons in an active-active setup. Tonight
> > one of the metadata daemons crashed with the following several times:
> >
> > -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> > In function 'void CInode::set_primary_parent(CDentry*)' thread
> > 7f9f22405700 time 2019-05-16 00:20:56.775021
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> > 1114: FAILED ceph_assert(parent == 0 ||
> > g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))
> >
> > I made a quick decision to move to a single MDS because I saw
> > set_primary_parent, and I thought it might be related to auto-balancing
> > between the metadata servers.
> >
> > This caused one MDS to fail and the other to crash; now rank 0 loads,
> > goes active, and then crashes with the following:
> > -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> > In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700 time
> > 2019-05-16 00:29:21.149531
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> > 258: FAILED ceph_assert(!p)
> >
> > It now looks like we somehow have a duplicate inode in the MDS journal?
> >
> > https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0,
> > then became rank 1 after the crash and the attempted drop to one
> > active MDS
> > https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0,
> > crashed
> >
> > Anyone have any thoughts on this?
> >
> > Thanks,
> > Adam
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
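[Editor's sketch] One step the thread does not show: before the destructive `journal reset`, the journal can be examined read-only. A sketch of non-destructive inspection commands, assuming the same filesystem name ("combined") and a live cluster (these only make sense against one):

```shell
# Read-only checks; none of these modify the journal or metadata pool.
cephfs-journal-tool --rank=combined:0 journal inspect     # header and integrity check
cephfs-journal-tool --rank=combined:0 event get summary   # per-type event counts
cephfs-journal-tool --rank=combined:0 event get list      # list journaled events and the dentries they touch
```

Running `event get list` first might have shown which inode the duplicate-inode assert was complaining about before any state was discarded.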
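[Editor's sketch] The recovery sequence at the top of this thread can be collected into a short POSIX shell script. This is a hedged sketch, not the author's actual procedure: the filesystem name "combined", the two active ranks, and the timestamp tag come from the thread, while the plan-then-review wrapper and the FS_NAME/RANKS/STAMP variables are additions so the full command list can be checked before the destructive steps run.

```shell
#!/bin/sh
# Sketch of the recovery sequence from this thread. Builds the command
# list into $PLAN first so it can be reviewed before anything runs.
FS_NAME="${FS_NAME:-combined}"   # assumption: fs named "combined" as in the thread
RANKS="${RANKS:-0 1}"            # two active MDS ranks, as in the thread
STAMP="${STAMP:-201905161412}"   # tag for the backup files; use date +%Y%m%d%H%M live

PLAN=""
plan() { PLAN="${PLAN}$*
"; }

# 1. Back up each rank's journal before touching anything.
for r in $RANKS; do
  plan cephfs-journal-tool --rank="$FS_NAME:$r" journal export "cephfs-journal-$r-$STAMP.bin"
done
# 2. Write recoverable dentries from the journal back into the metadata pool.
for r in $RANKS; do
  plan cephfs-journal-tool --rank="$FS_NAME:$r" event recover_dentries summary
done
# 3. Reset the journals -- destructive; only after the backups above succeed.
for r in $RANKS; do
  plan cephfs-journal-tool --rank="$FS_NAME:$r" journal reset
done
# 4. Reset the session table, then unmount/remount (or reboot) every client.
plan cephfs-table-tool all reset session

printf '%s' "$PLAN"   # review the plan; pipe to `sh -x` only once you are sure
```

The dry-run default is deliberate: `journal reset` discards journal entries that `recover_dentries` could not flush, so printing the plan before executing it is cheap insurance.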