I followed the docs from here: http://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts
I exported the journals as a backup for both ranks (I was running 2 active MDS daemons at the time):

cephfs-journal-tool --rank=combined:0 journal export cephfs-journal-0-201905161412.bin
cephfs-journal-tool --rank=combined:1 journal export cephfs-journal-1-201905161412.bin

I recovered the dentries on both ranks:

cephfs-journal-tool --rank=combined:0 event recover_dentries summary
cephfs-journal-tool --rank=combined:1 event recover_dentries summary

I reset the journals of both ranks:

cephfs-journal-tool --rank=combined:1 journal reset
cephfs-journal-tool --rank=combined:0 journal reset

Then I reset the session table:

cephfs-table-tool all reset session

Once that was done, I rebooted all machines that were talking to CephFS (or at least unmounted/remounted).

On Fri, May 17, 2019 at 2:30 AM <wangzhig...@uniview.com> wrote:
>
> Hi,
> Can you tell me the detailed recovery commands?
>
> I just started learning CephFS; I would be grateful.
>
> From: Adam Tygart <mo...@ksu.edu>
> To: Ceph Users <ceph-users@lists.ceph.com>
> Date: 2019/05/17 09:04
> Subject: Re: [ceph-users] MDS Crashing 14.2.1
> Sender: "ceph-users" <ceph-users-boun...@lists.ceph.com>
> ________________________________
>
> I ended up backing up the journals of the MDS ranks, running recover_dentries
> for both of them, and resetting the journals and session table. It is back up.
> The recover_dentries stage didn't show any errors, so I'm not even sure why
> the MDS was asserting about duplicate inodes.
>
> --
> Adam
>
> On Thu, May 16, 2019, 13:52 Adam Tygart <mo...@ksu.edu> wrote:
> Hello all,
>
> The rank 0 MDS is still asserting. Is this duplicate inode situation one
> where I should consider using cephfs-journal-tool to export, recover
> dentries, and reset?
>
> Thanks,
> Adam
>
> On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mo...@ksu.edu> wrote:
> >
> > Hello all,
> >
> > I've got a 30 node cluster serving up lots of CephFS data.
> >
> > We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
> > this week.
> >
> > We've been running 2 MDS daemons in an active-active setup. Tonight
> > one of the metadata daemons crashed with the following several times:
> >
> > -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> > In function 'void CInode::set_primary_parent(CDentry*)' thread
> > 7f9f22405700 time 2019-05-16 00:20:56.775021
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> > 1114: FAILED ceph_assert(parent == 0 ||
> > g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))
> >
> > I made a quick decision to move to a single MDS because I saw
> > set_primary_parent, and I thought it might be related to auto-balancing
> > between the metadata servers.
> >
> > This caused one MDS to fail and the other to crash; now rank 0 loads,
> > goes active, and then crashes with the following:
> > -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> > In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700 time
> > 2019-05-16 00:29:21.149531
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> > 258: FAILED ceph_assert(!p)
> >
> > It now looks like we somehow have a duplicate inode in the MDS journal?
> >
> > https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0,
> > then became rank 1 after the crash and the attempted drop to one
> > active MDS
> > https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0,
> > crashed
> >
> > Anyone have any thoughts on this?
> >
> > Thanks,
> > Adam
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
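[Editor's sketch] One step the thread does not show: before the destructive `journal reset`, the journal can be examined read-only. A sketch of non-destructive inspection commands, assuming the same filesystem name ("combined") and a live cluster (these only make sense against one):

```shell
# Read-only checks; none of these modify the journal or metadata pool.
cephfs-journal-tool --rank=combined:0 journal inspect     # header and integrity check
cephfs-journal-tool --rank=combined:0 event get summary   # per-type event counts
cephfs-journal-tool --rank=combined:0 event get list      # list journaled events and the dentries they touch
```

Running `event get list` first might have shown which inode the duplicate-inode assert was complaining about before any state was discarded.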
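[Editor's sketch] The recovery sequence at the top of this thread can be collected into a short POSIX shell script. This is a hedged sketch, not the author's actual procedure: the filesystem name "combined", the two active ranks, and the timestamp tag come from the thread, while the plan-then-review wrapper and the FS_NAME/RANKS/STAMP variables are additions so the full command list can be checked before the destructive steps run.

```shell
#!/bin/sh
# Sketch of the recovery sequence from this thread. Builds the command
# list into $PLAN first so it can be reviewed before anything runs.
FS_NAME="${FS_NAME:-combined}"   # assumption: fs named "combined" as in the thread
RANKS="${RANKS:-0 1}"            # two active MDS ranks, as in the thread
STAMP="${STAMP:-201905161412}"   # tag for the backup files; use date +%Y%m%d%H%M live

PLAN=""
plan() { PLAN="${PLAN}$*
"; }

# 1. Back up each rank's journal before touching anything.
for r in $RANKS; do
  plan cephfs-journal-tool --rank="$FS_NAME:$r" journal export "cephfs-journal-$r-$STAMP.bin"
done
# 2. Write recoverable dentries from the journal back into the metadata pool.
for r in $RANKS; do
  plan cephfs-journal-tool --rank="$FS_NAME:$r" event recover_dentries summary
done
# 3. Reset the journals -- destructive; only after the backups above succeed.
for r in $RANKS; do
  plan cephfs-journal-tool --rank="$FS_NAME:$r" journal reset
done
# 4. Reset the session table, then unmount/remount (or reboot) every client.
plan cephfs-table-tool all reset session

printf '%s' "$PLAN"   # review the plan; pipe to `sh -x` only once you are sure
```

The dry-run default is deliberate: `journal reset` discards journal entries that `recover_dentries` could not flush, so printing the plan before executing it is cheap insurance.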