Hello all, I've got a 30-node cluster serving up lots of CephFS data.
We upgraded from Luminous 12.2.11 to Nautilus 14.2.1 this past Monday. We've been running 2 MDS daemons in an active-active setup. Tonight one of the metadata daemons crashed several times with the following:

-1> 2019-05-16 00:20:56.775 7f9f22405700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h: In function 'void CInode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16 00:20:56.775021
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h: 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))

Because the assertion was in set_primary_parent, I made a quick decision to drop to a single active MDS, thinking the crash might be related to automatic balancing between the metadata servers. That caused one MDS to fail and the other to crash; now rank 0 loads, goes active, and then crashes with the following:

-1> 2019-05-16 00:29:21.151 7fe315e8d700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 00:29:21.149531
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc: 258: FAILED ceph_assert(!p)

It now looks like we somehow have a duplicate inode in the MDS journal?

Full MDS logs:

https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0, then became rank 1 after the crash and the attempted drop to one active MDS
https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0, which crashed

Anyone have any thoughts on this?

Thanks,
Adam
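P.S. For reference, the drop to a single active MDS was done with the usual max_mds change, along these lines (the filesystem name below is a placeholder):

    ceph fs set <fs_name> max_mds 1

Before trying anything destructive on the journal, I'm considering inspecting it and taking a backup with cephfs-journal-tool, something like:

    # inspect the journal of rank 0 for corruption (fs name is a placeholder)
    cephfs-journal-tool --rank=<fs_name>:0 journal inspect

    # export a backup of the journal before attempting any recovery steps
    cephfs-journal-tool --rank=<fs_name>:0 journal export backup.bin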