Re: [ceph-users] mds servers in endless segfault loop

2019-10-11 Thread Pickett, Neale T
I have created an anonymized crash log at 
https://pastebin.ubuntu.com/p/YsVXQQTBCM/ in the hopes that it can help someone 
understand what's leading to our MDS outage.


Thanks in advance for any assistance.



From: Pickett, Neale T
Sent: Thursday, October 10, 2019 21:46
To: ceph-users@lists.ceph.com
Subject: mds servers in endless segfault loop


Hello, ceph-users.


Our MDS servers keep segfaulting from a failed assertion, and for the first 
time I can't find anyone else who's posted about this problem. None of them are 
able to stay up, so our CephFS is down.


We recently had to truncate the journal after an upgrade to Nautilus, and 
now we see lots of "dup inodes", "failed to open ino", and "badness: got (but i 
already had)" messages in the recent event dump, in case that's relevant. I 
don't know which parts matter most, but here are the last ten entries:


  -10> 2019-10-11 03:30:35.258 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843c err -22/0
   -9> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843c err -22/0
   -8> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843d err -22/-22
   -7> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843e err -22/-22
   -6> 2019-10-11 03:30:35.261 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843f err -22/-22
   -5> 2019-10-11 03:30:35.261 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1845a err -22/-22
   -4> 2019-10-11 03:30:35.262 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1845e err -22/-22
   -3> 2019-10-11 03:30:35.262 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1846f err -22/-22
   -2> 2019-10-11 03:30:35.263 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a18470 err -22/-22
   -1> 2019-10-11 03:30:35.273 7fd080a69700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/CInode.cc: In function 'CDir* CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 7fd080a69700 time 2019-10-11 03:30:35.273849
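
For anyone triaging a dump like the one above, here is a minimal sketch (not part of the original post) that tallies the "failed to open ino" entries by inode and error pair. The function name and the sample text are illustrative assumptions; the pattern is based only on the line format shown in this message, and the sketch makes no claim about what the two numbers after "err" mean beyond noting that 22 is EINVAL on Linux.

```python
import re
from collections import Counter

# Matches entries of the form seen in the dump: "failed to open ino 0x... err A/B".
FAILED_OPEN = re.compile(r"failed to open ino (0x[0-9a-fA-F]+) err (-?\d+)/(-?\d+)")


def summarize_failed_opens(log_text):
    """Count "failed to open ino" entries, keyed by (inode, first err, second err)."""
    # Collapse all whitespace so entries wrapped across lines by a mail client still match.
    normalized = " ".join(log_text.split())
    counts = Counter()
    for ino, err_a, err_b in FAILED_OPEN.findall(normalized):
        counts[(int(ino, 16), int(err_a), int(err_b))] += 1
    return counts


# Hypothetical sample: three entries copied from the dump, including the mail-client wrapping.
SAMPLE = """
  -10> 2019-10-11 03:30:35.258 7fd080a69700  0 mds.0.cache  failed to open ino
0x1a1843c err -22/0
-9> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino
0x1a1843c err -22/0
-8> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino
0x1a1843d err -22/-22
"""

if __name__ == "__main__":
    for (ino, err_a, err_b), n in sorted(summarize_failed_opens(SAMPLE).items()):
        print(f"ino {ino:#x} err {err_a}/{err_b}: {n} occurrence(s)")
```

Run against a full MDS log, a summary like this would show whether the failures cluster on a few inodes or span a wide range, which may help narrow down which metadata was affected by the journal truncation.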


I'm happy to provide any other information that would help diagnose the issue. 
I don't have any guesses about what else would be helpful, though.


Thanks in advance for any help!



Neale Pickett 
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds servers in endless segfault loop

2019-10-10 Thread Pickett, Neale T
Hello, ceph-users.


Our MDS servers keep segfaulting from a failed assertion, and for the first 
time I can't find anyone else who's posted about this problem. None of them are 
able to stay up, so our CephFS is down.


We recently had to truncate the journal after an upgrade to Nautilus, and 
now we see lots of "dup inodes", "failed to open ino", and "badness: got (but i 
already had)" messages in the recent event dump, in case that's relevant. I 
don't know which parts matter most, but here are the last ten entries:


  -10> 2019-10-11 03:30:35.258 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843c err -22/0
   -9> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843c err -22/0
   -8> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843d err -22/-22
   -7> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843e err -22/-22
   -6> 2019-10-11 03:30:35.261 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843f err -22/-22
   -5> 2019-10-11 03:30:35.261 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1845a err -22/-22
   -4> 2019-10-11 03:30:35.262 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1845e err -22/-22
   -3> 2019-10-11 03:30:35.262 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1846f err -22/-22
   -2> 2019-10-11 03:30:35.263 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a18470 err -22/-22
   -1> 2019-10-11 03:30:35.273 7fd080a69700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/CInode.cc: In function 'CDir* CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 7fd080a69700 time 2019-10-11 03:30:35.273849


I'm happy to provide any other information that would help diagnose the issue. 
I don't have any guesses about what else would be helpful, though.


Thanks in advance for any help!



Neale Pickett 
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory