Hi All.

I came in this morning to find that one of my CephFS file systems was read-only 
and that the MDS was replaying the log, but the MDS processes kept crashing 
after running out of memory.
I have had to increase the memory on the VMs hosting the MDS, and the MDS 
process now grows to ~76GB before it comes online briefly. I also had to set 
standby_count_wanted to 0 to get the daemon up, but it then promptly crashes 
again with the errors below. My research suggests I might be hitting the bug 
discussed in this pull request:
https://github.com/ceph/ceph/pull/25519/files

Any suggestions on how I can recover from this situation?

-10001> 2019-07-08 14:13:16.659 7f90df693700  5 -- 10.137.0.134:6800/1608067295 
>> 10.120.0.58:0/4242249126 conn(0x563f8a8d6300 :6800 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=46 cs=1 l=0). rx client.47532 
seq 20 0x564d120b13c0 client_session(request_renewcaps seq 96425)
-10001> 2019-07-08 14:13:17.043 7f90d9687700  1 heartbeat_map is_healthy 
'MDSRank' had timed out after 15
-10001> 2019-07-08 14:13:17.043 7f90d9687700  0 mds.beacon.ceph-b-3 Skipping 
beacon heartbeat to monitors (last acked 14.5042s ago); MDS internal heartbeat 
is not healthy!
-10001> 2019-07-08 14:13:17.159 7f90d9e88700 -1 
/build/ceph-13.2.6/src/include/elist.h: In function 'elist<T>::item::~item() 
[with T = CDentry*]' thread 7f90d9e88700 time 2019-07-08 14:13:17.162533
/build/ceph-13.2.6/src/include/elist.h: 39: FAILED assert(!is_on_list())

ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) 
[0x7f90e4947b5e]
2: (()+0x2c4cb7) [0x7f90e4947cb7]
3: (CDentry::~CDentry()+0x372) [0x563a8cde1ee2]
4: (CDentry::~CDentry()+0x9) [0x563a8cde1f19]
5: (CDir::remove_dentry(CDentry*)+0x165) [0x563a8cdee215]
6: (MDCache::trim_dentry(CDentry*, std::map<int, MCacheExpire*, std::less<int>, 
std::allocator<std::pair<int const, MCacheExpire*> > >&)+0xfe) [0x563a8cd14bbe]
7: (MDCache::trim_lru(unsigned long, std::map<int, MCacheExpire*, 
std::less<int>, std::allocator<std::pair<int const, MCacheExpire*> > >&)+0x85d) 
[0x563a8cd1616d]
8: (MDCache::trim(unsigned long)+0x24a) [0x563a8cd1712a]
9: (MDSRankDispatcher::tick()+0xd9) [0x563a8cc35979]
10: (FunctionContext::finish(int)+0x2c) [0x563a8cc1badc]
11: (Context::complete(int)+0x9) [0x563a8cc19f89]
12: (SafeTimer::timer_thread()+0xf9) [0x7f90e4944329]
13: (SafeTimerThread::entry()+0xd) [0x7f90e4945a3d]
14: (()+0x76db) [0x7f90e41fb6db]
15: (clone()+0x3f) [0x7f90e33e188f]

-10001> 2019-07-08 14:13:17.163 7f90d9e88700 -1 *** Caught signal (Aborted) **
in thread 7f90d9e88700 thread_name:safe_timer
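
For anyone triaging a trace like the one above, the failed assertion and the 
frames leading to it can be pulled out with a small script. This is only a 
minimal sketch, not a Ceph tool: the names parse_crash, unwrap and SAMPLE are 
hypothetical, and the regexes assume the "N: (symbol+0xOFF) [0xADDR]" frame 
layout and the "FAILED assert(...)" line exactly as they appear in this log.

```python
import re

# Layout assumed from the trace above: "N: (symbol+0xOFF) [0xADDR]".
FRAME_START = re.compile(r"^\d+: \(")
FRAME_RE = re.compile(r"^(\d+): \((.*)\) \[(0x[0-9a-f]+)\]$")
ASSERT_RE = re.compile(r"^(\S+): (\d+): FAILED assert\((.*)\)$")


def unwrap(lines):
    """Rejoin backtrace frames that the mail client wrapped across lines."""
    out = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # A frame is still "open" if it started but has no closing [0x...] yet.
        prev_open = bool(out) and FRAME_START.match(out[-1]) and not out[-1].endswith("]")
        if prev_open and not FRAME_START.match(line):
            out[-1] += " " + line
        else:
            out.append(line)
    return out


def parse_crash(text):
    """Return (assertion, frames) found in a pasted MDS crash excerpt."""
    assertion = None
    frames = []
    for line in unwrap(text.splitlines()):
        m = ASSERT_RE.match(line)
        if m:
            assertion = {"file": m.group(1), "line": int(m.group(2)), "expr": m.group(3)}
            continue
        m = FRAME_RE.match(line)
        if m:
            # Drop the "+0x..." offset so only the symbol name remains.
            symbol = m.group(2).rsplit("+0x", 1)[0]
            frames.append((int(m.group(1)), symbol, m.group(3)))
    return assertion, frames


# Excerpt copied from the trace above, including one wrapped frame.
SAMPLE = """\
/build/ceph-13.2.6/src/include/elist.h: 39: FAILED assert(!is_on_list())
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e)
[0x7f90e4947b5e]
2: (()+0x2c4cb7) [0x7f90e4947cb7]
3: (CDentry::~CDentry()+0x372) [0x563a8cde1ee2]
4: (CDentry::~CDentry()+0x9) [0x563a8cde1f19]
5: (CDir::remove_dentry(CDentry*)+0x165) [0x563a8cdee215]
"""

if __name__ == "__main__":
    assertion, frames = parse_crash(SAMPLE)
    print(assertion)
    for num, symbol, addr in frames:
        print(num, symbol, addr)
```

Run against the full log, this lists the assertion in elist.h followed by the 
CDentry / CDir::remove_dentry / MDCache::trim frames, which is the part worth 
quoting when searching the tracker for a matching report.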

Regards
Robert Ruge


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
