Hi, Jos,
Many thanks for your reply.
And sorry, I forgot to mention the version: it is 14.2.22.
Here is the log:
https://drive.google.com/drive/folders/1qzPf64qw16VJDKSzcDoixZ690KL8XSoc?usp=sharing
Here, ceph01 (active) and ceph11 (standby-replay) were the ones that
crashed. The log doesn't tell us much, but several slow requests were
occurring. Also, ceph11 had a "cache is too large" warning by the time
it crashed; I suppose that could happen during recovery. (Each MDS has
64 GB of memory, BTW.)
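In case it helps, this is roughly how we check the cache limit against
current usage on the MDS host (a sketch; "ceph16" is just an example
daemon name, adjust for your deployment):

```shell
# Query the configured cache limit via the MDS admin socket
# (run on the host where the daemon lives; "ceph16" is an example name).
ceph daemon mds.ceph16 config get mds_cache_memory_limit

# Show current cache memory usage relative to that limit.
ceph daemon mds.ceph16 cache status
```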
ceph16 is the one currently in rejoin; I've turned debug_mds up to 20
for a while, captured in ceph-mds.ceph16.log-20220516.gz.
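For the record, I raised the log level through the admin socket roughly
like this (a sketch; "ceph16" is our daemon name, adjust as needed):

```shell
# Turn MDS debugging up to 20 at runtime on the MDS host.
ceph daemon mds.ceph16 config set debug_mds 20

# Remember to turn it back down afterwards; level-20 logs grow fast.
ceph daemon mds.ceph16 config set debug_mds 1/5
```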
Thanks
&
Best regards,
Felix Lee ~
On 5/16/22 14:45, Jos Collin wrote:
It's hard to suggest anything without the logs. Enable verbose logging
with debug_mds=20. What's the Ceph version? Do you have logs showing
why the MDS crashed?
On 16/05/22 11:20, Felix Lee wrote:
Dear all,
We currently have 7 active MDS daemons, with another 7 in
standby-replay. We thought this should cover most disasters, and it
usually did. But things went wrong anyway; here is the story:
One of the MDS daemons crashed and its standby-replay took over, but
got stuck in the resolve state.
Then the other two MDSs (ranks 0 and 5) received tons of slow
requests, and my colleague restarted them, thinking their
standby-replay daemons would take over immediately (which seems to
have been wrong, or at least unnecessary, I guess...). That left three
of them in the resolve state...
Meanwhile, I realized that the first failed rank (rank 2) had abnormal
memory usage and kept crashing. After a couple of restarts, its memory
usage returned to normal, and then those three MDSs entered the rejoin
state.
Now, this rejoin state has lasted for three days and is still going as
we speak. No significant error message shows up even with "debug_mds
10", so we have no idea when it will end or whether it is really on
track.
So, I am wondering: how do we check MDS rejoin progress/status to make
sure it is proceeding normally? Or, how do we estimate the rejoin time
and maybe improve it? We always need to give users a time estimate for
recovery.
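For context, these are the commands we have been using to poke at the
rejoin state so far (a sketch, assuming a daemon named "ceph16"; we're
not sure which counters actually indicate rejoin progress):

```shell
# High-level view of each rank and its state (active / resolve / rejoin).
ceph fs status

# Per-daemon state as reported over the admin socket on the MDS host.
ceph daemon mds.ceph16 status

# Operations currently in flight, to see whether anything is stuck.
ceph daemon mds.ceph16 ops

# MDS performance counters, to watch whether numbers are still moving.
ceph daemon mds.ceph16 perf dump mds
```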
Thanks
&
Best regards,
Felix Lee ~
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
--
Felix H.T Lee Academia Sinica Grid & Cloud.
Tel: +886-2-27898308
Office: Room P111, Institute of Physics, 128 Academia Road, Section 2,
Nankang, Taipei 115, Taiwan