Hi Frank,

On 7/30/23 16:52, Frank Schilder wrote:
Hi Xiubo,

it happened again. This time, we might be able to pull logs from the client 
node. Please take a look at my intermediate action below - thanks!

I am in a bit of a bind: I'm on holiday with a terrible network connection and 
can't do much. My first priority is securing the cluster to avoid damage caused 
by this issue. I evicted the client by ID on the MDS that was reporting the 
warning, using the client ID from the warning. For some reason the client got 
blocked on 2 MDSes after this command, one of which is an ordinary standby 
daemon. Not sure if this is expected.
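
For reference, the eviction was roughly this sequence (the MDS name and client 
ID are placeholders, and the exact syntax may differ slightly between Ceph 
releases):

  # evict the client on the MDS that reported the warning
  ceph tell mds.<mds-name> client evict id=<client-id>

  # check which MDS daemons still list a session for that client
  ceph tell mds.<mds-name> session ls | grep <client-id>

  # eviction normally also blocklists the client
  # (on older releases this is "ceph osd blacklist ls")
  ceph osd blocklist ls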

Main question: is this sufficient to prevent any damaging IO on the cluster? 
I'm thinking here of the MDS eating through all of its RAM until it crashes 
hard in an unrecoverable state (that was described as a consequence in an old 
post about this warning). If this is a safe state, I can keep things as they 
are until I return from holiday.

Yeah, I think so.

BTW, are you using kernel clients (kclient) or user-space clients? I checked both the kclient and libcephfs code; libcephfs looks buggy in a way that could cause this issue, while the kclient looks okay so far.
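
If it helps to check from the MDS side, the session listing usually shows what 
kind of client each session is (field names are from memory and may vary by 
release):

  ceph tell mds.<mds-name> session ls
  # kernel clients typically report "kernel_version" in client_metadata,
  # userspace/libcephfs clients report "ceph_version" instead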

Thanks

- Xiubo


Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Friday, July 28, 2023 11:37 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin


On 7/26/23 22:13, Frank Schilder wrote:
Hi Xiubo.

... I am more interested in the kclient-side logs. I just want to
know why that oldest request got stuck for so long.
I'm afraid I'm a bad admin in this case: I don't have the logs from the host 
any more. I would have needed the output of dmesg, and that is gone. In case it 
happens again, I will try to pull the info out.
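
For next time, this is roughly what I would capture on the client node 
(assuming debugfs is mounted; the exact file names vary by kernel version):

  # kernel ring buffer with ceph/libceph messages
  dmesg -T > dmesg.txt

  # kclient debug info: in-flight MDS requests and caps
  ls /sys/kernel/debug/ceph/
  cat /sys/kernel/debug/ceph/<fsid>.client<id>/mdsc
  cat /sys/kernel/debug/ceph/<fsid>.client<id>/caps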

The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent 
than our situation. We had no problems with the MDSes, the cache didn't grow, 
and the relevant MDS was not put into read-only mode either. It was just this 
warning showing the whole time; health was otherwise OK. I think the warning 
had been there for at least 16 hours before I failed the MDS.

The MDS log contains nothing else; this is the only line mentioning this client:

2023-07-20T00:22:05.518+0200 7fe13df59700  0 log_channel(cluster) log [WRN] : 
client.145678382 does not advance its oldest_client_tid (16121616), 100000 
completed requests recorded in session
Okay, if so it's hard to dig out what has happened in the client and why
it didn't advance the tid.

Thanks

- Xiubo


Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

