Hello,

When the connection between the kernel client and the MDS is lost, a few things happen:

1. Caps become stale:

Aug 11 11:08:14 admin-cap kernel: [308405.227718] ceph: mds0 caps stale

2. The MDS evicts the client for being unresponsive:

MDS log: 2020-08-11 11:12:08.923 7fd1f45ae700  0 log_channel(cluster) log [WRN] : evicting unresponsive client admin-cap.cf.ha.cyberfusion.cloud:DB0001-cap (144786749), after 300.978 seconds
Client log: Aug 11 11:12:11 admin-cap kernel: [308643.051006] ceph: mds0 hung

3. The socket is closed:

Aug 11 11:22:57 admin-cap kernel: [309289.192705] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state OPEN)

I am not sure whether the kernel client or the MDS closes the connection. I think the kernel client does, because nothing is logged on the MDS side at 11:22:57.

4. The MDS resets the connection:

MDS log: 2020-08-11 11:22:58.831 7fd1f9e49700  0 --1- [v2:[fdb7:b01e:7b8e:0:10:10:10:1]:6800/3619156441,v1:[fdb7:b01e:7b8e:0:10:10:10:1]:6849/3619156441] >> v1:[fc00:b6d:cfc:951::7]:0/133007863 conn(0x55bfaf1c2880 0x55c16cb47000 :6849 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
Client log: Aug 11 11:22:58 admin-cap kernel: [309290.058222] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:1]:6849 connection reset

5. The kernel client attempts to reconnect:

Aug 11 11:22:58 admin-cap kernel: [309290.058972] ceph: mds0 closed our session
Aug 11 11:22:58 admin-cap kernel: [309290.058973] ceph: mds0 reconnect start
Aug 11 11:22:58 admin-cap kernel: [309290.069979] ceph: mds0 reconnect denied
Aug 11 11:22:58 admin-cap kernel: [309290.069996] ceph: dropping file locks for 000000006a23d9dd 1099625041446
Aug 11 11:22:58 admin-cap kernel: [309290.071135] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state NEGOTIATING)
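The client-side lines also carry the kernel's own timestamp in brackets (seconds since boot), which gives finer-grained intervals than the syslog prefix. Below is a minimal Python sketch that extracts those values and reports the gap between successive events. The regex assumes the `Mon DD HH:MM:SS host kernel: [seconds] message` layout used in this post, and `kernel_deltas` is a hypothetical helper name, not part of any Ceph tooling.

```python
import re
from typing import List, Optional, Tuple

# Matches the bracketed kernel timestamp right after "kernel:", e.g. "[308405.227718]".
# Later brackets (such as the MDS address) are not matched because they lack the prefix.
KERNEL_TS = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def kernel_deltas(lines: List[str]) -> List[Tuple[str, float]]:
    """Return (message, seconds since the previous kernel event) per kernel line.

    The first event gets a delta of 0.0; lines without a kernel timestamp are skipped.
    """
    result: List[Tuple[str, float]] = []
    previous: Optional[float] = None
    for line in lines:
        match = KERNEL_TS.search(line)
        if match is None:
            continue  # not a kernel log line
        ts = float(match.group(1))
        message = line[match.end():].strip()
        delta = 0.0 if previous is None else round(ts - previous, 6)
        result.append((message, delta))
        previous = ts
    return result


if __name__ == "__main__":
    client_log = [
        "Aug 11 11:08:14 admin-cap kernel: [308405.227718] ceph: mds0 caps stale",
        "Aug 11 11:12:11 admin-cap kernel: [308643.051006] ceph: mds0 hung",
        "Aug 11 11:22:57 admin-cap kernel: [309289.192705] libceph: mds0 "
        "[fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state OPEN)",
        "Aug 11 11:22:58 admin-cap kernel: [309290.058222] libceph: mds0 "
        "[fdb7:b01e:7b8e:0:10:10:10:1]:6849 connection reset",
    ]
    for message, delta in kernel_deltas(client_log):
        print(f"{delta:>12.6f} s  {message}")
```

On these lines, the gap from `mds0 hung` to `connection reset` comes to about 647 seconds (646.141699 + 0.865517), which agrees with the syslog times 11:12:11 and 11:22:58.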

Question:

As you can see, roughly 10 minutes pass between losing the connection and the reconnection attempt (11:12:08 to 11:22:58). I could not find any setting that controls how long the client waits before attempting to reconnect. I would like to lower this from 10 minutes to something like 1 minute. I also searched the Ceph docs for the string '600' (10 minutes in seconds), but did not find anything useful.
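The timings can also be checked from the MDS log alone, which has millisecond timestamps. This is a minimal Python sketch. The helper names are hypothetical, and the second calculation rests on an assumption the logs do not confirm: that the 300.978 seconds in the eviction message counts back from the eviction to the last time the MDS heard from the client.

```python
from datetime import datetime, timedelta

# MDS log lines start with a timestamp like "2020-08-11 11:12:08.923"
MDS_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_mds_ts(ts: str) -> datetime:
    """Parse an MDS log timestamp (milliseconds are read via %f)."""
    return datetime.strptime(ts, MDS_TS_FORMAT)


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two MDS log timestamps."""
    return (parse_mds_ts(end) - parse_mds_ts(start)).total_seconds()


def implied_last_contact(eviction_ts: str, idle_seconds: float) -> datetime:
    """Hypothetical helper. ASSUMPTION: idle_seconds is measured back from the
    eviction itself to the last time the MDS heard from the client."""
    return parse_mds_ts(eviction_ts) - timedelta(seconds=idle_seconds)


if __name__ == "__main__":
    eviction = "2020-08-11 11:12:08.923"  # "evicting unresponsive client"
    reset = "2020-08-11 11:22:58.831"     # "sending RESETSESSION"

    gap = elapsed_seconds(eviction, reset)
    minutes, seconds = divmod(gap, 60)
    print(f"eviction -> reset: {gap:.3f} s ({int(minutes)} min {seconds:.3f} s)")

    last_seen = implied_last_contact(eviction, 300.978)
    print("implied last contact:", last_seen.strftime("%H:%M:%S.%f")[:-3])
```

With these two timestamps the gap is 649.908 seconds (about 10 min 50 s), so the delay is close to, but not exactly, 600 seconds. Under the stated assumption, the implied last contact is 11:07:07.945, about a minute before the client logged `mds0 caps stale` at 11:08:14.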

Hope someone can help.

Environment details:

Client kernel: 4.19.0-10-amd64
Ceph version: ceph version 14.2.9 (bed944f8c45b9c98485e99b70e11bbcec6f6659a) nautilus (stable)


Kind regards,

William Edwards

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
