[ceph-users] Re: Client failing to respond to capability release

2023-09-01 Thread Patrick Donnelly
Hello Frank,

On Tue, Aug 22, 2023 at 11:42 AM Frank Schilder  wrote:
>
> Hi all,
>
> I have this warning the whole day already (octopus latest cluster):
>
> HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not 
> deep-scrubbed in time
> [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability 
> release
> mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 145698301
> mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 189511877
> mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 189511887
> mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to 
> respond to capability release client_id: 231250695
>
> If I look at the session info from mds.1 for these clients I see this:
>
> # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: 
> .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: 
> .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 
> -e 189511877 -e 189511887 -e 231250695
> {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 
> v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
> {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 
> v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
> {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 
> v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
> {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 
> v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
>
> We have mds_min_caps_per_client=4096, so it looks like the limit is well 
> satisfied. Also, the file system is pretty idle at the moment.
>
> Why and what exactly is the MDS complaining about here?

These days, you'll generally see this because the client is "quiet"
and the MDS is opportunistically recalling caps to reduce future work
when shrinking its cache is necessary. This would be indicated by:

* The MDS is not complaining about an oversized cache.
* The session listing shows the session is quiet (the
"session_cache_liveness" is near 0).

However, the MDS should respect mds_min_caps_per_client by (a) not
recalling caps past the point where a client holds fewer than
mds_min_caps_per_client and (b) not complaining that a quiet client
holds fewer than mds_min_caps_per_client caps.
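
If you want to double-check the "quiet session" part the next time this
shows up, something like this should do it (a rough sketch reusing the
fields from your own jq one-liner; the client IDs are the ones from the
warning above):

ceph tell mds.1 session ls | jq -c '
  .[]
  | select(.id == 145698301 or .id == 189511877
           or .id == 189511887 or .id == 231250695)
  | {id, host: .client_metadata.hostname,
     caps: .num_caps, liveness: .session_cache_liveness}'

A liveness value near 0 for all four sessions would fit the "quiet
client, opportunistic recall" picture.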

So, you may have found a bug. The next time this happens, a `ceph tell
mds.X config diff`, a `ceph tell mds.X perf dump`, and a selection of
the relevant `ceph tell mds.X session ls` output will help debug this,
I think.
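
Something like this would capture it all in one go (a rough sketch; the
rank and the client IDs are just the ones from this incident, adjust as
needed):

out=/tmp/mds1-late-release-$(date +%Y%m%d-%H%M%S)
mkdir -p "$out"
ceph tell mds.1 config diff > "$out/config-diff.json"
ceph tell mds.1 perf dump   > "$out/perf-dump.json"
ceph tell mds.1 session ls  > "$out/session-ls.json"
# keep only the sessions named in the health warning for easier sharing
jq '[.[] | select(.id == 145698301 or .id == 189511877
                  or .id == 189511887 or .id == 231250695)]' \
  "$out/session-ls.json" > "$out/session-ls-flagged.json"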

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Client failing to respond to capability release

2023-08-23 Thread Eugen Block
I see, that was also SUSE's recommendation [2] but without a real  
explanation, just some assumptions about a possible network disconnect.


[2] https://www.suse.com/support/kb/doc/?id=19628
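
If a brief network disconnect were the cause, the kernel client usually
leaves traces on the affected node; a quick (and admittedly generic)
check would be:

dmesg -T | grep -iE 'libceph|ceph:'

libceph logs socket errors and reconnects when the session to the MDS
or the monitors drops.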

Zitat von Frank Schilder :


Hi Eugen, thanks for that :D

This time it was something different. Possibly a bug in the kclient.
On these nodes I found sync commands stuck in D-state. I guess a
file/dir could not be synced or there was some kind of corruption of
buffered data. We had to reboot the servers to clear that out.


On first inspection these clients looked OK. Only some deeper  
debugging revealed that something was off.


Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Wednesday, August 23, 2023 8:55 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Client failing to respond to capability release

Hi,

pointing you to your own thread [1] ;-)

[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HFILR5NMUCEZH7TJSGSACPI4P23XTULI/

Zitat von Frank Schilder :


Hi all,

I have this warning the whole day already (octopus latest cluster):

HEALTH_WARN 4 clients failing to respond to capability release; 1
pgs not deep-scrubbed in time
[WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to
capability release
mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc
failing to respond to capability release client_id: 145698301
mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc
failing to respond to capability release client_id: 189511877
mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc
failing to respond to capability release client_id: 189511887
mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc
failing to respond to capability release client_id: 231250695

If I look at the session info from mds.1 for these clients I see this:

# ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h:
.client_metadata.hostname, addr: .inst, fs: .client_metadata.root,
caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep
-e 145698301 -e 189511877 -e 189511887 -e 231250695
{"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887
v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
{"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695
v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
{"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877
v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
{"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301
v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}

We have mds_min_caps_per_client=4096, so it looks like the limit is
well satisfied. Also, the file system is pretty idle at the moment.

Why and what exactly is the MDS complaining about here?

Thanks and best regards.
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Client failing to respond to capability release

2023-08-23 Thread Frank Schilder
Hi Dhairya,

this is the thing: the client appeared to be responsive and worked fine (the
file system was online and responsive as usual). There was something off,
though; see my response to Eugen.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dhairya Parmar 
Sent: Wednesday, August 23, 2023 9:05 AM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Client failing to respond to capability release

Hi Frank,

This usually happens when the client is buggy or unresponsive. The warning is
triggered when the client fails to respond to the MDS's request to release caps
within the time allowed by session_timeout (defaults to 60 seconds). Did you
make any config changes?


Dhairya Parmar

Associate Software Engineer, CephFS

Red Hat Inc.

dpar...@redhat.com



On Tue, Aug 22, 2023 at 9:12 PM Frank Schilder <fr...@dtu.dk> wrote:
Hi all,

I have this warning the whole day already (octopus latest cluster):

HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not 
deep-scrubbed in time
[WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability 
release
mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to 
respond to capability release client_id: 145698301
mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to 
respond to capability release client_id: 189511877
mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to 
respond to capability release client_id: 189511887
mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to 
respond to capability release client_id: 231250695

If I look at the session info from mds.1 for these clients I see this:

# ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: 
.client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: 
.num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 -e 
189511877 -e 189511887 -e 231250695
{"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887
 
v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
{"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695
 
v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
{"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877
 
v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
{"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301
 
v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}

We have mds_min_caps_per_client=4096, so it looks like the limit is well 
satisfied. Also, the file system is pretty idle at the moment.

Why and what exactly is the MDS complaining about here?

Thanks and best regards.
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Client failing to respond to capability release

2023-08-23 Thread Frank Schilder
Hi Eugen, thanks for that :D

This time it was something different. Possibly a bug in the kclient. On these
nodes I found sync commands stuck in D-state. I guess a file/dir could not be
synced or there was some kind of corruption of buffered data. We had to reboot
the servers to clear that out.

On first inspection these clients looked OK. Only some deeper debugging 
revealed that something was off.
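
For reference, the kind of generic check that makes such stuck tasks
visible (a rough sketch, run as root on the client node; <pid> is a
placeholder):

# list tasks in uninterruptible sleep (D state); a wedged sync shows up here
ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'
# dump the kernel stack of a suspect PID to see where it is blocked
cat /proc/<pid>/stack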

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Wednesday, August 23, 2023 8:55 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Client failing to respond to capability release

Hi,

pointing you to your own thread [1] ;-)

[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HFILR5NMUCEZH7TJSGSACPI4P23XTULI/

Zitat von Frank Schilder :

> Hi all,
>
> I have this warning the whole day already (octopus latest cluster):
>
> HEALTH_WARN 4 clients failing to respond to capability release; 1
> pgs not deep-scrubbed in time
> [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to
> capability release
> mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc
> failing to respond to capability release client_id: 145698301
> mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc
> failing to respond to capability release client_id: 189511877
> mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc
> failing to respond to capability release client_id: 189511887
> mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc
> failing to respond to capability release client_id: 231250695
>
> If I look at the session info from mds.1 for these clients I see this:
>
> # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h:
> .client_metadata.hostname, addr: .inst, fs: .client_metadata.root,
> caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep
> -e 145698301 -e 189511877 -e 189511887 -e 231250695
> {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887
> v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
> {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695
> v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
> {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877
> v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
> {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301
> v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
>
> We have mds_min_caps_per_client=4096, so it looks like the limit is
> well satisfied. Also, the file system is pretty idle at the moment.
>
> Why and what exactly is the MDS complaining about here?
>
> Thanks and best regards.
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Client failing to respond to capability release

2023-08-23 Thread Dhairya Parmar
Hi Frank,

This usually happens when the client is buggy or unresponsive. The warning is
triggered when the client fails to respond to the MDS's request to release caps
within the time allowed by session_timeout (defaults to 60 seconds). Did you
make any config changes?
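
If you haven't changed it, you can confirm the effective value like this
(a sketch; "con-fs2" is only a guess at your file system name based on
the client mount tags):

# session_timeout is a per-filesystem setting, 60 seconds by default
ceph fs get con-fs2 | grep session_timeout
# it can be raised if really needed, e.g.
# ceph fs set con-fs2 session_timeout 120

Raising it only delays the warning though; a client that never answers
the recall will still be flagged eventually.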


*Dhairya Parmar*

Associate Software Engineer, CephFS

Red Hat Inc. 

dpar...@redhat.com



On Tue, Aug 22, 2023 at 9:12 PM Frank Schilder  wrote:

> Hi all,
>
> I have this warning the whole day already (octopus latest cluster):
>
> HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not
> deep-scrubbed in time
> [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability
> release
> mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing
> to respond to capability release client_id: 145698301
> mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing
> to respond to capability release client_id: 189511877
> mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing
> to respond to capability release client_id: 189511887
> mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing
> to respond to capability release client_id: 231250695
>
> If I look at the session info from mds.1 for these clients I see this:
>
> # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h:
> .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps:
> .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301
> -e 189511877 -e 189511887 -e 231250695
> {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:
> 192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
> {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:
> 192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
> {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:
> 192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
> {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:
> 192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
>
> We have mds_min_caps_per_client=4096, so it looks like the limit is well
> satisfied. Also, the file system is pretty idle at the moment.
>
> Why and what exactly is the MDS complaining about here?
>
> Thanks and best regards.
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Client failing to respond to capability release

2023-08-22 Thread Eugen Block

Hi,

pointing you to your own thread [1] ;-)

[1]  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HFILR5NMUCEZH7TJSGSACPI4P23XTULI/


Zitat von Frank Schilder :


Hi all,

I have this warning the whole day already (octopus latest cluster):

HEALTH_WARN 4 clients failing to respond to capability release; 1  
pgs not deep-scrubbed in time
[WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to  
capability release
mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc  
failing to respond to capability release client_id: 145698301
mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc  
failing to respond to capability release client_id: 189511877
mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc  
failing to respond to capability release client_id: 189511887
mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc  
failing to respond to capability release client_id: 231250695


If I look at the session info from mds.1 for these clients I see this:

# ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h:  
.client_metadata.hostname, addr: .inst, fs: .client_metadata.root,  
caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep  
-e 145698301 -e 189511877 -e 189511887 -e 231250695
{"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887  
v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
{"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695  
v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
{"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877  
v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
{"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301  
v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}


We have mds_min_caps_per_client=4096, so it looks like the limit is  
well satisfied. Also, the file system is pretty idle at the moment.


Why and what exactly is the MDS complaining about here?

Thanks and best regards.
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io