[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-26 Thread Janek Bevendorff
I have had defer_client_eviction_on_laggy_osds set to false for a while 
now, and I haven't had any further warnings (obviously), but all the 
other problems with laggy clients bringing our MDS to a crawl over time 
also seem to have gone away. So at least on our cluster, the new 
configurable seems to do more harm than good. I can see why it exists, 
but the implementation appears to be rather buggy.


I also set mds_session_blocklist_on_timeout to false, because I had the 
impression that clients were being blocklisted too quickly.
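For reference, the two settings mentioned above can be changed like this (a sketch only; both option names come from this thread, so verify they exist on your Ceph version before applying):

```shell
# Do not defer eviction of laggy clients just because some OSDs are laggy
ceph config set mds defer_client_eviction_on_laggy_osds false

# Do not blocklist client sessions that merely time out
ceph config set mds mds_session_blocklist_on_timeout false
```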



[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-21 Thread Janek Bevendorff

Hi,

I took a snapshot of MDS.0's logs. We have five active MDS in total, 
each one reporting laggy OSDs/clients, but I cannot find anything 
related to that in the log snippet. Anyhow, I uploaded the log for your 
reference with ceph-post-file ID 79b5138b-61d7-4ba7-b0a9-c6f02f47b881.


This is what ceph status looks like after a couple of days. This is not 
normal:


HEALTH_WARN
55 client(s) laggy due to laggy OSDs
8 clients failing to respond to capability release
1 clients failing to advance oldest client/flush tid
5 MDSs report slow requests

(55 clients are actually "just" 11 unique client IDs, but each MDS makes 
their own report.)


mon_osd_laggy_halflife is not configured on our cluster, so it's at the 
default of 3600.



Janek



[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-20 Thread Dhairya Parmar
Hi Janek,

The PR Venky mentioned uses the OSD's laggy parameters (laggy_interval and
laggy_probability) to determine whether an OSD is laggy. These parameters
are reset to 0 only if the interval between the last modification to the
OSDMap and the time stamp when the OSD was marked down exceeds the grace
threshold, which is `mon_osd_laggy_halflife * 48`. mon_osd_laggy_halflife
is configurable and defaults to 3600, so with the default the laggy
parameters are only reset once that interval exceeds 172800 seconds. I'd
recommend checking what your configured value is (using cmd:
ceph config get osd mon_osd_laggy_halflife).

There is also a "hack" to reset the parameters manually (*not recommended,
just for info*): set mon_osd_laggy_weight to 1 using `ceph config set osd
mon_osd_laggy_weight 1` and reboot the OSD(s) that are reported as laggy,
and you will see the lagginess go away.
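The reset condition above can be made concrete with a small sketch (the function name is illustrative, not a Ceph source identifier; the factor 48 is taken from this explanation):

```python
# Grace interval after which an OSD's laggy parameters (laggy_interval,
# laggy_probability) reset to 0, as described above:
# mon_osd_laggy_halflife * 48.
def laggy_reset_grace(mon_osd_laggy_halflife: int = 3600) -> int:
    """Seconds that must elapse before the laggy parameters reset."""
    return mon_osd_laggy_halflife * 48

# With the default halflife of 3600 s this is 172800 s, i.e. 48 hours.
print(laggy_reset_grace())  # → 172800
```

So with an unchanged mon_osd_laggy_halflife, an OSD's laggy parameters stick around for up to two days after the OSD was last marked down.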


*Dhairya Parmar*

Associate Software Engineer, CephFS

Red Hat Inc. 

dpar...@redhat.com




[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-20 Thread Venky Shankar
Hey Janek,

I took a closer look at the various places where the MDS considers a
client laggy, and it seems a wide variety of reasons is taken into
account, not all of which warrant deferring client eviction, so the
warning is a bit misleading. I'll post a PR for this. In the meantime,
could you share the debug logs mentioned in my previous email?


-- 
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-20 Thread Venky Shankar
Hi Janek,

On Tue, Sep 19, 2023 at 4:44 PM Janek Bevendorff <
janek.bevendo...@uni-weimar.de> wrote:

> Hi Venky,
>
> As I said: There are no laggy OSDs. The maximum ping I have for any OSD in
> ceph osd perf is around 60ms (just a handful, probably aging disks). The
> vast majority of OSDs have ping times of less than 1ms. Same for the host
> machines, yet I'm still seeing this message. It seems that the affected
> hosts are usually the same, but I have absolutely no clue why.
>

It's possible that you are running into a bug in which the laggy-clients
list that the MDS sends to the monitors via beacons is not cleared. Could
you help us out with debug MDS logs (by setting debug_mds=20) from the
active MDS for around 15-20 seconds and share them? Also reset the log
level once done, since it can hurt performance.

# ceph config set mds.<> debug_mds 20

and reset via

# ceph config rm mds.<> debug_mds
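Putting the two commands together, a capture session could look like this (mds.0 is a hypothetical placeholder; substitute your active MDS name):

```shell
ceph config set mds.0 debug_mds 20    # raise MDS log verbosity
sleep 20                              # capture ~20 seconds of debug logs
ceph config rm mds.0 debug_mds        # reset; debug_mds=20 hurts performance
```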



-- 
Cheers,
Venky


[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-19 Thread Janek Bevendorff

Hi Venky,

As I said: There are no laggy OSDs. The maximum ping I have for any OSD 
in ceph osd perf is around 60ms (just a handful, probably aging disks). 
The vast majority of OSDs have ping times of less than 1ms. Same for the 
host machines, yet I'm still seeing this message. It seems that the 
affected hosts are usually the same, but I have absolutely no clue why.


Janek




--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de




[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-19 Thread Venky Shankar
Hi Janek,

On Mon, Sep 18, 2023 at 9:52 PM Janek Bevendorff <
janek.bevendo...@uni-weimar.de> wrote:

> Thanks! However, I still don't really understand why I am seeing this.
>

This is due to a change that was merged recently in Pacific:

https://github.com/ceph/ceph/pull/52270

The MDS no longer evicts laggy clients if the OSDs report as laggy. Laggy
OSDs can cause CephFS clients to fail to flush dirty data (during cap
revokes by the MDS), thereby showing up as laggy and getting evicted by
the MDS. This behaviour was changed, so you now get warnings that some
clients are laggy, but they are not evicted since the OSDs are laggy.


> The first time I had this, one of the clients was a remote user dialling
> in via VPN, which could indeed be laggy. But I am also seeing it from
> neighbouring hosts that are on the same physical network with reliable ping
> times way below 1ms. How is that considered laggy?
>
Are some of your OSDs reporting as laggy? This can be checked via `perf dump`:

> ceph tell mds.<> perf dump
(search for op_laggy/osd_laggy)
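To filter those counters out of the (fairly large) dump, something like this should work (mds.0 is a placeholder for your MDS name):

```shell
# Dump the MDS perf counters and keep only the laggy-related ones.
ceph tell mds.0 perf dump | grep -E 'op_laggy|osd_laggy'
```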




-- 
Cheers,
Venky


[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-18 Thread Janek Bevendorff

Thanks! However, I still don't really understand why I am seeing this.

The first time I had this, one of the clients was a remote user dialling 
in via VPN, which could indeed be laggy. But I am also seeing it from 
neighbouring hosts that are on the same physical network with reliable 
ping times way below 1ms. How is that considered laggy?





--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de




[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-18 Thread Laura Flores
Hi Janek,

There was some documentation added about it here:
https://docs.ceph.com/en/pacific/cephfs/health-messages/

There is a description of what it means, and it's tied to an mds
configurable.

On Mon, Sep 18, 2023 at 10:51 AM Janek Bevendorff <
janek.bevendo...@uni-weimar.de> wrote:

> Hey all,
>
> Since the upgrade to Ceph 16.2.14, I keep seeing the following warning:
>
> 10 client(s) laggy due to laggy OSDs
>
> ceph health detail shows it as:
>
> [WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
>  mds.***(mds.3): Client *** is laggy; not evicted because some
> OSD(s) is/are laggy
>  more of this...
>
> When I restart the client(s) or the affected MDS daemons, the message
> goes away and then comes back after a while. ceph osd perf does not list
> any laggy OSDs (a few with 10-60ms ping, but overwhelmingly < 1ms), so
> I'm at a total loss as to what this even means.
>
> I have never seen this message before nor was I able to find anything
> about it. Do you have any idea what this message actually means and how
> I can get rid of it?
>
> Thanks
> Janek
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804