Re: [lustre-discuss] LDLM locks not expiring/cancelling

2020-01-07 Thread Steve Crusan
Thanks Diego, long time no see! I haven't been using NRS TBF.

I think there are a few problems, some of which we were aware of before, but
the lack of lock cancels was causing chaos.

* Mark lustre_inode_cache as reclaimable:
https://jira.whamcloud.com/browse/LU-12313
* I tested a 2.12.3 client (without the patch above), and we actually do get
lock cancels there.

So I think I'll join 2020 and run 2.12.3 and probably add the SUnreclaim
patch to that as well, as it seems simple enough.
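
For anyone curious, this is roughly how I've been watching whether the inode
cache ends up in the unreclaimable column; these are standard kernel
interfaces, nothing Lustre-specific:

"""
# unreclaimable vs. reclaimable slab totals
grep -E 'SUnreclaim|SReclaimable' /proc/meminfo

# size of the Lustre inode cache itself (active_objs * objsize)
awk '$1 == "lustre_inode_cache" {printf "%s: %.1f MiB\n", $1, $2*$4/1048576}' /proc/slabinfo

# biggest slab consumers overall, sorted by cache size
slabtop -o -s c | head -20
"""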

Thank you!

~Steve

On Mon, Jan 6, 2020 at 2:33 AM Moreno Diego (ID SIS) <
diego.mor...@id.ethz.ch> wrote:

> Hi Steve,
>
>
>
> I was having a similar problem in the past months where the MDS servers
> would go OOM because of unreclaimable slab (SUnreclaim) growth. The root
> cause has not yet been found, but we stopped seeing this the day we disabled
> NRS TBF (QoS) for the LDLM services (just in case you have it enabled). It
> would also be good to check what's being consumed in the slab cache; in our
> case it was mostly kernel objects, not ldlm.
>
>
>
> Diego
>
>
>
>
>
> *From: *lustre-discuss  on
> behalf of Steve Crusan 
> *Date: *Thursday, 2 January 2020 at 20:25
> *To: *"lustre-discuss@lists.lustre.org" 
> *Subject: *[lustre-discuss] LDLM locks not expiring/cancelling
>
>
>
> Hi all,
>
>
>
> We are running into a bizarre situation where we aren't having stale locks
> cancel themselves, and even worse, it seems as if
> ldlm.namespaces.*.lru_size is being ignored.
>
>
>
> For instance, I unmount our Lustre file systems on a client machine, then
> remount. Next, I'll run "lctl set_param ldlm.namespaces.*.lru_max_age=60s"
> and "lctl set_param ldlm.namespaces.*.lru_size=1024". This (I believe)
> theoretically would only allow 1024 ldlm locks per osc, and then I'd see a
> lot of lock cancels (via ldlm.namespaces.${ost}.pool.stats). We also should
> see cancels if the grant time > lru_max_age.
>
>
>
> We can trigger this simply by running 'find' on the root of our Lustre
> file system and waiting for a while. Eventually the client's SUnreclaim
> value bloats to 60-70GB (!!!), and each of our OSTs has 30-40k LRU locks
> (via lock_count). This is early in the process:
>
>
>
> """
>
> ldlm.namespaces.h5-OST003f-osc-8802d8559000.lock_count=2090
> ldlm.namespaces.h5-OST0040-osc-8802d8559000.lock_count=2127
> ldlm.namespaces.h5-OST0047-osc-8802d8559000.lock_count=52
> ldlm.namespaces.h5-OST0048-osc-8802d8559000.lock_count=1962
> ldlm.namespaces.h5-OST0049-osc-8802d8559000.lock_count=1247
> ldlm.namespaces.h5-OST004a-osc-8802d8559000.lock_count=1642
> ldlm.namespaces.h5-OST004b-osc-8802d8559000.lock_count=1340
> ldlm.namespaces.h5-OST004c-osc-8802d8559000.lock_count=1208
> ldlm.namespaces.h5-OST004d-osc-8802d8559000.lock_count=1422
> ldlm.namespaces.h5-OST004e-osc-8802d8559000.lock_count=1244
> ldlm.namespaces.h5-OST004f-osc-8802d8559000.lock_count=1117
> ldlm.namespaces.h5-OST0050-osc-8802d8559000.lock_count=1165
>
> """
>
>
>
> But this will grow over time, and eventually this compute node gets
> evicted from the MDS (after 10 minutes of cancelling locks/hanging). The
> only way we have been able to reduce the slab usage is to drop caches and
> set lru_size=clear... but the problem just comes back depending on the workload.
>
>
>
> We are running 2.10.3 client side, 2.10.1 server side. Have there been any
> fixes added into the codebase for 2.10 that we need to apply? This seems to
> be the closest to what we are experiencing:
>
>
>
> https://jira.whamcloud.com/browse/LU-11518
>
>
>
>
>
> PS: I've checked other systems across our cluster, and some of them have
> as many as 50k locks per OST. I am kind of wondering if these locks are
> staying around much longer than the lru_max_age default (65 minutes), but I
> cannot prove that. Is there a good way to translate held locks to fids? I
> have been messing around with lctl set_param debug="XXX" and lctl set_param
> ldlm.namespaces.*.dump_namespace, but I don't feel like I'm getting *all*
> of the locks.
>
>
>
> ~Steve
>


-- 
*Steve Crusan*
Storage Specialist

DownUnder GeoSolutions
16200 Park Row Drive, Suite 100
Houston TX 77084, USA
tel +1 832 582 3221
ste...@dug.com
www.dug.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LDLM locks not expiring/cancelling

2020-01-06 Thread Moreno Diego (ID SIS)
Hi Steve,

I was having a similar problem in the past months where the MDS servers would
go OOM because of unreclaimable slab (SUnreclaim) growth. The root cause has
not yet been found, but we stopped seeing this the day we disabled NRS TBF
(QoS) for the LDLM services (just in case you have it enabled). It would also
be good to check what's being consumed in the slab cache; in our case it was
mostly kernel objects, not ldlm.
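
If it helps, this is roughly what we look at; slabtop and /proc/meminfo are
standard kernel interfaces, while the ldlm.services paths below are just how
the NRS state shows up on our servers, so treat them as an example and adjust
the names for your version:

"""
# largest slab consumers on the MDS, sorted by cache size
slabtop -o -s c | head -20

# which NRS policies are active on the LDLM services (example paths)
lctl get_param ldlm.services.ldlm_canceld.nrs_policies
lctl get_param ldlm.services.ldlm_cbd.nrs_policies

# falling back to the default FIFO policy would look roughly like this:
# lctl set_param ldlm.services.ldlm_canceld.nrs_policies=fifo
# lctl set_param ldlm.services.ldlm_cbd.nrs_policies=fifo
"""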

Diego


From: lustre-discuss  on behalf of 
Steve Crusan 
Date: Thursday, 2 January 2020 at 20:25
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] LDLM locks not expiring/cancelling

Hi all,

We are running into a bizarre situation where we aren't having stale locks 
cancel themselves, and even worse, it seems as if ldlm.namespaces.*.lru_size is 
being ignored.

For instance, I unmount our Lustre file systems on a client machine, then 
remount. Next, I'll run "lctl set_param ldlm.namespaces.*.lru_max_age=60s" and
"lctl set_param ldlm.namespaces.*.lru_size=1024". This (I believe) theoretically
would only allow 1024 ldlm locks per osc, and then I'd see a lot of lock 
cancels (via ldlm.namespaces.${ost}.pool.stats). We also should see cancels if 
the grant time > lru_max_age.

We can trigger this simply by running 'find' on the root of our Lustre file 
system and waiting for a while. Eventually the client's SUnreclaim value bloats
to 60-70GB (!!!), and each of our OSTs has 30-40k LRU locks (via lock_count).
This is early in the process:

"""
ldlm.namespaces.h5-OST003f-osc-8802d8559000.lock_count=2090
ldlm.namespaces.h5-OST0040-osc-8802d8559000.lock_count=2127
ldlm.namespaces.h5-OST0047-osc-8802d8559000.lock_count=52
ldlm.namespaces.h5-OST0048-osc-8802d8559000.lock_count=1962
ldlm.namespaces.h5-OST0049-osc-8802d8559000.lock_count=1247
ldlm.namespaces.h5-OST004a-osc-8802d8559000.lock_count=1642
ldlm.namespaces.h5-OST004b-osc-8802d8559000.lock_count=1340
ldlm.namespaces.h5-OST004c-osc-8802d8559000.lock_count=1208
ldlm.namespaces.h5-OST004d-osc-8802d8559000.lock_count=1422
ldlm.namespaces.h5-OST004e-osc-8802d8559000.lock_count=1244
ldlm.namespaces.h5-OST004f-osc-8802d8559000.lock_count=1117
ldlm.namespaces.h5-OST0050-osc-8802d8559000.lock_count=1165
"""

But this will grow over time, and eventually this compute node gets evicted 
from the MDS (after 10 minutes of cancelling locks/hanging). The only way we 
have been able to reduce the slab usage is to drop caches and set 
lru_size=clear... but the problem just comes back depending on the workload.

We are running 2.10.3 client side, 2.10.1 server side. Have there been any 
fixes added into the codebase for 2.10 that we need to apply? This seems to be 
the closest to what we are experiencing:

https://jira.whamcloud.com/browse/LU-11518


PS: I've checked other systems across our cluster, and some of them have as 
many as 50k locks per OST. I am kind of wondering if these locks are staying 
around much longer than the lru_max_age default (65 minutes), but I cannot 
prove that. Is there a good way to translate held locks to fids? I have been 
messing around with lctl set_param debug="XXX" and lctl set_param 
ldlm.namespaces.*.dump_namespace, but I don't feel like I'm getting *all* of 
the locks.

~Steve
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] LDLM locks not expiring/cancelling

2020-01-02 Thread Steve Crusan
Hi all,

We are running into a bizarre situation where we aren't having stale locks
cancel themselves, and even worse, it seems as if
ldlm.namespaces.*.lru_size is being ignored.

For instance, I unmount our Lustre file systems on a client machine, then
remount. Next, I'll run "lctl set_param ldlm.namespaces.*.lru_max_age=60s"
and "lctl set_param ldlm.namespaces.*.lru_size=1024". This (I believe)
theoretically would only allow 1024 ldlm locks per osc, and then I'd see a
lot of lock cancels (via ldlm.namespaces.${ost}.pool.stats). We also should
see cancels if the grant time > lru_max_age.
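
For reference, the full sequence I'm running looks roughly like this (note that
older clients express lru_max_age in milliseconds while newer ones accept a
unit suffix, so the value may need adjusting; the cancel counters are what we
see in pool.stats on our 2.10 clients):

"""
# shrink the per-OSC LRU and age locks out aggressively
# (a non-zero lru_size also disables the dynamic LRU sizing)
lctl set_param ldlm.namespaces.*.lru_max_age=60s    # older clients: 60000 (ms)
lctl set_param ldlm.namespaces.*.lru_size=1024

# watch lock counts and cancel activity per namespace
lctl get_param ldlm.namespaces.*.lock_count
lctl get_param ldlm.namespaces.*.pool.stats | grep -i cancel
"""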

We can trigger this simply by running 'find' on the root of our Lustre file
system and waiting for a while. Eventually the client's SUnreclaim value
bloats to 60-70GB (!!!), and each of our OSTs has 30-40k LRU locks (via
lock_count). This is early in the process:

"""
ldlm.namespaces.h5-OST003f-osc-8802d8559000.lock_count=2090
ldlm.namespaces.h5-OST0040-osc-8802d8559000.lock_count=2127
ldlm.namespaces.h5-OST0047-osc-8802d8559000.lock_count=52
ldlm.namespaces.h5-OST0048-osc-8802d8559000.lock_count=1962
ldlm.namespaces.h5-OST0049-osc-8802d8559000.lock_count=1247
ldlm.namespaces.h5-OST004a-osc-8802d8559000.lock_count=1642
ldlm.namespaces.h5-OST004b-osc-8802d8559000.lock_count=1340
ldlm.namespaces.h5-OST004c-osc-8802d8559000.lock_count=1208
ldlm.namespaces.h5-OST004d-osc-8802d8559000.lock_count=1422
ldlm.namespaces.h5-OST004e-osc-8802d8559000.lock_count=1244
ldlm.namespaces.h5-OST004f-osc-8802d8559000.lock_count=1117
ldlm.namespaces.h5-OST0050-osc-8802d8559000.lock_count=1165
"""

But this will grow over time, and eventually this compute node gets evicted
from the MDS (after 10 minutes of cancelling locks/hanging). The only way
we have been able to reduce the slab usage is to drop caches and set
lru_size=clear... but the problem just comes back depending on the workload.
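
Concretely, the band-aid amounts to something like this:

"""
# drop the client's LDLM LRU entirely, then the kernel page/dentry/inode caches
lctl set_param ldlm.namespaces.*.lru_size=clear
echo 3 > /proc/sys/vm/drop_caches
"""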

We are running 2.10.3 client side, 2.10.1 server side. Have there been any
fixes added into the codebase for 2.10 that we need to apply? This seems to
be the closest to what we are experiencing:

https://jira.whamcloud.com/browse/LU-11518


PS: I've checked other systems across our cluster, and some of them have as
many as 50k locks per OST. I am kind of wondering if these locks are
staying around much longer than the lru_max_age default (65 minutes), but I
cannot prove that. Is there a good way to translate held locks to fids? I
have been messing around with lctl set_param debug="XXX" and lctl set_param
ldlm.namespaces.*.dump_namespace, but I don't feel like I'm getting *all*
of the locks.
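
In case it helps, the rough sequence I've been using is below. One thing I
suspect is that the default debug buffer simply wraps before the dump finishes,
so I bump debug_mb first. The fid2path line is a hypothetical example of
resolving an MDT lock resource (assuming the filesystem is mounted at /mnt/h5),
not output from our system:

"""
lctl set_param debug_mb=512           # enlarge the debug buffer so the dump doesn't wrap
lctl set_param debug=+dlmtrace
lctl set_param ldlm.namespaces.*.dump_namespace=1
lctl dk /tmp/ldlm_dump.txt            # flush the kernel debug log to a file
grep -c 'ns: ' /tmp/ldlm_dump.txt     # rough count of dumped lock/resource entries

# MDT lock resources are FIDs, so something like this should map one to a path:
# lfs fid2path /mnt/h5 '[0x200000bd1:0x4d:0x0]'
"""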

~Steve
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org