[lustre-discuss] Eternally Invalid Lock?

2024-04-24 Thread Ellis Wilson via lustre-discuss
Hi all,

(This is on 2.15.4 with very limited modifications, none to speak of in ldlm or 
similar)

Very rarely, when attempting to perform an lctl barrier_freeze, we run into a 
situation where it fails with EINVAL.  At that point all future lctl barrier 
operations (including barrier_rescan) return EINVAL, except barrier_status, 
which simply reports "failed".  Things remain stuck that way until we either 
reboot the MDS or reach the heat death of the universe.  We have only tested 
the former solution, and it does work, though it's extremely heavyweight.
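
For concreteness, the sequence we run is roughly the following (syntax from 
memory; the fsname "lustrefs" and the timeout are placeholders):

~# lctl barrier_freeze lustrefs 30    # fails with EINVAL once the problem hits
~# lctl barrier_rescan lustrefs       # also returns EINVAL from then on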

Since we've been able to repro this on a test system (so we're leaving it in 
that state), we've been able to collect traces, and we see these two that are 
quite interesting:

0001:0001:7.0:1713898822.714580:0:2921:0:(ldlm_lockd.c:2382:ldlm_callback_handler())
 callback on lock 0x831100496f36c2f4 - lock disappeared
0001:0001:4.0:1713898822.710567:0:2676:0:(ldlm_lockd.c:2286:ldlm_callback_errmsg())
 @@@ Operate with invalid parameter, NID=12345-0@lo lock=0x831100496f36c2f4: rc 
= 0  req@2ae045d9 x1783750461943232/t0(0) 
o106->8afc7dcc-fc22-4bb8-8ca1-b5df01779cf4@0@lo:64/0 lens 400/224 e 0 to 0 dl 
1713899139 ref 1 fl Interpret:/0/0 rc -22/-22 job:''

This seems to correlate with this code:
2390 /*
2391  * Force a known safe race, send a cancel to the server for a lock
2392  * which the server has already started a blocking callback on.
2393  */
2394 if (OBD_FAIL_CHECK(OBD_FAIL_LDLM_CANCEL_BL_CB_RACE) &&
2395     lustre_msg_get_opc(req->rq_reqmsg) == LDLM_BL_CALLBACK) {
2396         rc = ldlm_cli_cancel(&dlm_req->lock_handle[0], 0);
2397         if (rc < 0)
2398                 CERROR("ldlm_cli_cancel: %d\n", rc);
2399 }
2400
2401 lock = ldlm_handle2lock_long(&dlm_req->lock_handle[0], 0);
2402 if (!lock) {
2403         CDEBUG(D_DLMTRACE,
2404                "callback on lock %#llx - lock disappeared\n",
2405                dlm_req->lock_handle[0].cookie);
2406         rc = ldlm_callback_reply(req, -EINVAL);
2407         ldlm_callback_errmsg(req, "Operate with invalid parameter", rc,
2408                              &dlm_req->lock_handle[0]);
2409         RETURN(0);
2410 }

The weird thing is that this lock never expires.  0x831100496f36c2f4 is here to 
stay.  Attempting to clear it by writing "clear" to the ldlm lru_size parameter 
does nothing.
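
That attempt looked like the following (the namespace glob is illustrative; we 
checked the lock counts the same way):

~# lctl set_param ldlm.namespaces.*.lru_size=clear
~# lctl get_param ldlm.namespaces.*.lock_count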

Practical questions:
1. I'm assuming NID=12345-0@lo is basically equivalent to "localhost" for 
Lustre, but if it's not, let me know.  I see the first part of this is defined 
as LNET_PID_LUSTRE, though I'm not 100% sure what PID stands for here (I doubt 
it's a process ID?).
2. Is there a tool I haven't found that can instruct Lustre to drop a lock by 
ID?  That would be handy, though I realize I'm asking for a "turn off the 
airbags" button.
3. One of my engineers made the following comment in case it is helpful:
"It appears that the "lock disappeared" lock exists in the dump_namespaces 
output as the remote end of another lock but nowhere as a lock itself. It's 
also interesting that it seems like the same resource appears twice with 
different resource IDs but the same 3 part ID that looks like a FID:
0001:0001:10.0:1713987044.511342:0:1897178:0:(ldlm_resource.c:1783:ldlm_resource_dump())
 --- Resource: [0x736665727473756c:0x5:0x0].0x0 (c3f4ce61) refcount = 3
0001:0001:10.0:1713987044.511343:0:1897178:0:(ldlm_resource.c:1787:ldlm_resource_dump())
 Granted locks (in reverse order):
0001:0001:10.0:1713987044.511343:0:1897178:0:(ldlm_resource.c:1790:ldlm_resource_dump())
 ### ### ns: MGS lock: e2096801/0x831100496f36eea6 lrc: 2/0,0 mode: 
CR/CR res: [0x736665727473756c:0x5:0x0].0x0 rrc: 4 type: PLN flags: 
0x40 nid: 0@lo remote: 0x831100496f36ee9f expref: 14 pid: 2699 
timeout: 0 lvb_type: 0
0001:0040:10.0:1713987044.511344:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref())
 putref res: c3f4ce61 count: 3
0001:0001:10.0:1713987044.511344:0:1897178:0:(ldlm_resource.c:1790:ldlm_resource_dump())
 ### ### ns: MGS lock: 1007e670/0x831100496f36c2fb lrc: 2/0,0 mode: 
CR/CR res: [0x736665727473756c:0x5:0x0].0x0 rrc: 4 type: PLN flags: 
0x40 nid: 0@lo remote: 0x831100496f36c2f4 expref: 14 pid: 2700 
timeout: 0 lvb_type: 0
0001:0040:10.0:1713987044.511345:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref())
 putref res: c3f4ce61 count: 3
0001:0040:10.0:1713987044.511346:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref())
 putref res: c3f4ce61 count: 2
0001:0040:10.0:1713987044.511346:0:1897178:0:(ldlm_resource.c:1566:ldlm_resource_getref())
 getref res: 1ab54d34 count: 35

0001:0001:10.0:1713987044.512238:0:1897178:0:(ldlm_resource.c:1783:ldlm_resource_dump())
 --- Resource: [0x736665727473756c:0x5:0x0].0x0 (19b90cb9) 

[lustre-discuss] Full List of Required Open Lustre Ports?

2023-02-01 Thread Ellis Wilson via lustre-discuss
Hi folks,

We've seen some weird stuff recently with UFW/iptables dropping packets on our 
OSS and MDS nodes.  We are running 2.15.1.  Example:

[   69.472030] [UFW BLOCK] IN=eth0 OUT= MAC= SRC= DST= LEN=52 
TOS=0x00 PREC=0x00 TTL=64 ID=58224 DF PROTO=TCP SPT=1022 DPT=988 WINDOW=510 
RES=0x00 ACK FIN URGP=0

[11777.280724] [UFW BLOCK] IN=eth0 OUT= MAC= SRC= DST= LEN=64 
TOS=0x00 PREC=0x00 TTL=64 ID=44206 DF PROTO=TCP SPT=988 DPT=1023 WINDOW=509 
RES=0x00 ACK URGP=0

Previously, we were only allowing 988 bidirectionally on BOTH clients and 
servers.  This was based on guidance from the Lustre manual.  From the above 
messages it appears we may need to expand that range.  This thread discusses it:
https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg17229.html

Based on that thread and some code reading, it appears that, absent explicit 
configuration of conns_per_peer, the number of extra ports potentially required 
is autotuned from the interface speed (ksocklnd_speed2cpp).  E.g., for a node 
with a 50Gbps interface we may need up to 3 ports open to accommodate the extra 
connections.  These appear to be selected starting at 1023 and working down as 
far as 512.
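
For illustration, the UFW rules we're considering look something like the 
following (the source subnet is a placeholder for our environment):

# Lustre service port:
~# ufw allow proto tcp from 10.1.98.0/24 to any port 988
# possible extra ksocklnd ports (see the questions below):
~# ufw allow proto tcp from 10.1.98.0/24 to any port 1021:1023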

Questions:
1. If we do not open up more than 988, are there known performance issues for 
machines at or below, say, 50Gbps?  It does seem that with these extra ports 
closed we see no correctness or visible performance problems, so there must be 
some fallback mechanism at play.
2. Can we just open 1023 to 1021 for a 50GigE machine?  Or are there situations 
where binding might fail and the algorithm could potentially attempt to create 
sockets all the way down to 512?
3. Regardless of the answer to #2, do we need to open these ports on all client 
and server nodes, or can we get away with just server nodes?
4. Do these need to be opened just for egress from the node in question, or 
bidirectionally?

Thanks in advance!

Best,

ellis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Intent of resize in mkfs.lustre

2022-11-09 Thread Ellis Wilson via lustre-discuss
Hi all,

I ran into an issue against drives just shy of 16TiB with mkfs.lustre, which 
appears to relate to how resize is employed by mkfs.lustre when it calls mke2fs.

I've opened this:
https://jira.whamcloud.com/browse/LU-16305
Side note: how do I assign something to myself?  I have a fix but can't find 
any buttons on JIRA that allow me to pick up a bug I opened for myself.

My fix bounds the disk capacity 1MiB below the specified resize value if the 
disk falls into the problem range of (16TiB - 32GiB) to (16TiB - 1B), but I 
wanted to better understand what we're trying to accomplish with the extended 
option "resize".

My understanding of resize in the mke2fs context is that it reserves extra 
space in the block descriptor table such that you could extend ext*/ldiskfs 
down the road up to the given resize block count.  However, in 
libmount_utils_ldiskfs.c Lustre's use of it seems like an optimization I don't 
quite understand:

/* In order to align the filesystem metadata on 1MB boundaries,
 * give a resize value that will reserve a power-of-two group
 * descriptor blocks, but leave one block for the superblock.
 * Only useful for filesystems with < 2^32 blocks due to resize
 * limitations. */

The comment makes it sound like resize varies with the device size, but it 
currently only varies with block size (for a 4KB block size it's always 
4290772992).

Does anybody know what this optimization is attempting to achieve, and what 
motivated it, since it doesn't seem related in the least to actually resizing 
the drive?  Since most spinners are north of 16TiB nowadays, this optimization 
won't be enabled for them -- is that concerning?

Best,

ellis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lproc stats changed snapshot_time from unix-epoch to uptime/monotonic in 2.15

2022-08-25 Thread Ellis Wilson via lustre-discuss
Thanks for confirming Andreas, and will do!

-Original Message-
From: Andreas Dilger  
Sent: Wednesday, August 24, 2022 8:47 PM
To: Ellis Wilson 
Cc: lustre-discuss@lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-discuss] lproc stats changed snapshot_time from 
unix-epoch to uptime/monotonic in 2.15

Ellis, thanks for reporting this.  This looks like it was a mistake. 

The timestamps should definitely be in wallclock time, but this looks to have 
been changed unintentionally in order to reduce overhead and use a u64 instead 
of dealing with timespec64 math, losing the original intent along the way 
(there are many different ktime_get variants, all alike).

I think many monitoring tools will be unaffected because they use the delta 
between successive timestamps, but having timestamps that are relative to boot 
time is problematic since they may repeat or go backward after a reboot, and 
some tools may use this timestamp when inserting into a tsdb to avoid 
processing lag. 

Please file a ticket, and ideally submit a patch that converts ktime_get() to 
ktime_get_real_ns() in the places changed by the original patch (with a 
"Fixes:" line to track it against that patch, which was commit ea2cd3af7b).

Cheers, Andreas

> On Aug 24, 2022, at 14:50, Ellis Wilson via lustre-discuss 
>  wrote:
> 
> Hi all,
> 
> One of my colleagues noticed that in testing 2.15.1 out the stats returned 
> include snapshot_time showing up in a different fashion than before.  
> Previously, ktime_get_real_ts64 was used to get the current timestamp and 
> that was presented when stats were printed, whereas now uptime is used as 
> returned by ktime_get.  Is there a monotonic requirement to snapshot_time 
> that I'm not thinking about that makes ktime_get more useful?  The previous 
> behavior of getting the current time alongside the stats so you could reason 
> about when they were gotten made more sense to me.  But perhaps Andreas had a 
> different vision for use of snapshot_time given that he was the one who 
> revised it?
> 
> I'm glad to open a bug and propose a patch if this was just a mistake, but 
> figured I'd ask first.
> 
> Best,
> 
> ellis
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lproc stats changed snapshot_time from unix-epoch to uptime/monotonic in 2.15

2022-08-24 Thread Ellis Wilson via lustre-discuss
Hi all,

One of my colleagues noticed while testing 2.15.1 that the returned stats now 
report snapshot_time in a different fashion than before.  Previously, 
ktime_get_real_ts64 was used to get the current timestamp and that was 
presented when stats were printed, whereas now uptime is used, as returned by 
ktime_get.  Is there a monotonic requirement on snapshot_time that I'm not 
thinking of that makes ktime_get more useful?  The previous behavior of 
reporting the current time alongside the stats, so you could reason about when 
they were gathered, made more sense to me.  But perhaps Andreas had a different 
vision for the use of snapshot_time, given that he was the one who revised it?
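
(For anyone who wants to check their own systems, a quick comparison along 
these lines shows which clock snapshot_time is tracking; any stats file will 
do:)

~# lctl get_param -n osc.*.stats | grep snapshot_time
~# date +%s           # wallclock seconds since the epoch
~# cat /proc/uptime   # seconds since boot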

I'm glad to open a bug and propose a patch if this was just a mistake, but 
figured I'd ask first.

Best,

ellis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] Limiting Lustre memory use?

2022-02-22 Thread Ellis Wilson via lustre-discuss
Hi Bill,

I just ran into a similar issue.  See:
https://jira.whamcloud.com/browse/LU-15468

Lustre definitely caches data in the pagecache and, as far as I have seen, 
metadata in slab.  I'd start by running slabtop on a client machine if you can 
reliably reproduce the OOM situation, or by creating a cronjob that cats 
/proc/meminfo and /proc/vmstat into a file at one-minute intervals to capture 
the state of the machine before it goes belly up.  If you see a tremendous 
amount consumed by Lustre slabs then it's likely on the inode caching side (the 
slab name should be indicative, though), and you might try a client build with 
this recent change to see if it mitigates the issue:
https://review.whamcloud.com/#/c/39973
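
(Something along these lines works for the cronjob; the log path is arbitrary:)

# /etc/cron.d/memstate -- capture memory state once a minute
* * * * * root { date; cat /proc/meminfo /proc/vmstat; } >> /var/log/memstate.log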

Note that disabling the Lustre inode cache like this will inherently apply 
significantly more pressure on your MDTs, but if it keeps you out of OOM 
territory, it's probably a win.

In my case it wasn't metadata that was forcing my clients to OOM, but PTLRPC 
holding onto references to pages the rest of Lustre thought it was done with 
until my OSTs committed their transactions.  Revising my OST mount options to 
use an explicit commit=5 fixed my problem.
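
(For reference, the mount option change was along these lines; the device and 
mount point are placeholders:)

~# mount -t lustre -o commit=5 /dev/sdX /mnt/ost0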

Best,

ellis

-Original Message-
From: lustre-discuss  On Behalf Of 
bill broadley via lustre-discuss
Sent: Friday, February 18, 2022 4:43 PM
To: lustre-discuss@lists.lustre.org
Subject: [EXTERNAL] [lustre-discuss] Limiting Lustre memory use?


On a cluster I managed (without Lustre), we had many problems with users 
running nodes out of ram which often killed the node.  We added cgroup support 
to slurm and those problems disappeared.  Nearly 100% of the time get a cgroup 
OOM instead of a kernel OOM and the nodes would stay up and stable. This became 
doubly important when we started allowing jobs to share nodes and didn't want 
job A to be able to crash job B.

I've tried similar on a Lustre enabled cluster, and it seems like the memory 
used by Lustre escapes that control (I believe it is in the kernel and outside 
of the job's cgroup).  I think part of the problem is that I believe Lustre 
caches metadata in the linux page cache, but not data.  I've tried reducing the 
ram available to slurm, but we still get kernel OOMs instead of cgroup OOMs.

Anyone have a suggestion for fixing this?  Is there any way to limit Lustre's 
memory use in the kernel?  Or force that caching into userspace and inside the 
cgroup?  Or possibly out of ram and onto a client local NVMe?

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Appropriate Umount Ordering

2022-02-17 Thread Ellis Wilson via lustre-discuss
Hi all,

(Hopefully) two simple questions this time around.  This is for 2.14.0, and my 
cluster is set up with no failovers for MDTs or OSTs.  OBD timeouts have not 
been altered from the defaults.

Question 1:

I read on the Lustre Wiki that the appropriate ordering to umount the various 
components of a Lustre filesystem is:
1. Clients
2. MDT(s)
3. OSTs
4. MGS
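
Concretely, that means something like the following (the mount points and the 
host-by-host split are placeholders for our setup):

# on every client:
~# umount /mnt/lustre
# on the MDS:
~# umount /mnt/mdt
# on each OSS:
~# umount /mnt/ost0
# on the MGS:
~# umount /mnt/mgt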

However, if I do it this way, the OST umounts always hang for 4 minutes and 25 
seconds before completing.  Dmesg reports:
[88944.272233] Lustre: 30178:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ 
Request sent has timed out for slow reply: [sent 1645111309/real 1645111309]  
req@cc9c1aeb x1724931853622016/t0(0) 
o39->lustrefs-MDT-lwp-OST@10.1.98.8@tcp:12/10 lens 224/224 e 0 to 1 dl 
1645111574 ref 2 fl Rpc:XNQr/0/ rc 0/-1 job:''
[88944.275884] Lustre: Failing over lustrefs-OST
[88944.429622] Lustre: server umount lustrefs-OST complete

For reference, if I reverse the OSTs and the MDT (unmounting the MDT after the 
OSTs), then all of the OST umounts are fast, but the MDT takes a whopping 8 
minutes and 50 seconds to umount.

Why is the canonical shutdown ordering stalling so long (and for such specific 
durations) for me?

Question 2:

In all cases (OSTs or MDTs) of umount, whether they are fast or not, I see 
messages like the following in dmesg:
[88944.275884] Lustre: Failing over lustrefs-OST
or
[78406.007678] Lustre: Failing over lustrefs-MDT

There is no failover configured in my setup, and the MGS is up the entire time 
in all cases.  What is Lustre doing here?  How do I explicitly disable this 
failover attempt, since it seems at best misleading and at worst directly 
related to the lengthy delays?  FWIW, I have tried umount with '-f' to cause 
the MDT to go into failout rather than failover, to no avail.

Thanks for any help folks can offer on this in advance,

ellis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O (2.14/2.15)

2022-02-14 Thread Ellis Wilson via lustre-discuss
I believe I have root caused this, and posted detailed analysis on the opened 
JIRA issue (link in the previous message).  Questions for the community:

1. The Lustre manual claims that "By default, sync_journal is enabled 
(sync_journal=1), so that journal entries are committed synchronously," but I'm 
finding that the reverse is, and has been, true for over a decade.  This is the 
cause of my client OOM malaise: my clients hold onto referenced pages until the 
OSTs commit their journals *and* the clients ping the MGS (or something else 
updates their last-committed transaction number to a value greater than that of 
the outstanding requests).  These small clients (in fact, even ones as large as 
64GB) can easily write fast enough to exhaust memory before the OSTs decide 
it's time to flush the transactions.  Can somebody clarify whether this is just 
a clerical error in the manual and async journal committing is expected to be 
the default and safe?

2. It appears that although the default "commit" mount option for ext4 is 5 
seconds, this is either disabled entirely or set to a much higher value in 
ldiskfs.  Can somebody clarify what the ldiskfs default setting for commit is 
(I'm failing hard at locating it in the code or the ldiskfs patches)?  
Adjusting the mount option on the OST to use "commit=5" does the right thing 
(prevents my client from going OOM without the workaround in #1) from what I 
can tell, so 5s must not be the default for ldiskfs.

3. Are there thoughts from the community on whether setting "sync_journal=1" 
via lctl or changing the mount option to "commit=5" is preferable (see the 
example commands after these questions)?  The latter seems like it will be 
slightly more performant for very busy systems, but for streaming I/O they have 
so far produced identical results.

4. OFD targets appear to maintain grant info relating to dirty, pending, and 
currently available grant.  I'm witnessing pending well exceed the ldiskfs 
journal size on my OSTs (which defaults to 1GB).  The code suggests these two 
are discrete concepts, as pending is sanity-checked against the number of 
blocks in the filesystem shifted left by the power of two associated with the 
block size.  What's the rationale behind the pending value?
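
For question 3, the two alternatives I'm weighing look roughly like this (a 
sketch; the parameter name is as I understand it, and the OST device and mount 
point are placeholders):

# on the OSS, via lctl:
~# lctl set_param obdfilter.*.sync_journal=1
# or via the ldiskfs mount option:
~# mount -t lustre -o commit=5 /dev/sdX /mnt/ost0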

Best,

ellis

From: Ellis Wilson
Sent: Thursday, January 20, 2022 2:28 PM
To: Peter Jones ; Raj ; Patrick 
Farrell 
Cc: lustre-discuss@lists.lustre.org
Subject: RE: [lustre-discuss] [EXTERNAL] Re: Lustre Client Lockup Under 
Buffered I/O (2.14/2.15)

Thanks for facilitating a login for me Peter.  The bug with all logs and info I 
could think to include has been opened here:

https://jira.whamcloud.com/browse/LU-15468

I'm going to keep digging on my end, but if anybody has any other bright ideas 
or experiments they'd like me to try, don't hesitate to say so here or in the 
bug.

From: Peter Jones <pjo...@whamcloud.com>
Sent: Thursday, January 20, 2022 9:28 AM
To: Ellis Wilson <elliswil...@microsoft.com>; Raj <rajgau...@gmail.com>; 
Patrick Farrell <pfarr...@ddn.com>
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] [EXTERNAL] Re: Lustre Client Lockup Under 
Buffered I/O (2.14/2.15)

Ellis

JIRA accounts can be requested from i...@whamcloud.com

Peter

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Ellis Wilson via lustre-discuss <lustre-discuss@lists.lustre.org>
Reply-To: Ellis Wilson <elliswil...@microsoft.com>
Date: Thursday, January 20, 2022 at 6:20 AM
To: Raj <rajgau...@gmail.com>, Patrick Farrell <pfarr...@ddn.com>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] [EXTERNAL] Re: Lustre Client Lockup Under 
Buffered I/O (2.14/2.15)

Thanks Raj - I've checked all of the nodes in the cluster and they all have 
peer_credits set to 8, and credits are set to 256.  AFAIK that's quite low - 8 
concurrent sends to any given peer at a time. Since I only have two OSSes, for 
this client, that's only 16 concurrent sends at a given moment.  IDK if at this 
level this devolves to the maximum RPC size of 1MB or the current max BRW I 
have set of 4MB, but in either case these are small MB values.

I've reached out to Andreas and Patrick to try to get a JIRA account to open a 
bug, but have not heard back yet.  If somebody on-list is more appropriate to 
assist with this, please ping me.  I collected quite a bit of logs/traces 
yesterday and have sysrq stacks to share when I can get access to the whamcloud 
JIRA.

Best,

ellis

From: Raj <rajgau...@gmail.com>
Sent: Thurs

Re: [lustre-discuss] [EXTERNAL] Re: Lustre Client Lockup Under Buffered I/O (2.14/2.15)

2022-01-20 Thread Ellis Wilson via lustre-discuss
Thanks for facilitating a login for me Peter.  The bug with all logs and info I 
could think to include has been opened here:

https://jira.whamcloud.com/browse/LU-15468

I'm going to keep digging on my end, but if anybody has any other bright ideas 
or experiments they'd like me to try, don't hesitate to say so here or in the 
bug.

From: Peter Jones 
Sent: Thursday, January 20, 2022 9:28 AM
To: Ellis Wilson ; Raj ; 
Patrick Farrell 
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] [EXTERNAL] Re: Lustre Client Lockup Under 
Buffered I/O (2.14/2.15)

Ellis

JIRA accounts can be requested from i...@whamcloud.com

Peter

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Ellis Wilson via lustre-discuss <lustre-discuss@lists.lustre.org>
Reply-To: Ellis Wilson <elliswil...@microsoft.com>
Date: Thursday, January 20, 2022 at 6:20 AM
To: Raj <rajgau...@gmail.com>, Patrick Farrell <pfarr...@ddn.com>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] [EXTERNAL] Re: Lustre Client Lockup Under 
Buffered I/O (2.14/2.15)

Thanks Raj - I've checked all of the nodes in the cluster and they all have 
peer_credits set to 8, and credits are set to 256.  AFAIK that's quite low - 8 
concurrent sends to any given peer at a time. Since I only have two OSSes, for 
this client, that's only 16 concurrent sends at a given moment.  IDK if at this 
level this devolves to the maximum RPC size of 1MB or the current max BRW I 
have set of 4MB, but in either case these are small MB values.

I've reached out to Andreas and Patrick to try to get a JIRA account to open a 
bug, but have not heard back yet.  If somebody on-list is more appropriate to 
assist with this, please ping me.  I collected quite a bit of logs/traces 
yesterday and have sysrq stacks to share when I can get access to the whamcloud 
JIRA.

Best,

ellis

From: Raj <rajgau...@gmail.com>
Sent: Thursday, January 20, 2022 8:14 AM
To: Patrick Farrell <pfarr...@ddn.com>
Cc: Andreas Dilger <adil...@whamcloud.com>; Ellis Wilson 
<elliswil...@microsoft.com>; lustre-discuss@lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-discuss] Lustre Client Lockup Under Buffered 
I/O (2.14/2.15)

Ellis, I would also check the peer_credit between server and the client. They 
should be same.

On Wed, Jan 19, 2022 at 9:27 AM Patrick Farrell via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:
Ellis,

As you may have guessed, that function just set looks like a node which is 
doing buffered I/O and thrashing for memory.  No particular insight available 
from the count of functions there.

Would you consider opening a bug report in the Whamcloud JIRA?  You should have 
enough for a good report, here's a few things that would be helpful as well:

It sounds like you can hang the node on demand.  If you could collect stack 
traces with:

echo t > /proc/sysrq-trigger
after creating the hang, that would be useful.  (It will print to dmesg.)

You've also collected debug logs - Could you include, say, the last 100 MiB of 
that log set?  That should be reasonable to attach if compressed.

Regards,
Patrick
________________
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Ellis Wilson via lustre-discuss <lustre-discuss@lists.lustre.org>
Sent: Wednesday, January 19, 2022 8:32 AM
To: Andreas Dilger <adil...@whamcloud.com>
Cc: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O 
(2.14/2.15)


Hi Andreas,

Apologies in advance for the top-post.  I'm required to use Outlook for work, 
and it doesn't handle in-line or bottom-posting well.

Client-side defaults prior to any tuning of mine (this is a very minimal 
1-client, 1-MDS/MGS, 2-OSS cluster):

~# lctl get_param llite.*.max_cached_mb
llite.lustrefs-8d52a9c52800.max_cached_mb=
users: 5
max_cached_mb: 7748
used_mb: 0
unused_mb: 7748
reclaim_count: 0
~# lctl get_param osc.*.max_dirty_mb
osc.lustrefs-OST-osc-8d52a9c52800.max_dirty_mb=1938
osc.lustrefs-OST0001-osc-8d52a9c52800.max_dirty_mb=1938
~# lctl get_param osc.*.max_rpcs_in_flight
osc.lustrefs-OST-osc-8d52a9c52800.ma

Re: [lustre-discuss] [EXTERNAL] Re: Lustre Client Lockup Under Buffered I/O (2.14/2.15)

2022-01-20 Thread Ellis Wilson via lustre-discuss
Thanks Raj - I've checked all of the nodes in the cluster and they all have 
peer_credits set to 8, and credits are set to 256.  AFAIK that's quite low - 8 
concurrent sends to any given peer at a time. Since I only have two OSSes, for 
this client, that's only 16 concurrent sends at a given moment.  IDK if at this 
level this devolves to the maximum RPC size of 1MB or the current max BRW I 
have set of 4MB, but in either case these are small MB values.

I've reached out to Andreas and Patrick to try to get a JIRA account to open a 
bug, but have not heard back yet.  If somebody on-list is more appropriate to 
assist with this, please ping me.  I collected quite a bit of logs/traces 
yesterday and have sysrq stacks to share when I can get access to the whamcloud 
JIRA.

Best,

ellis

From: Raj 
Sent: Thursday, January 20, 2022 8:14 AM
To: Patrick Farrell 
Cc: Andreas Dilger ; Ellis Wilson 
; lustre-discuss@lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-discuss] Lustre Client Lockup Under Buffered 
I/O (2.14/2.15)

Ellis, I would also check the peer_credit between server and the client. They 
should be same.

On Wed, Jan 19, 2022 at 9:27 AM Patrick Farrell via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:
Ellis,

As you may have guessed, that function just set looks like a node which is 
doing buffered I/O and thrashing for memory.  No particular insight available 
from the count of functions there.

Would you consider opening a bug report in the Whamcloud JIRA?  You should have 
enough for a good report, here's a few things that would be helpful as well:

It sounds like you can hang the node on demand.  If you could collect stack 
traces with:

echo t > /proc/sysrq-trigger
after creating the hang, that would be useful.  (It will print to dmesg.)

You've also collected debug logs - Could you include, say, the last 100 MiB of 
that log set?  That should be reasonable to attach if compressed.

Regards,
Patrick

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Ellis Wilson via lustre-discuss <lustre-discuss@lists.lustre.org>
Sent: Wednesday, January 19, 2022 8:32 AM
To: Andreas Dilger <adil...@whamcloud.com>
Cc: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O 
(2.14/2.15)


Hi Andreas,

Apologies in advance for the top-post.  I'm required to use Outlook for work, 
and it doesn't handle in-line or bottom-posting well.

Client-side defaults prior to any tuning of mine (this is a very minimal 
1-client, 1-MDS/MGS, 2-OSS cluster):

~# lctl get_param llite.*.max_cached_mb
llite.lustrefs-8d52a9c52800.max_cached_mb=
users: 5
max_cached_mb: 7748
used_mb: 0
unused_mb: 7748
reclaim_count: 0
~# lctl get_param osc.*.max_dirty_mb
osc.lustrefs-OST-osc-8d52a9c52800.max_dirty_mb=1938
osc.lustrefs-OST0001-osc-8d52a9c52800.max_dirty_mb=1938
~# lctl get_param osc.*.max_rpcs_in_flight
osc.lustrefs-OST-osc-8d52a9c52800.max_rpcs_in_flight=8
osc.lustrefs-OST0001-osc-8d52a9c52800.max_rpcs_in_flight=8
~# lctl get_param osc.*.max_pages_per_rpc
osc.lustrefs-OST-osc-8d52a9c52800.max_pages_per_rpc=1024
osc.lustrefs-OST0001-osc-8d52a9c52800.max_pages_per_rpc=1024

Thus far I've reduced the following to what I felt were really conservative 
values for a 16GB RAM machine:

~# lctl set_param llite.*.max_cached_mb=1024
llite.lustrefs-8d52a9c52800.max_cached_mb=1024
~# lctl set_param osc.*.max_dirty_mb=512
osc.lustrefs-OST-osc-8d52a9c52800.max_dirty_mb=512
osc.lustrefs-OST0001-osc-8d52a9c52800.max_dirty_mb=512
~# lctl set_param osc.*.max_pages_per_rpc=128
osc.lustrefs-OST-osc-8d52a9c52800.max_pages_per_rpc=128
osc.lustrefs-OST0001-osc-8d52a9c52800.max_pages_per_rpc=128
~# lctl set_param osc.*.max_rpcs_in_flight=2
osc.lustrefs-OST-osc-8d52a9c52800.max_rpcs_in_flight=2
osc.lustrefs-OST0001-osc-8d52a9c52800.max_rpcs_in_flight=2

This slows down how fast I get to basically OOM from <10 seconds to more like 
25 seconds, but the trend is identical.

As an example of what I'm seeing on the client, you can see below we start with 
most free, and then iozone rapidly (within ~10 seconds) causes all memory to be 
marked used, and that stabilizes at about 140MB free until at some point it 
stalls for 20 or more seconds and then some has been synced out:

~# dstat --mem
--memory-usage-
used  free  buff  cach
1029M 13.9G 2756k  215M
1028M 13.9G 2756k  215M
1028M 13.9G 2756k  215M
1088M 13.9G 2756k  215M
2550M 11.5G 2764k 1238M
3989M 10.1G 2764k 1236M
5404M 8881M 2764k 1239M
6831M 7

[lustre-discuss] Memory Management in Lustre

2022-01-19 Thread Ellis Wilson via lustre-discuss
Hi folks,

Broader (but related) question than my current malaise with OOM issues on 
2.14/2.15: is there any documentation, or can somebody point me at some code, 
that explains memory management within Lustre?  I've hunted through the Lustre 
manuals, the Lustre internals doc, and a bunch of code, but can find nothing 
that documents the memory architecture in place.  I'm specifically looking at 
the PTLRPC and OBD code right now, and I can't seem to find anywhere that 
explicitly limits the amount of allocation Lustre will perform.  On other 
filesystems I've worked on there are memory pools that you can explicitly size 
with maximums -- whether discrete per-area pools or reference counters drawing 
on a system-shared pool -- so I expected to see /something/ that bakes in 
limits of some kind.  I'm sure I'm just not finding it.  Any help is greatly 
appreciated.

Best,

ellis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O (2.14/2.15)

2022-01-19 Thread Ellis Wilson via lustre-discuss
 don't 
hesitate to ask.

Best,

ellis

From: Andreas Dilger 
Sent: Tuesday, January 18, 2022 9:54 PM
To: Ellis Wilson 
Cc: lustre-discuss@lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-discuss] Lustre Client Lockup Under Buffered 
I/O (2.14/2.15)

On Jan 18, 2022, at 13:40, Ellis Wilson via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Recently we've switched from using ZFS to ldiskfs as the backing filesystem to 
work around some performance issues and I'm finding that when I put the cluster 
under load (with as little as a single client) I can almost completely lockup 
the client.  SSH (even existing sessions) stall, iostat, top, etc all freeze 
for 20 to 200 seconds.  This alleviates for small windows and recurs as long as 
I leave the io-generating process in existence.  It reports extremely high CPU 
and RAM usage, and appears to be consumed exclusively doing 'system'-tagged 
work.  This is on 2.14.0, but I've reproduced on more or less HOL for 
master-next.  If I do direct-IO, performance is fantastic and I have no such 
issues regarding CPU/memory pressure.

Uname: Linux 85df894e-8458-4aa4-b16f-1d47154c0dd2-lclient-a0-g0-vm 
5.4.0-1065-azure #68~18.04.1-Ubuntu SMP Fri Dec 3 14:08:44 UTC 2021 x86_64 
x86_64 x86_64 GNU/Linux

In dmesg I see consistent spew on the client about:
[19548.601651] LustreError: 30918:0:(events.c:208:client_bulk_callback()) event 
type 1, status -5, desc b69b83b0
[19548.662647] LustreError: 30917:0:(events.c:208:client_bulk_callback()) event 
type 1, status -5, desc 9ef2fc22
[19549.153590] Lustre: lustrefs-OST-osc-8d52a9c52800: Connection to 
lustrefs-OST (at 10.1.98.7@tcp) was lost; in progress 
operations using this service will wait for recovery to complete
[19549.153621] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ 
Request sent has failed due to network error: [sent 1642535831/real 1642535833] 
 req@02361e2d x1722317313374336/t0(0) 
o4->lustrefs-OST0001-osc-8d52a9c52800@10.1.98.10@tcp:6/4 
 lens 488/448 e 0 to 1 dl 1642535883 ref 2 fl Rpc:eXQr/0/ rc 0/-1 job:''
[19549.153623] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request()) 
Skipped 4 previous similar messages

But I actually think this is a symptom of extreme memory pressure causing the 
client to timeout things, not a cause.

Testing with obdfilter-survey (local) on the OSS side shows expected 
performance of the disk subsystem.  Testing with lnet_selftest from client to 
OSS shows expected performance.  In neither case do I see the high cpu or 
memory pressure issues.

Reducing a variety of lctl tunables that appear to govern memory allowances for 
Lustre clients does not improve the situation.

What have you reduced here?  llite.*.max_cached_mb, osc.*.max_dirty_mb, 
osc.*.max_rpcs_in_flight and osc.*.max_pages_per_rpc?


By all appearances, the running iozone or even simple dd processes gradually 
(i.e., over a span of just 10 seconds or so) consumes all 16GB of RAM on the 
client I'm using.  I've generated bcc profile graphs for both on- and off-cpu 
analysis, and they are utterly boring -- they basically just reflect rampant 
calls to shrink_inactive_list resulting from page_cache_alloc in the presence 
of extreme memory pressure.

We have seen some issues like this that are being looked at, but this is mostly 
only seen on smaller VM clients used in testing and not larger production 
clients.  Are you able to test with more RAM on the client?  Have you tried 
with 2.12.8 installed on the client?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre Client Lockup Under Buffered I/O (2.14/2.15)

2022-01-18 Thread Ellis Wilson via lustre-discuss
Hi all,

Recently we've switched from using ZFS to ldiskfs as the backing filesystem to 
work around some performance issues, and I'm finding that when I put the 
cluster under load (with as little as a single client) I can almost completely 
lock up the client.  SSH (even existing sessions) stalls, and iostat, top, etc. 
all freeze for 20 to 200 seconds.  This alleviates for small windows and recurs 
for as long as I leave the IO-generating process in existence.  The client 
reports extremely high CPU and RAM usage, and appears to be consumed 
exclusively doing 'system'-tagged work.  This is on 2.14.0, but I've reproduced 
on more or less HOL for master-next.  If I do direct-IO, performance is 
fantastic and I have no such issues regarding CPU/memory pressure.

Uname: Linux 85df894e-8458-4aa4-b16f-1d47154c0dd2-lclient-a0-g0-vm 
5.4.0-1065-azure #68~18.04.1-Ubuntu SMP Fri Dec 3 14:08:44 UTC 2021 x86_64 
x86_64 x86_64 GNU/Linux

In dmesg I see consistent spew on the client about:
[19548.601651] LustreError: 30918:0:(events.c:208:client_bulk_callback()) event 
type 1, status -5, desc b69b83b0
[19548.662647] LustreError: 30917:0:(events.c:208:client_bulk_callback()) event 
type 1, status -5, desc 9ef2fc22
[19549.153590] Lustre: lustrefs-OST-osc-8d52a9c52800: Connection to 
lustrefs-OST (at 10.1.98.7@tcp) was lost; in progress operations using this 
service will wait for recovery to complete
[19549.153621] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ 
Request sent has failed due to network error: [sent 1642535831/real 1642535833] 
 req@02361e2d x1722317313374336/t0(0) 
o4->lustrefs-OST0001-osc-8d52a9c52800@10.1.98.10@tcp:6/4 lens 488/448 e 0 
to 1 dl 1642535883 ref 2 fl Rpc:eXQr/0/ rc 0/-1 job:''
[19549.153623] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request()) 
Skipped 4 previous similar messages

But I actually think this is a symptom of extreme memory pressure causing the 
client to timeout things, not a cause.

Testing with obdfilter-survey (local) on the OSS side shows expected 
performance of the disk subsystem.  Testing with lnet_selftest from client to 
OSS shows expected performance.  In neither case do I see the high cpu or 
memory pressure issues.

Reducing a variety of lctl tunables that appear to govern memory allowances 
for Lustre clients does not improve the situation (examples below).  By all 
appearances, the running iozone or even simple dd processes gradually (i.e., 
over a span of just 10 seconds or so) consume all 16GB of RAM on the client I'm 
using.  I've generated bcc profile graphs for both on- and off-cpu analysis, 
and they are utterly boring -- they basically just reflect rampant calls to 
shrink_inactive_list resulting from page_cache_alloc in the presence of extreme 
memory pressure.
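
(The tunables I've been reducing are along these lines; the values shown are 
examples of what I tried, not recommendations:)

~# lctl set_param llite.*.max_cached_mb=1024
~# lctl set_param osc.*.max_dirty_mb=512
~# lctl set_param osc.*.max_pages_per_rpc=128
~# lctl set_param osc.*.max_rpcs_in_flight=2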

Any suggestions on the best path to debugging this are very welcome.

Best,

ellis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Configuring File Layout Questions

2021-07-13 Thread Ellis Wilson via lustre-discuss
Hi Lustre folks,

A few questions around configuring file layouts, specifically progressive file 
layouts:

1. In a freshly stood-up Lustre cluster, if there are no clients yet mounted, 
is there any Lustre utility (I've not found one) that allows one to perform the 
equivalent of "lfs setstripe" without an active mount point (say, from the MDS 
node)?  An example of the sort of command I mean is below the questions.

2. If not, is there a reasonable API against which such a utility could be 
constructed, or is this request at odds with the architecture?

3. In the absence of a separate client to mount the filesystem to perform 
normal "lfs" commands, can one safely mount the cluster directly from an MDS or 
some other node within the Lustre FS proper?  My understanding is that it is 
not safe, but that's based on hearsay, so I'd love to get a more authoritative 
answer.
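
For context, what I'd normally run from a client mount is a PFL layout along 
these lines (the component boundaries, stripe counts, and directory are just 
examples):

~# lfs setstripe -E 1M -c 1 -E 16M -c 4 -E -1 -c -1 /mnt/lustre/projectdir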

Thanks to anybody who can help answer one or more of these!

ellis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org