Re: [lustre-discuss] How to activate an OST on a client ?

2024-08-27 Thread Andreas Dilger via lustre-discuss
Hi Jan,
There is "lctl --device  recover" that will trigger a reconnect to the 
named OST device (per "lctl dl" output), but not sure if that will help.
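
For reference, a rough sketch of what that might look like on the affected client
(the device index 23 is a placeholder taken from "lctl dl" output):

```
# find the OSC device that corresponds to the inactive OST
lctl dl | grep osc

# trigger a reconnect for that device (index 23 is hypothetical)
lctl --device 23 recover

# if the import stays inactive, re-activating the device may also be worth a try
lctl --device 23 activate
```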


Cheers, Andreas

On Aug 22, 2024, at 06:36, Haarst, Jan van via lustre-discuss wrote:


Hi,

The wording of the subject probably doesn't quite cover the issue; what we
see is this:
We have a client behind a router (linking tcp to Omnipath) that shows an 
inactive OST (all on 2.15.5).
Other clients that go through the router do not have this issue.

One client had the same issue, although it showed a different OST as inactive.
After a reboot, all was well again on that machine.

The clients can lctl ping the OSSs.

So although we have a workaround (reboot the client), it would be nice to:

  1.  Fix the issue without a reboot
  2.  Fix the underlying issue.

It might be unrelated, but we also see another routing issue every now and then:
the router stops routing requests toward a certain OSS, and this can be fixed by
deleting the peer_nid of the OSS from the router.

I am probably missing informative logs, but I’m more than happy to try to 
generate them, if somebody has a pointer to how.
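
As a starting point, a minimal sketch of capturing the Lustre kernel debug log
on the client and/or router (the debug flags shown are only examples):

```
# enable network-related debug flags
lctl set_param debug=+net
lctl set_param debug=+neterror

# ... reproduce the problem ...

# dump the kernel debug buffer to a file for inspection
lctl dk /tmp/lustre-debug.log
```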

We are a bit stumped right now.

With kind regards,

--
Jan van Haarst
HPC Administrator
For Anunna/HPC questions, please use https://support.wur.nl (with HPC as 
service)
Aanwezig: maandag, dinsdag, donderdag & vrijdag
Facilitair Bedrijf, onderdeel van Wageningen University & Research
Afdeling Informatie Technologie
Postbus 59, 6700 AB, Wageningen
Gebouw 116, Akkermaalsbos 12, 6700 WB, Wageningen
http://www.wur.nl/nl/Disclaimer.htm

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfs find %LF

2024-08-22 Thread Andreas Dilger via lustre-discuss
On Aug 21, 2024, at 09:42, Michael DiDomenico via lustre-discuss wrote:

as far as i can tell most of the time when lustre outputs a fid it's bracketed

[0x2e279:0x1340:0x0]

however, if you run

lfs find  -printf "%LF\n"

0x2e279:0x1340:0x0

the fid comes out unbracketed.  i don't think most of the tools care, as
they're likely stripping off the brackets; just curious if there's a reason
for the different output or if it's an oversight?

Using "[]" around the FID can cause problems with shell expansion, which thinks 
it is a list of possible characters.

So, for example, if "lfs fid2path /mnt/testfs [0x2e279:0x1340:0x0]" is used
(and the "[]" are not escaped), the shell will expand the FID as the glob
"[0123479ex:]" and match any single-character file name from that set, which can
cause strange problems (sometimes users have files named "0" or "2" due to bad
stderr redirection).

All of the in-tree tools can handle FIDs without the [], but we can't remove
the brackets, for compatibility, in case external tools expect to see them.  They
also make FIDs more visually identifiable in the logs.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] changelog record format

2024-08-19 Thread Andreas Dilger via lustre-discuss
I would say that the changelog text output format is mainly for debugging and 
not really intended as a primary input format for some other changelog 
consumer.  Instead, the changelog consumer should fetch the binary changelog 
records itself and process them directly.

Most of the fields are fairly straightforward:

1 01CREAT 04:24:09.779203918 2024.08.20 0x0 t=[0x213a1:0x1:0x0]
j=cp.0 ef=0xf u=0:0 nid=0@lo p=[0x20007:0x1:0x0] testfile
2 01CREAT 04:26:36.351596347 2024.08.20 0x0 t=[0x213a1:0x2:0x0]
j=cp.0 ef=0xf u=0:0 nid=0@lo p=[0x20007:0x1:0x0] testfile2
3 01CREAT 04:27:54.551053189 2024.08.20 0x0 t=[0x213a1:0x4:0x0]
j=cp.0 ef=0xf u=0:0 nid=0@lo p=[0x213a1:0x3:0x0] file with spaces

1) record number
2) record type number/name
3) time of update
4) date of update
5) changelog flags
6) t= target FID
7) j= jobid
8) ef= extra field mask
9) u= UID:GID
10) nid= Network ID of node performing update
11) p= parent FID
12) filename (until end of line)

If you think they need more explanation, a patch to lustre/doc/lctl-changelog.1 
would be welcome.
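
For quick experiments the text interface is still handy; a hedged sketch of the
usual flow (the filesystem name "testfs" and changelog user "cl1" are
placeholders), while a real consumer would read the binary records through the
llapi changelog API:

```
# on the MDS: register a changelog consumer, which prints an id such as cl1
lctl --device testfs-MDT0000 changelog_register

# on a client: dump the records, then acknowledge the ones already processed
lfs changelog testfs-MDT0000
lfs changelog_clear testfs-MDT0000 cl1 0
```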

On Aug 19, 2024, at 17:30, Michael DiDomenico via lustre-discuss wrote:

is there a document somewhere that describes the changelog record line
format?  the lustre manual lists the record types, but it doesn't define what
the rest of the key/value pairs are for each record type.  most of the fields
are self-evident, but others i'm less clear on.  unfortunately googling
'lustre changelog' doesn't help :(

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Rhel8.10 Lustre Kernel Performance Decrease

2024-08-01 Thread Andreas Dilger via lustre-discuss
Sorry, you'll have to ask someone at Azure about this.  I don't know anything 
about what "Premium Lustre" or "Lustre V2" means, and I can't speak to any kind 
of performance for their systems.

On Jul 30, 2024, at 08:48, Baucum, Rashun <rashun.bau...@td.com> wrote:

Good morning, Andreas.

I apologize for the delay.

Yes, I will provide additional information. I initially asked the question as a
general inquiry. The crux of the performance issue we see is that the newer
RHEL 8.10 / Lustre 2.15.5 systems with similar builds do not hit the same
average performance we have seen previously.

Which version of Lustre for the old and new kernel and was it the same before 
upgrading to RHEL8.10?

  *   Previous Kernel : 3.10.0-1160.49.1.el7_lustre, RHEL 7.9
  *   Current Kernel : 4.18.0-553.5.1.el8_lustre, RHEL 8.10
  *   It was not the same; we waited for the last minor version of RHEL 8.


Which RHEL version are you comparing against, RHEL 8.9?

  *   Comparison Version : RHEL 8.8
  *   Lustre Client Version :  4.18.0-425.3.1.el8_lustre


Have you upgraded both the clients and servers to RHEL 8.10, or only the 
clients?

  *   Currently both are upgraded to the RHEL 8.10.



Results of FIO testing

Original Production: Initial baseline established ~1 year ago. This build is no
longer in use, but the performance of non-"premium" builds should be
approximately at this level.
Lustre V2: Current builds, currently used as the direct reference point. These
use RHEL 8 clients against RHEL 7 Lustre servers.
RHEL 8.10 – 2.15.5: Builds being tested before we push updates to higher
environments.

Between Original Production, Lustre V2 36, and the RHEL 8.10 Lustre systems
there are minimal changes between builds. In this specific case the only
differences between the Lustre V2 36 and RHEL 8.10 builds are the Lustre
versions, and that the RHEL 8.10 Premium Lustre uses Azure's premium SSDs as
disks for OSTs. All other builds use standard HDDs for OSTs.
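
For context, a sketch of the kind of sequential-write fio job that produces
numbers like these; the block size, job count, mount point, and runtime are
assumptions rather than the actual test configuration:

```
fio --name=write-tput --directory=/mnt/lustre --rw=write --bs=1M \
    --size=4G --numjobs=8 --iodepth=16 --ioengine=libaio --direct=1 \
    --runtime=300 --time_based --group_reporting
```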

Results of FIO testing (Orig. Prod. = Original Production; both RHEL 8.10
columns run Lustre 2.15.5, "Premium" on Azure premium SSD OSTs, "Standard" on
HDD OSTs):

Write Throughput       Orig. Prod.  Lustre V2 36  Lustre V2 108  RHEL 8.10 Premium  RHEL 8.10 Standard
Latency (sec) Avg            1.485         1.482          1.338              1.596               2.968
IOPS Avg                       689           690            765                640                 640
Bandwidth (MB/s)               723           724            803                672                 362
IO (GB)                       4340          4346           4818               4037                2173

Write IOPS             Orig. Prod.  Lustre V2 36  Lustre V2 108  RHEL 8.10 Premium  RHEL 8.10 Standard
Latency (sec) Avg           10.354        18.709          4.911              5.839              11.821
Avg IOPS                        24            13             52                 43                  21
Bandwidth (MB/s)                26            14             55                 45                  23
IO (GB)                        156            86            328                276                 137

Read Throughput        Orig. Prod.  Lustre V2 36  Lustre V2 108  RHEL 8.10 Premium  RHEL 8.10 Standard
Latency (sec) Avg            1.292         1.223          2.307              1.598               3.022
Avg IOPS                       792           836            754                640                 338
Bandwidth (MB/s)               831           877            791                672                 355
IO (GB)                       4986          5269           4750               4033                2134

Read IOPS              Orig. Prod.  Lustre V2 36  Lustre V2 108  RHEL 8.10 Premium  RHEL 8.10 Standard
Latency (sec) Avg            4.190         5.887          5.076              5.954              11.924
Avg IOPS                        61            43             50                 42                  21
Bandwidth (MB/s)                64            45             53                 45                  23
IO (GB)                        384           274            318                271                 135


Thanks,
Rashun Baucum




From: Andreas Dilger <adil...@whamcloud.com>
Sent: Thursday, July 4, 2024 12:40 AM
To: Baucum, Rashun <rashun.bau...@td.com>
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Rhel8.10 Lustre Kernel Performance Decrease


On Jul 3, 2024, at 13:12, Baucum, Rashun via lustre-discuss wrote:

Good afternoon,

We have recently started executing performance testing on the new RHEL 8.10
Lustre kernel. We have noticed a drop in performance in our initial testing.
It's roughly a 30-40% drop in total IO observed with our FIO testing. My
question is: has anyone else noticed any performance decreases?

Hi Rashun,
could you please be more specific about what you are comparing?  Which version 
of Lustre for the old and new kernel, and was it the same before upgrading to 
RHEL8.10?  Which RHEL version are you comparing against, RHEL 8.9?  Have you 
upgraded both the clients and servers to RHEL8.10, or only the clients?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud









Re: [lustre-discuss] question on usage of O_LOV_DELAY_CREATE

2024-07-31 Thread Andreas Dilger via lustre-discuss
Thomas,
your analysis is as good as any possible.  There should be at least an ioctl() 
call after the open() to create the objects before the pwrite64() call.  You 
would need to discuss this with Cray, use a different MPI, or potentially 
"pre-create" the file before MPI_File_open() so that O_LOV_DELAY_CREATE has no 
effect.
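
A hedged example of the pre-create workaround from the shell, run before the
MPI job (the stripe count and program name are placeholders):

```
# creating the file with an explicit layout instantiates its objects up front,
# so a later open with O_LOV_DELAY_CREATE has nothing left to delay
lfs setstripe -c 4 mpi_test.out
mpirun -n 4 ./mpi_test
```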

Cheers, Andreas

On Jul 30, 2024, at 08:40, Bertschinger, Thomas Andrew Hjorth via
lustre-discuss wrote:

Hello,

We have an application that fails doing the following on one of our systems:

...
openat(AT_FDCWD, "mpi_test.out", O_WRONLY|O_CREAT|O_NOCTTY|FASYNC, 0611) = 4
pwrite64(4, "\3\0\0\0", 4, 0)   = -1 EBADF (Bad file descriptor)
...

It opens a file with O_LOV_DELAY_CREATE (or O_NOCTTY|FASYNC as strace 
interprets it), and then immediately tries to write to it.

From the comments above ll_file_open() in Lustre:

If opened with O_LOV_DELAY_CREATE, then we don't do the object creation or open 
until ll_lov_setstripe() ioctl is called.

It sounds like the expectation is that the process calling open() like this 
follows it up with an ioctl to set the stripe information prior to writing.

Is this correct? In other words, is it reasonable to say that the failing code 
is doing something erroneous?

Here's a minimal MPI program that reproduces the problem. The issue only arises 
when using the Cray MPI implementation, however. When tested with openmpi and 
ANL mpich, the openat() call doesn't use O_LOV_DELAY_CREATE. Since the Cray 
implementation is unfortunately not open source, I have no insight into what 
this code is "supposed" to be doing. :(

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
   int err = MPI_Init(&argc, &argv);

   MPI_File fh;
   err = MPI_File_open(MPI_COMM_WORLD, "mpi_test.out",
   MPI_MODE_WRONLY|MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
   printf("MPI_File_open returned: %d\n", err);

   long data = 3;
   err = MPI_File_write(fh, &data, 1, MPI_LONG, MPI_STATUS_IGNORE);
   printf("MPI_File_write returned: %d\n", err);

   err = MPI_File_close(&fh);
   printf("MPI_File_close returned: %d\n", err);

   MPI_Finalize();
   return 0;
}

Thanks,
Thomas Bertschinger
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Trying to only build the lustreapi without sudo - make install error (Permission denied)

2024-07-31 Thread Andreas Dilger via lustre-discuss
On Jul 29, 2024, at 12:32, Apostolis Stamatis via lustre-discuss wrote:

Hello all,

I am trying to use the lustreapi in a project and I need to build it without 
sudo privileges in order to compile code that uses it. I don't need to build 
the whole client support, my end goal is only to compile code using lustreapi.h

What I am doing is:

```

sh autogen.sh

./configure --disable-server --disable-modules --disable-tests 
--prefix=/home/user/thirdparty/

make install

```

This fails with the error (output shortened to what I think is relevant)

```

make[3]: Entering directory '/home/user/lustre-release/libcfs/libcfs'
/bin/bash ../../libtool  --tag=CC   --mode=link gcc -fPIC -g -O2 -Wall -Werror  
 -o libcfs.la  util/libcfs_la-string.lo 
util/libcfs_la-nidstrings.lo util/libcfs_la-param.lo util/libcfs_la-parser.lo 
util/libcfs_la-l_ioctl.lo  -lkeyutils
libtool: link: rm -fr  .libs/libcfs.a .libs/libcfs.la
rm: cannot remove '.libs/libcfs.a': Permission denied
rm: cannot remove '.libs/libcfs.la': Permission denied

It looks like these files are not owned by the user running this command.  
Maybe there was previously a "sudo make" or similar run, and these 
directories/files are owned by root?
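
If that is the case, a hedged sketch of getting back to a user-owned tree (the
source path is taken from the make output above; a one-time chown or a clean
checkout both work):

```
# take back ownership of the root-owned build artifacts ...
sudo chown -R "$USER" /home/user/lustre-release

# ... or reset to a pristine tree and rebuild entirely as the user
git -C /home/user/lustre-release clean -dfx
```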

libtool: link: ar cr .libs/libcfs.a util/.libs/libcfs_la-string.o 
util/.libs/libcfs_la-nidstrings.o util/.libs/libcfs_la-param.o 
util/.libs/libcfs_la-parser.o util/.libs/libcfs_la-l_ioctl.o
ar: could not create temporary file whilst writing archive: no more archived 
files

```

When running `sudo make install`, it works


So my questions are:

1) Is it possible to run make install without sudo for my use case

I think yes, though I haven't tested this myself.  The errors above indicate a 
permission error and not something wrong with the build.

2) Is there a better way to only build the support for using lustreapi 
(assuming that what is required is only the library liblustreapi.a and the 
include headers)

It would make more sense to build and install the libraries together with the 
modules so that they are a consistent version.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustreapi not found when compiling sample C program

2024-07-10 Thread Andreas Dilger via lustre-discuss
Why not build packages and install them properly?  There are a few pages on 
https://wiki.lustre.org/ for this (which should be updated if they are missing 
some info), but the TLDR to build a client is to run in the Lustre Git checkout 
tree:

sh autogen.sh
./configure --disable-tests --disable-server
make debs

(or "make rpms" for RH/SLES) and then install the resulting packages should 
work for most people if the pre-compiled packages do not.  This avoids copying 
the files around, ensures that those files are cleaned up when new packages are 
installed/removed, has proper dependencies, etc.

Cheers, Andreas

On Jul 10, 2024, at 15:55, Apostolis Stamatis via lustre-discuss wrote:

Hello all,

Sharing an update on this issue to help anyone in the future encountering 
similar problems.

The first solution I tried was to abandon the installation from source and 
install the deb packages. This fixed the issues with the linker and the 
compiler and I was able to use the C api, because it put the files on the 
default paths for gcc and ld. However, this caused other issues when I went to 
mount the filesystem (eg modprobe lustre not working), to my understanding 
caused by the deb packages being built for a slightly different kernel version.

So this made me go back to installing from source and building the deb packages
myself. In order to fix my path errors, I manually copied the required files to
the default paths used by gcc and ld. I understand this is not necessary, as I
can also add the paths when compiling. However, this made me wonder whether
there is a way to directly generate/copy the include files/libraries into the
system defaults? I looked through the docs but couldn't find anything on this.

In any case, what I needed to copy for my use case (there might be additional 
files for completeness which I didn't need, not sure):

cp lustre-release/lustre/include/lustre/*.h /usr/include/lustre/

cp lustre-release/lustre/include/uapi/linux/lustre/*.h 
/usr/include/linux/lustre/

cp lustre-release/debian/lustre-dev/usr/lib/* /usr/lib/

After that the simple `gcc test.c -llustreapi` worked

Thanks for the help, feel free to add anything you might think is useful 
(potentially a better way to do what I described)

Regards,

Apostolis


On 4/7/24 19:13, Apostolis Stamatis via lustre-discuss wrote:
Thanks for the help Andreas, indeed installing the lustre-dev and 
lustre-client-utils packages solved the issue with the lustreapi library.

However I am still getting an error:

```

/usr/bin/ld: /tmp/cceAX1ZW.o: in function `main':
test_file.c:(.text+0x37): undefined reference to `llapi_file_create'
collect2: error: ld returned 1 exit status

```

This leads me to believe I am doing something else wrong (potentially with the 
includes?).

Anyone with any input on what the issue might be or alternatively the steps 
they have followed to use the C lustre api?

Cheers, Apostolis


On 8/6/24 21:19, Andreas Dilger wrote:
On Jun 8, 2024, at 08:14, Apostolis Stamatis via lustre-discuss wrote:

Hello everyone,

I am trying to use the C api for lustre, using Ubuntu 22.04, kernel version 
5.15.0-107 and lustre client modules version 2.15.4
I am building lustre from source with the following steps (removed some junk 
like git clone and cd) (mainly from the guide 
https://metebalci.com/blog/lustre-2.15.4-on-rhel-8.9-and-ubuntu-22.04/)

It would be great to copy this page to wiki.lustre.org. 
It is a bit ironic that this page is mentioning that the wiki is outdated, but 
then proceeds to not update the wiki with new content...

```
sudo apt install build-essential libtool pkg-config flex bison libpython3-dev 
libmount-dev libaio-dev libssl-dev libnl-genl-3-dev libkeyutils-dev libyaml-dev 
libreadline-dev module-assistant debhelper dpatch libsnmp-dev mpi-default-dev 
quilt swig
sh autogen.sh
./configure --disable-server
make dkms-debs
sudo dpkg -i debs/lustre-client-modules-dkms_2.15.4-1_amd64.deb
sudo apt --fix-broken install
sudo dpkg -i debs/lustre-client-utils_2.15.4-1_amd64.deb
```

The client works as expected and can mount and modify the filesystem.
However when I try to compile the sample program using gcc v 11.4.0 with the 
command
`gcc -I/usr/src/lustre-client-modules-2.15.4/lustre/include 
-I/usr/src/lustre-client-modules-2.15.4/lustre/include/uapi/ 
-I/usr/src/lustre-client-modules-2.15.4/lustre/include/lustre -llustreapi 
test_file.c -o test`
I get the error `/usr/bin/ld: cannot find -llustreapi: No such file or 
directory`

After trying to find the lustreapi library manually, indeed I can't seem to 
find it anywhere

According to the most recent build on the b2_15 branch on Ubuntu 22.04,
https://build.whamcloud.com/job/lustre-b2_15/87/arch=x86_64,build_type=client,distro=ubuntu2204,ib_stack=inkernel/

there is a "lustre-dev" package built, and it looks like this would contain the
library files: "This package provides development libraries for the Lustre filesystem."

Re: [lustre-discuss] GitHub lustre-releases repo not in sync

2024-07-09 Thread Andreas Dilger via lustre-discuss
Hi Tom,
I'm the owner of the "lustre" account at github, and I normally do a manual 
update after branch landings on master/b2_15.

I've synced both branches.

Cheers, Andreas

On Jul 8, 2024, at 08:45, Tom Vander Aa (imec) via lustre-discuss wrote:

Hi,

I don’t know who is responsible for syncing 
https://github.com/lustre/lustre-release with whamcloud, but the GitHub repo 
has not been synced in 3 weeks.
This is unfortunate since my organization does not allow using git with the 
git:// protocol. Hence cloning from 
git://git.whamcloud.com/fs/lustre-release.git does not work.

I hope this can be fixed in some way.

Cheers,
Tom

--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] omnipath and lnet_selftest performance

2024-07-06 Thread Andreas Dilger via lustre-discuss


On Jul 5, 2024, at 11:37, Michael DiDomenico via lustre-discuss wrote:

i could use a little help with lustre clients over omni path.  when i
run ib_write_bw tests between two compute nodes i get +10GB/sec.
compute nodes are rhel9.4 with rhel hw drivers

however, when i run lnet_selftest between the same two compute nodes

1m i/o size
16 concurrency

node1-node3
read 1m i/o ~7.1GB/sec
write 1m i/o ~4.7GB/sec

node3-node1
read 1m i/o ~6.6GB/sec
write 1m i/o ~4.9GB/sec

varying the i/o size and concurrency changes the numbers, but not
dramatically.  i've gone through the tuning guide for omnipath and my
lnd tunables all match, but i can't seem to drive the bandwidth any
higher between nodes.

Please provide the actual tuning parameters in use.

Even when we were part of Intel, the OPA tuning parameters suggested by
the OPA team were not necessarily the best in all cases.   There was some
kind of memory registration they kept suggesting, but it was always worse
in practice than in theory.

The biggest win was from setting conns_per_peer=4 or so, because OPA
needs more CPU resources for good performance than IB.
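
A hedged example of setting that on both peers via /etc/modprobe.d (the value 4
follows the suggestion above; the modules need to be reloaded for it to take
effect):

```
# /etc/modprobe.d/ko2iblnd.conf on both clients and servers
options ko2iblnd conns_per_peer=4
```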

That said, it has been several years since I've had to deal with it, so I can't
say whether your current performance is good or bad.

can anyone suggest where i might be dropping some performance or is
this the end?  i feel like there should be more performance here, but
since we recently retooled from rhel7 to rhel9, i'm unsure if there's
a tunable not tuned.  (unfortunately i don't have/can't seem to find
previous numbers to compare)

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Rhel8.10 Lustre Kernel Performance Decrease

2024-07-03 Thread Andreas Dilger via lustre-discuss
On Jul 3, 2024, at 13:12, Baucum, Rashun via lustre-discuss wrote:

Good afternoon,

We have recently started executing performance testing on the new RHEL 8.10
Lustre kernel. We have noticed a drop in performance in our initial testing.
It's roughly a 30-40% drop in total IO observed with our FIO testing. My
question is: has anyone else noticed any performance decreases?

Hi Rashun,
could you please be more specific about what you are comparing?  Which version 
of Lustre for the old and new kernel, and was it the same before upgrading to 
RHEL8.10?  Which RHEL version are you comparing against, RHEL 8.9?  Have you 
upgraded both the clients and servers to RHEL8.10, or only the clients?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Compiling client issue lustre 2.12.9

2024-06-19 Thread Andreas Dilger via lustre-discuss
I just found this in my inbox without any answer.

If you are trying to use newer kernel versions, you also need to use a newer 
Lustre client, 2.15.4 or the in-flight 2.15.5-RC2.  Those are built and tested 
with the newer kernel, and will interoperate with the older servers.

Cheers, Andreas

On May 17, 2024, at 15:21, Jerome Verleyen via lustre-discuss wrote:

Dear all

I need to install the client package for Lustre v2.12.9 on an AlmaLinux 8.9
system. As I couldn't get an rpm file, I tried to compile from the source rpm
file, following this recommendation from the Lustre wiki:
https://wiki.whamcloud.com/display/PUB/Rebuilding+the+Lustre-client+rpms+for+a+new+kernel

I'm facing a compile issue and have not been able to resolve it so far:

make[3]: Entering directory '/usr/src/kernels/4.18.0-513.24.1.el8_9.x86_64'
  CC [M] /home/jerome/rpmbuild/SOURCES/lustre-2.12.9/lustre/llite/vvp_io.o
In file included from include/linux/string.h:254,
 from include/linux/bitmap.h:9,
 from include/linux/cpumask.h:12,
 from include/linux/smp.h:13,
 from include/linux/lockdep.h:15,
 from include/linux/mutex.h:17,
 from include/linux/kernfs.h:13,
 from include/linux/sysfs.h:16,
 from include/linux/kobject.h:20,
 from 
/home/jerome/rpmbuild/SOURCES/lustre-2.12.9/lustre/include/obd.h:36,
 from 
/home/jerome/rpmbuild/SOURCES/lustre-2.12.9/lustre/llite/vvp_io.c:41:
In function 'fortify_memset_chk',
inlined from 'vvp_io_init' at 
/home/jerome/rpmbuild/SOURCES/lustre-2.12.9/lustre/llite/vvp_io.c:1520:2:
include/linux/fortify-string.h:239:4: error: call to '__write_overflow_field' 
declared with attribute warning: detected write beyond size of field (1st 
parameter); maybe use struct_group()? [-Werror]
__write_overflow_field(p_size_field, size);
^~
cc1: all warnings being treated as errors
make[6]: *** [scripts/Makefile.build:318: 
/home/jerome/rpmbuild/SOURCES/lustre-2.12.9/lustre/llite/vvp_io.o] Error 1
make[5]: *** [scripts/Makefile.build:558: 
/home/jerome/rpmbuild/SOURCES/lustre-2.12.9/lustre/llite] Error 2
make[4]: *** [scripts/Makefile.build:558: 
/home/jerome/rpmbuild/SOURCES/lustre-2.12.9/lustre] Error 2
make[3]: *** [Makefile:1619: 
_module_/home/jerome/rpmbuild/SOURCES/lustre-2.12.9] Error 2
make[3]: Leaving directory '/usr/src/kernels/4.18.0-513.24.1.el8_9.x86_64'
make[2]: *** [autoMakefile:1123: modules] Error 2
make[2]: Leaving directory '/home/jerome/rpmbuild/SOURCES/lustre-2.12.9'
make[1]: *** [autoMakefile:661: all-recursive] Error 1
make[1]: Leaving directory '/home/jerome/rpmbuild/SOURCES/lustre-2.12.9'
make: *** [autoMakefile:519: all] Error 2


On another email list, they recommended using a CFLAGS option like
-D_FORTIFY_SOURCE=0. However, this option does not resolve my issue.

I hope someone can help me with this.

Best regards.

--
-- Jérôme
A fine young man, he can't be far off 75 kilos.
- I didn't weigh him!
- At that weight, I can embalm him Cleopatra-style, the Egyptian masterpiece,
unalterable!
- But we're not asking you to preserve him, we're asking you to destroy him!
(Michel Audiard)

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Fwd: lustreapi not found when compiling sample C program

2024-06-08 Thread Andreas Dilger via lustre-discuss
On Jun 8, 2024, at 08:14, Apostolis Stamatis via lustre-discuss wrote:

Hello everyone,

I am trying to use the C api for lustre, using Ubuntu 22.04, kernel version 
5.15.0-107 and lustre client modules version 2.15.4
I am building lustre from source with the following steps (removed some junk 
like git clone and cd) (mainly from the guide 
https://metebalci.com/blog/lustre-2.15.4-on-rhel-8.9-and-ubuntu-22.04/)

It would be great to copy this page to wiki.lustre.org. It is a bit ironic that 
this page is mentioning that the wiki is outdated, but then proceeds to not 
update the wiki with new content...

```
sudo apt install build-essential libtool pkg-config flex bison libpython3-dev 
libmount-dev libaio-dev libssl-dev libnl-genl-3-dev libkeyutils-dev libyaml-dev 
libreadline-dev module-assistant debhelper dpatch libsnmp-dev mpi-default-dev 
quilt swig
sh autogen.sh
./configure --disable-server
make dkms-debs
sudo dpkg -i debs/lustre-client-modules-dkms_2.15.4-1_amd64.deb
sudo apt --fix-broken install
sudo dpkg -i debs/lustre-client-utils_2.15.4-1_amd64.deb
```

The client works as expected and can mount and modify the filesystem.
However when I try to compile the sample program using gcc v 11.4.0 with the 
command
`gcc -I/usr/src/lustre-client-modules-2.15.4/lustre/include 
-I/usr/src/lustre-client-modules-2.15.4/lustre/include/uapi/ 
-I/usr/src/lustre-client-modules-2.15.4/lustre/include/lustre -llustreapi 
test_file.c -o test`
I get the error `/usr/bin/ld: cannot find -llustreapi: No such file or 
directory`

After trying to find the lustreapi library manually, indeed I can't seem to 
find it anywhere

According to the most recent build on the b2_15 branch on Ubuntu 22.04
https://build.whamcloud.com/job/lustre-b2_15/87/arch=x86_64,build_type=client,distro=ubuntu2204,ib_stack=inkernel/

There is a "lustre-dev" package built, and it looks like that this would 
contain the library files, "This package provides development libraries for the 
Lustre filesystem."

Cheers, Andreas
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Unexpected used guard number

2024-06-04 Thread Andreas Dilger via lustre-discuss
The difference between your Intel and AMD nodes may be the RPC checksum type 
that is used by default (the clients and servers negotiate the fastest 
algorithm).

I suspect the checksum error itself is already fixed, but in the meantime you
could try setting a different checksum type than t10ip4k (or whatever it is you
are using; compare "lctl get_param osc.*.checksum_type" on your Intel vs. AMD
clients).
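
A hedged sketch of the comparison and of overriding the type (the algorithm name
"adler" is only an example, and whether the parameter is writable may depend on
your version):

```
# compare what was negotiated on the Intel vs. AMD clients
lctl get_param osc.*.checksum_type

# try a different algorithm on the affected clients
lctl set_param osc.*.checksum_type=adler
```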

Cheers, Andreas

On Jun 3, 2024, at 08:21, Fokke Dijkstra via lustre-discuss wrote:

Dear all,

We are frequently (about daily) seeing the following type of error in our 
logfile on some specific client nodes:

Jun  1 11:03:17 a100gpu1 kernel: LustreError: 3834:0:(integrity.c:66:obd_page_dif_generate_buffer()) scratch-OST0042-osc-ff35febc655a9000: unexpected used guard number of DIF 5/5, data length 4096, sector size 512: rc = -7
Jun  1 11:03:17 a100gpu1 kernel: LustreError: 
3834:0:(osc_request.c:2750:osc_build_rpc()) prep_req failed: -7
Jun  1 11:03:17 a100gpu1 kernel: LustreError: 
3834:0:(osc_cache.c:2186:osc_check_rpcs()) Write request failed with -7

We are running Lustre 2.15.4 over Ethernet on Rocky 8 servers and clients.
The error only appears on the client, nothing is found on the servers around 
that time period.

The errors mostly appear on our Intel ice lake based GPU nodes and less 
frequently on Intel ice lake based CPU nodes. We do not see the errors on our 
AMD Zen 3 nodes (the latter form the majority of our cluster).

The problem was brought to our attention by a few users running PyTorch code on
the GPU nodes, who complained about PyTorch reporting an error while writing a
file and then failing.
When checking the log files, the error appears to occur more often, and I can't
find a clear correlation with specific job types, nor with job failures (some
jobs seem to continue to run after the error appears in the system log file).

Has anyone seen this error before? Does somebody know how to fix this?

Kind regards,

Fokke Dijkstra

--
Fokke Dijkstra 
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA  Groningen, The Netherlands

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] kernel threads for rpcs in flight

2024-05-02 Thread Andreas Dilger via lustre-discuss
On May 2, 2024, at 18:10, Anna Fuchs <anna.fu...@uni-hamburg.de> wrote:
The number of ptlrpc threads per CPT is set by the "ptlrpcd_partner_group_size" 
module parameter, and defaults to 2 threads per CPT, IIRC.  I don't think that 
clients dynamically start/stop ptlrpcd threads at runtime.
When there are RPCs in the queue for any ptlrpcd it will be woken up and 
scheduled by the kernel, so it will compete with the application threads.  
IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT 
queue it will try to steal RPCs from another CPT on the assumption that the 
local CPU is not generating any RPCs so it would be beneficial to offload 
threads on another CPU that *is* generating RPCs.  If the application thread is 
extremely CPU hungry, then the kernel will not schedule the ptlrpcd threads on 
those codes very often, and the "idle" core ptlrpcd threads will be be able to 
run more frequently.

Sorry, maybe I am confusing things. I am still not sure how many threads I get.
For example, I have a 32-core AMD EPYC machine as a client and I am running a
serial-stream I/O application with a single stripe, 1 OST.
I am struggling to find out how many CPU partitions I have - is it something
on the hardware side or something configurable?
There is no file /proc/sys/lnet/cpu_partitions on my client.

This is a module parameter, since it cannot be changed at runtime.  This is 
visible at /sys/module/libcfs/parameters/cpu_npartitions and the default value 
depends on the number of CPU cores and NUMA configuration.  It can be specified 
with "options libcfs cpu_npartitions=" in /etc/modprobe.d/lustre.conf.

Assuming I had 2 CPU partitions, that would result in 4 ptlrpc threads at 
system start, right?

Correct.

Now if I set rpcs_in_flight to 1 or to 8, what effect does that have on the
number and the activity of the threads?

Setting rpcs_in_flight has no effect on the number of ptlrpcd threads.  The 
ptlrpcd threads process RPCs asynchronously (unlike server threads) so they can 
keep many RPCs in progress.

Serial stream, 1 rpcs_in_flight is waking up only one ptlrpc thread, 3 remain 
inactive/sleep/do nothing?

This depends.  There are two ptlrpcd threads for the CPT that can process the 
RPCs from the one user thread.  If they can send the RPCs quickly enough then 
the other ptlrpcd threads may not steal the RPCs from that CPT.

That said, even a single threaded userspace writer may have up to 8 RPCs in 
flight *per OST* (depending on the file striping and if IO submission allows it 
- buffered or AIO+DIO) so if there are a lot of outstanding RPCs and RPC 
generation takes a long time (e.g. compression) then it may be that all ptlrpcd 
threads will be busy.

That does not seem to be the case: I've applied the rpc tracing (thanks a lot
for the hint!), and with rpcs_in_flight at 1 it still shows at least 3 different
threads from at least 2 different partitions when writing a 1MB file with ten
blocks.
I don't get the relationship between these values.

What are the opcodes from the different RPCs?  The ptlrpcd threads are only 
handling asynchronous RPCs like buffered writes, statfs, and a few others.  
Many RPCs are processed in the context of the application thread, not by 
ptlrpcd.

And, if I had compression or any other heavy load, which settings would clearly
control how many resources I want to give Lustre for this load? I can see clear
scaling with higher rpcs_in_flight, but I am struggling to understand the
numbers and attribute them to specific settings. The uncompressed case already
benefits a bit from a higher RPC number due to multiple "substreaming", but
there must be much more happening in parallel behind the scenes in the
compressed case, even with rpcs_in_flight=1.

The "cpu_npartitions" module parameter controls how many groups the cores are 
split into.  The "cpu_pattern" parameter can control the specific cores in each 
of the CPTs, which would affect the default per-CPT ptlrpcd threads location. 
It is possible to further use the "ptlrpcd_cpts" and "ptlrpcd_per_cpt_max" 
parameters to control specifically which cores are used for the threads.

It is entirely possible that the number of ptlrpcd threads and CPT 
configuration is becoming sub-optimal as the number of multi-chip package CPUs 
with many cores grows dramatically.  It is a balance between having enough 
threads to maximize performance without having so many that it goes downhill
again.  Ideally this should all happen without the need to hand-tune the CPT 
and thread count for every CPU on the market.

Cheers, Andreas

Thank you!

Anna


Whether this behavior is optimal or not is subject to debate, and 
investigation/improvements are of course welcome.  Definitely, data checksums 
have some overhead (a few percent), and client-side data compression (which is 
done by ptlrpcd threads) would have a significant usage of CPU cycles, but 
given the large number of CPU cores on client nodes these days this may

Re: [lustre-discuss] kernel threads for rpcs in flight

2024-04-30 Thread Andreas Dilger via lustre-discuss
On Apr 29, 2024, at 02:36, Anna Fuchs <anna.fu...@uni-hamburg.de> wrote:

Hi Andreas.

Thank you very much, that helps a lot.
Sorry for the confusion, I primarily meant the client. The servers rarely have 
to compete with anything else for CPU resources I guess.

The mechanism to start new threads is relatively simple.  Before a server 
thread is processing a new request, if it is the last thread available, and not 
the maximum number of threads are running, then it will try to launch a new 
thread; repeat as needed.  So the thread  count will depend on the client RPC 
load and the RPC processing rate and lock contention on whatever resources 
those RPCs are accessing.

And what conditions are on the client? Are the threads then driven by the 
workload of the application somehow?

The number of ptlrpc threads per CPT is set by the "ptlrpcd_partner_group_size" 
module parameter, and defaults to 2 threads per CPT, IIRC.  I don't think that 
clients dynamically start/stop ptlrpcd threads at runtime.

Imagine an edge case where all but one core are pinned and at 100% constant 
load and one is dumping RAM to Lustre. Presumably, the available core will be 
taken. But will Lustre or the kernel then spawn additional threads and try to 
somehow interleave them with those of the application, or will it simply handle 
it with 1-2 threads on the available core (assume single stream to single OST)? 
In any case, I suppose the I/O transfer would suffer under the resource 
shortage, but my question would be to what extent it would (try to) hinder the 
application. For latency-critical applications, such small delays can already 
lead to idle waves. And surely, the Lustre threads are usually not CPU-hungry, 
but they will be when it comes to encryption and compression.

When there are RPCs in the queue for any ptlrpcd it will be woken up and 
scheduled by the kernel, so it will compete with the application threads.  
IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT 
queue it will try to steal RPCs from another CPT on the assumption that the 
local CPU is not generating any RPCs so it would be beneficial to offload 
threads on another CPU that *is* generating RPCs.  If the application thread is 
extremely CPU hungry, then the kernel will not schedule the ptlrpcd threads on 
those cores very often, and the "idle" core ptlrpcd threads will be able to
run more frequently.

Whether this behavior is optimal or not is subject to debate, and 
investigation/improvements are of course welcome.  Definitely, data checksums 
have some overhead (a few percent), and client-side data compression (which is 
done by ptlrpcd threads) would have a significant usage of CPU cycles, but 
given the large number of CPU cores on client nodes these days this may still 
provide a net performance benefit if the IO bottleneck is on the server.

With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, 
but the statistics are too inaccurate to capture this.  The distribution of 
threads to cores is regulated by the Linux kernel, right? Does anyone have 
experience with what happens when all CPUs are under full load with the 
application or something else?

Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a 
single client can still have tens or hundreds of RPCs in flight to different 
servers.  The client will send many RPC types directly from the process 
context, since they are waiting on the result anyway.  For asynchronous bulk 
RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= 
Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace 
application was running when the request was created.  This minimizes the 
cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores 
are not busy with userspace tasks.  Otherwise, the ptlrpcd thread on another 
CPT will steal RPCs from the queues.

Do the Lustre threads suffer? Is there a prioritization of the Lustre threads 
over other tasks?

Are you asking about the client or the server?  Many of the client RPCs are 
generated by the client threads, but for the running ptlrpcd threads do not 
have a higher priority than client application threads.  If the application 
threads are running on some cores, but other cores are idle, then the ptlrpcd 
threads on other cores will try to process the RPCs to allow the application 
threads to continue running there.  Otherwise, if all cores are busy (as is 
typical for HPC applications) then they will be scheduled by the kernel as 
needed.

Are there readily available statistics or tools for this scenario?

What statistics are you looking for?  There are "{osc,mdc}.*.stats" and 
"{osc,mdc}.*rpc_stats" that have aggregate information about RPC counts and 
latency.

Oh, right, these tell a lot. Isn't there also something to log the utilization 
and location of these threads? Otherwise, I'll continue trying with perf, which 
seems 

Re: [lustre-discuss] [EXTERNAL] [BULK] Files created in append mode don't obey directory default stripe count

2024-04-29 Thread Andreas Dilger via lustre-discuss
Simon is exactly correct.  This is expected behavior for files opened with 
O_APPEND, at least until LU-12738 is implemented.  Since O_APPEND writes are 
(by definition) entirely serialized, having multiple stripes on such files is 
mostly useless and just adds overhead.

Feel free to read https://jira.whamcloud.com/browse/LU-9341 for the very 
lengthy saga on the history of this behavior.

Cheers, Andreas

On Apr 29, 2024, at 10:42, Simon Guilbault <simon.guilba...@calculquebec.ca> wrote:

This is the expected behaviour. In the original implementation of PFL, when a
file was opened in append mode, the lock from 0 to EOF initialized all stripes
of the PFL file. We have a PFL layout on our system with 1 stripe up to 1 GB,
then it increases to 4 and then 32 stripes when the file gets very large. This
was a problem with software that was creating 4 KB log files (like slurm.out),
because those files ended up with > 32 stripes due to the append mode. This was
patched a few releases ago; that behaviour can be changed, but I would recommend
keeping 1 stripe for files that are opened in append mode.

From the manual:
O_APPEND mode. When files are opened for append, they instantiate all 
uninitialized components expressed in the layout. Typically, log files are 
opened for append, and complex layouts can be inefficient.
Note
The mdd.*.append_stripe_count and mdd.*.append_pool options can be used to 
specify special default striping for files created with O_APPEND.
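
A hedged example of setting those options persistently (run on the MGS; the pool
name "flash" is a placeholder):

```
lctl set_param -P mdd.*.append_stripe_count=1
lctl set_param -P mdd.*.append_pool=flash
```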

On Mon, Apr 29, 2024 at 11:21 AM Vicker, Darby J. (JSC-EG111) [Jacobs
Technology, Inc.] via lustre-discuss wrote:
Wow, I would say that is definitely not expected.  I can recreate this on both
of our LFS's.  One is community Lustre 2.14, the other is a DDN EXAScaler.
Shown below is our community Lustre, but we also have a 3-segment PFL on our
EXAScaler and the behavior is the same there.

$ echo > aaa
$ echo >> bbb
$ lfs getstripe aaa bbb
aaa
  lcm_layout_gen:3
  lcm_mirror_count:  1
  lcm_entry_count:   3
lcme_id: 1
lcme_mirror_id:  0
lcme_flags:  init
lcme_extent.e_start: 0
lcme_extent.e_end:   33554432
  lmm_stripe_count:  1
  lmm_stripe_size:   4194304
  lmm_pattern:   raid0
  lmm_layout_gen:0
  lmm_stripe_offset: 6
  lmm_objects:
  - 0: { l_ost_idx: 6, l_fid: [0x10006:0xace8112:0x0] }

lcme_id: 2
lcme_mirror_id:  0
lcme_flags:  0
lcme_extent.e_start: 33554432
lcme_extent.e_end:   10737418240
  lmm_stripe_count:  4
  lmm_stripe_size:   4194304
  lmm_pattern:   raid0
  lmm_layout_gen:0
  lmm_stripe_offset: -1

lcme_id: 3
lcme_mirror_id:  0
lcme_flags:  0
lcme_extent.e_start: 10737418240
lcme_extent.e_end:   EOF
  lmm_stripe_count:  8
  lmm_stripe_size:   4194304
  lmm_pattern:   raid0
  lmm_layout_gen:0
  lmm_stripe_offset: -1

bbb
lmm_stripe_count:  1
lmm_stripe_size:   2097152
lmm_pattern:   raid0
lmm_layout_gen:0
lmm_stripe_offset: 3
        obdidx        objid        objid        group
             3    179773949    0xab721fd            0


From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of
Otto, Frank via lustre-discuss
Date: Monday, April 29, 2024 at 8:33 AM
To: lustre-discuss@lists.lustre.org
Subject: [EXTERNAL] [BULK] [lustre-discuss] Files created in append mode don't 
obey directory default stripe count


See subject. Is it a known issue? Is it expected? Easy to reproduce:


# lfs getstripe .
.
stripe_count:  4 stripe_size:   1048576 pattern:   raid0 stripe_offset: -1

# echo > aaa
# echo >> bbb
# lfs getstripe .
.
stripe_count:  4 stripe_size:   1048576 pattern:   raid0 stripe_offset: -1

./aaa
lmm_stripe_count:  4
lmm_stripe_size:   1048576
lmm_pattern:   raid0
lmm_layout_gen:0
lmm_stripe_offset: 0
        obdidx        objid        objid        group
             0         2830        0xb0e            0
             1         2894        0xb4e            0
             2         2831        0xb0f            0
             3         2895        0xb4f            0

./bbb
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:   raid0
lmm_layout_gen:0
lmm_stripe_offset: 4
        obdidx        objid        objid        group
             4         2831        0xb0f            0



As you see, file "bbb" is created with stripe count 1, not the directory default of 4.

Re: [lustre-discuss] kernel threads for rpcs in flight

2024-04-28 Thread Andreas Dilger via lustre-discuss
On Apr 28, 2024, at 16:54, Anna Fuchs via lustre-discuss wrote:

The setting max_rpcs_in_flight affects, among other things, how many threads 
can be spawned simultaneously for processing the RPCs, right?

The {osc,mdc}.*.max_rpcs_in_flight are actually controlling the maximum number 
of RPCs a *client* will have in flight to any MDT or OST, while the number of 
MDS and OSS threads is controlled on the server with 
mds.MDS.mdt*.threads_{min,max} and ost.OSS.ost*.threads_{min,max} for each of 
the various service portals (which are selected by the client based on the RPC 
type).  The max_rpcs_in_flight allows concurrent operations on the client for 
multiple threads to hide network latency and to improve server utilization, 
without allowing a single client to overwhelm the server.
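
For concreteness, a hedged sketch of where the two knobs live (the value 16 is
only an example):

```
# client side: concurrent RPCs allowed per OST target
lctl get_param osc.*.max_rpcs_in_flight
lctl set_param osc.*.max_rpcs_in_flight=16

# server side (on the OSS): bounds on the bulk I/O service thread pool
lctl get_param ost.OSS.ost_io.threads_min ost.OSS.ost_io.threads_max
```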

In tests where the network is clearly a bottleneck, this setting has almost no 
effect - the network cannot keep up with processing the data, there is not so 
much to do in parallel.
With a faster network, the stats show higher CPU utilization on different cores 
(at least on the client).

What is the exact mechanism by which it is decided that a kernel thread is 
spawned for processing a bulk? Is there an RPC queue with timings or something 
similar?
Is it in any way predictable or calculable how many threads a specific workload 
will require (spawn if possible) given the data rates from the network and 
storage devices?

The mechanism to start new threads is relatively simple.  Before a server 
thread is processing a new request, if it is the last thread available, and not 
the maximum number of threads are running, then it will try to launch a new 
thread; repeat as needed.  So the thread  count will depend on the client RPC 
load and the RPC processing rate and lock contention on whatever resources 
those RPCs are accessing.

With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, 
but the statistics are too inaccurate to capture this.  The distribution of 
threads to cores is regulated by the Linux kernel, right? Does anyone have 
experience with what happens when all CPUs are under full load with the 
application or something else?


Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a 
single client can still have tens or hundreds of RPCs in flight to different 
servers.  The client will send many RPC types directly from the process 
context, since they are waiting on the result anyway.  For asynchronous bulk 
RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= 
Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace 
application was running when the request was created.  This minimizes the 
cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores 
are not busy with userspace tasks.  Otherwise, the ptlrpcd thread on another 
CPT will steal RPCs from the queues.

Do the Lustre threads suffer? Is there a prioritization of the Lustre threads 
over other tasks?

Are you asking about the client or the server?  Many of the client RPCs are 
generated by the client threads, but for the running ptlrpcd threads do not 
have a higher priority than client application threads.  If the application 
threads are running on some cores, but other cores are idle, then the ptlrpcd 
threads on other cores will try to process the RPCs to allow the application 
threads to continue running there.  Otherwise, if all cores are busy (as is 
typical for HPC applications) then they will be scheduled by the kernel as 
needed.

Are there readily available statistics or tools for this scenario?

What statistics are you looking for?  There are "{osc,mdc}.*.stats" and 
"{osc,mdc}.*rpc_stats" that have aggregate information about RPC counts and 
latency.
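
For example, a minimal sketch of reading them:

```
# per-target RPC histograms: pages per RPC, RPCs in flight, offsets
lctl get_param osc.*.rpc_stats

# per-opcode counts and latency for data and metadata RPCs
lctl get_param osc.*.stats mdc.*.stats
```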

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ko2iblnd.conf

2024-04-12 Thread Andreas Dilger via lustre-discuss
The ko2iblnd-opa settings are only used if you have Intel OPA instead of
Mellanox cards (this depends on the ko2iblnd-probe script).  You should still
have a ko2iblnd line in the server config, which is used for MLX cards, in
order to set the values to match on both sides.

As for the actual settings, someone with more LNet IB experience should chime 
in on what is best to use.  All I know is that they have to be the same on both 
sides or they get unhappy, and the usable values depend on the card type and 
MOFED/OFED version.  As a starting point I would just copy the client ko2iblnd 
options to the server and see if it works.

Cheers, Andreas

On Apr 11, 2024, at 12:02, Daniel Szkola <dszk...@fnal.gov> wrote:

On the server node(s):

options ko2iblnd-opa peer_credits=32 peer_credits_hiw=16 credits=1024 
concurrent_sends=64 ntx=2048 map_on_demand=256 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

On clients:

options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 
concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

My concern isn't so much the mismatch, because I know that's an issue, but
rather what numbers we should settle on with a recent Lustre build. I also see
the ko2iblnd-opa section in the server config, which makes me wonder whether,
since the server is actually loading ko2iblnd, the defaults are being used?

What made me look was we were seeing lots of:
LNetError: 2961324:0:(o2iblnd_cb.c:2612:kiblnd_passive_connect()) Can't accept 
conn from xxx.xxx.xxx.xxx@o2ib2, queue depth too large:  42 (<=32 wanted)

—
Dan Szkola
FNAL


On Apr 11, 2024, at 12:36 PM, Andreas Dilger <adil...@whamcloud.com> wrote:



On Apr 11, 2024, at 09:56, Daniel Szkola via lustre-discuss wrote:

Hello all,

I recently discovered some mismatches in our /etc/modprobe.d/ko2iblnd.conf 
files between our clients and servers.

Is it now recommended to keep the defaults on this module and run without a 
config file or are there recommended numbers for lustre-2.15.X?

The only thing I’ve seen that provides any guidance is the Lustre wiki and an 
HP/Cray doc:

https://www.hpe.com/psnow/resources/ebooks/a00113867en_us_v2/Lustre_Server_Recommended_Tuning_Parameters_4.x.html

Anyone have any sage advice on what ko2iblnd.conf (and possibly
ko2iblnd-opa.conf and hfi1.conf as well) should contain on modern systems?

It would be useful to know what specific settings are mismatched.  Definitely 
some of them need to be consistent between peers, others depend on your system.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud









Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ko2iblnd.conf

2024-04-11 Thread Andreas Dilger via lustre-discuss
On Apr 11, 2024, at 09:56, Daniel Szkola via lustre-discuss wrote:

Hello all,

I recently discovered some mismatches in our /etc/modprobe.d/ko2iblnd.conf 
files between our clients and servers.

Is it now recommended to keep the defaults on this module and run without a 
config file or are there recommended numbers for lustre-2.15.X?

The only thing I’ve seen that provides any guidance is the Lustre wiki and an 
HP/Cray doc:

https://www.hpe.com/psnow/resources/ebooks/a00113867en_us_v2/Lustre_Server_Recommended_Tuning_Parameters_4.x.html

Anyone have any sage advice on what ko2iblnd.conf (and possibly
ko2iblnd-opa.conf and hfi1.conf as well) should contain on modern systems?

It would be useful to know what specific settings are mismatched.  Definitely 
some of them need to be consistent between peers, others depend on your system.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Could not read from remote repository

2024-04-09 Thread Andreas Dilger via lustre-discuss
On Apr 9, 2024, at 04:16, Jannek Squar via lustre-discuss wrote:

Hey,

I tried to clone the source code via `git clone 
git://git.whamcloud.com/fs/lustre-release.git` but got an error:

"""
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
"""

Is there something going on with the repository or is the error probably on my 
side?

The above command worked for me on a login with no SSH key configured:

$ ssh-add -l
Could not open a connection to your authentication agent.
$ git clone git://git.whamcloud.com/fs/lustre-release.git
Cloning into 'lustre-release'...
remote: Counting objects: 386206, done.
remote: Compressing objects: 100% (81406/81406), done.
Receiving objects:  26% (100414/386206), 27.02 MiB | 9.00 MiB/s ...

Do you have connectivity to git.whamcloud.com (e.g. 
ping/traceroute)?

A second option is to clone from the "lustre/lustre-release" repo on GitHub, 
which is itself a clone of git://git.whamcloud.com/

Otherwise, you could create a Gerrit account at https://review.whamcloud.com/ 
and register your SSH public key there and then use:

git clone ssh://review.whamcloud.com:29418/fs/lustre-release

which you would want to do anyway if you are planning to submit any patches.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Building Lustre against Mellanox OFED

2024-03-16 Thread Andreas Dilger via lustre-discuss
On Mar 15, 2024, at 09:18, Paul Edmon via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

I'm working on building Lustre 2.15.4 against recent versions of Mellanox OFED. 
I built OFED against the specific kernel and then installed 
mlnx-ofa_kernel-modules for that specific kernel, after which I built Lustre 
against that version of OFED using:

rpmbuild --rebuild --without servers --without mpi --with mofed --define 
"_topdir `pwd`" SRPMS/lustre-2.15.4-1.src.rpm

However once I finish building and install I get:

Error: Transaction test error:
  file /etc/depmod.d/zz01-mlnx-ofa_kernel-mlx_compat.conf from install of 
kmod-mlnx-ofa_kernel-23.10-OFED.23.10.2.1.3.1.rhel8u9.x86_64 conflicts with 
file from package 
mlnx-ofa_kernel-modules-23.10-OFED.23.10.2.1.3.1.kver.4.18.0_513.18.1.el8_9.x86_64.x86_64

I saw this earlier message which matches my case: 
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2020-December/017407.html
 But no resolution.

Does anyone know the solution to this? Is there a work around?

You probably need to "rpm -Uvh" instead of "rpm -ivh" the new 
kmod-mlnx-ofa_kernel RPM?

Or, if you want to keep both RPMs installed (e.g. for different kernels) then 
you can probably just use "--force" since it looks like the .conf file would 
likely be the same from both packages.
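
For illustration, the two variants would look roughly like this (the exact package
file names depend on the MOFED build and kernel in use):

# replace the older package:
rpm -Uvh kmod-mlnx-ofa_kernel-23.10-OFED.23.10.2.1.3.1.rhel8u9.x86_64.rpm

# or keep both installed and accept the duplicate depmod .conf file:
rpm -ivh --force kmod-mlnx-ofa_kernel-23.10-OFED.23.10.2.1.3.1.rhel8u9.x86_64.rpm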

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] The confusion for mds hardware requirement

2024-03-11 Thread Andreas Dilger via lustre-discuss
All of the numbers in this example are estimates/approximations to give an idea 
about the amount of memory that the MDS may need under normal operating 
circumstances.  However, the MDS will also continue to function with more or 
less memory.  The actual amount of memory in use will change very significantly 
based on application type, workload, etc. and the numbers "256" and "100,000" 
are purely examples of how many files might be in use.

I'm not sure you can "test" those numbers, because whatever number of files you 
test with will be the number of files actually in use.  You could potentially 
_measure_ the number of files/locks in use on a large cluster, but again this 
will be highly site and application dependent.

Cheers, Andreas

On Mar 11, 2024, at 01:24, Amin Brick Mover 
mailto:aminbrickmo...@gmail.com>> wrote:

Hi,  Andreas.

Thank you for your reply.

Can I consider 256 files per core as an empirical parameter? And does the 
parameter '256' need testing based on hardware conditions? Additionally, in the 
calculation formula "12 interactive clients * 100,000 files * 2KB = 2400 MB," 
is the number '100,000' files also an empirical parameter? Do I need to test 
it, or can I directly use the values '256' and '100,000'?

Andreas Dilger mailto:adil...@whamcloud.com>> 
wrote on Monday, March 11, 2024 at 05:47:
These numbers are just estimates, you can use values more suitable to your 
workload.

Similarly, 32-core clients may be on the low side these days.  NVIDIA DGX nodes 
have 256 cores, though you may not have 1024 of them.

The net answer is that having 64GB+ of RAM is inexpensive these days and 
improves MDS performance, especially if you compare it to the cost of client 
nodes that would sit waiting for filesystem access if the MDS is short of RAM.  
Better to have too much RAM on the MDS than too little.

Cheers, Andreas

On Mar 4, 2024, at 00:56, Amin Brick Mover via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

In the Lustre Manual 5.5.2.1 section, the examples mentioned:
For example, for a single MDT on an MDS with 1,024 compute nodes, 12 
interactive login nodes, and a
20 million file working set (of which 9 million files are cached on the clients 
at one time):
Operating system overhead = 4096 MB (RHEL8)
File system journal = 4096 MB
1024 * 32-core clients * 256 files/core * 2KB = 16384 MB
12 interactive clients * 100,000 files * 2KB = 2400 MB
20 million file working set * 1.5KB/file = 30720 MB
I'm curious, how were the two numbers, 256 files/core and 100,000 files, 
determined? Why?

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud








Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] The confusion for mds hardware requirement

2024-03-10 Thread Andreas Dilger via lustre-discuss
These numbers are just estimates, you can use values more suitable to your 
workload.

Similarly, 32-core clients may be on the low side these days.  NVIDIA DGX nodes 
have 256 cores, though you may not have 1024 of them.

The net answer is that having 64GB+ of RAM is inexpensive these days and 
improves MDS performance, especially if you compare it to the cost of client 
nodes that would sit waiting for filesystem access if the MDS is short of RAM.  
Better to have too much RAM on the MDS than too little.

Cheers, Andreas

On Mar 4, 2024, at 00:56, Amin Brick Mover via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

In the Lustre Manual 5.5.2.1 section, the examples mentioned:
For example, for a single MDT on an MDS with 1,024 compute nodes, 12 
interactive login nodes, and a
20 million file working set (of which 9 million files are cached on the clients 
at one time):
Operating system overhead = 4096 MB (RHEL8)
File system journal = 4096 MB
1024 * 32-core clients * 256 files/core * 2KB = 16384 MB
12 interactive clients * 100,000 files * 2KB = 2400 MB
20 million file working set * 1.5KB/file = 30720 MB
I'm curious, how were the two numbers, 256 files/core and 100,000 files, 
determined? Why?
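
For reference, summing the example terms gives the ballpark this estimate is aiming
at (simple arithmetic on the numbers above, not an extra requirement):

4096 + 4096 + 16384 + 2400 + 30720 = 57696 MB, i.e. roughly 56 GiB

which is why 64GB+ of RAM ends up being the comfortable round number for this example.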

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Issues draining OSTs for decommissioning

2024-03-07 Thread Andreas Dilger via lustre-discuss
It's almost certainly just internal files. You could mount as ldiskfs and run 
"ls -lR" to check. 

Cheers, Andreas
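
P.S. A minimal sketch of that check, in case it helps, run with the OST stopped
(device name and mount point are placeholders):

mount -t ldiskfs /dev/<ost0060_device> /mnt/ldiskfs
ls -l  /mnt/ldiskfs             # CONFIGS, last_rcvd, O, ... are all internal
ls -lR /mnt/ldiskfs/O | less    # leftover user objects would show up under O/0/d*/
umount /mnt/ldiskfs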

> On Mar 6, 2024, at 22:23, Scott Wood via lustre-discuss 
>  wrote:
> 
> Hi folks,
> 
> Time to empty some OSTs to shut down some old arrays.  I've been following 
> the docs from 
> https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost and am 
> emptying with "lfs find /mnt/lustre/ -obd lustre-OST0060 | lfs_migrate -y" 
> (for the various OSTs) and it's looking pretty good but I do have a few 
> questions:
> 
> Q1) I've dealt with a few edge cases, missed files, etc and now "lfs find" 
> and "rbh-find" both show that the OSTs have nothing left on them but they 
> pretty much all have 236 inodes still allocated.  Is this just overhead? 
> 
> Q2) Also, one OST shows 237 inodes (lustre-OST0074_UUID shown below) but, 
> again, "lfs find" says its empty.  Is that a concern?
> 
> Q3) Lastly, this file system is under load.  Am I safe to deactivate the OSTs 
> while we're running or should I wait till our next maintenance outage?
> 
> For reference:
> [root@hpcpbs02 ~]# lfs df -i |sed -e 's/qimrb/lustre/'
> UUID  Inodes   IUsed   IFree IUse% Mounted on
> ...
> lustre-OST0060_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:96]
> lustre-OST0061_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:97]
> lustre-OST0062_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:98]
> lustre-OST0063_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:99]
> lustre-OST0064_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:100]
> lustre-OST0065_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:101]
> lustre-OST0066_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:102]
> lustre-OST0067_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:103]
> lustre-OST0068_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:104]
> lustre-OST0069_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:105]
> lustre-OST006a_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:106]
> lustre-OST006b_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:107]
> lustre-OST006c_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:108]
> lustre-OST006d_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:109]
> lustre-OST006e_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:110]
> lustre-OST006f_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:111]
> lustre-OST0070_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:112]
> lustre-OST0071_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:113]
> lustre-OST0072_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:114]
> lustre-OST0073_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:115]
> lustre-OST0074_UUID  61002112 23761001875   1% 
> /mnt/lustre[OST:116]
> lustre-OST0075_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:117]
> lustre-OST0076_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:118]
> lustre-OST0077_UUID  61002112 23661001876   1% 
> /mnt/lustre[OST:119]
> ...
> 
> Cheers!
> Scott
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre-client-dkms-2.15.4 is still checking for python2

2024-02-06 Thread Andreas Dilger via lustre-discuss
I've cherry-picked patch https://review.whamcloud.com/53947 
"LU-15655 contrib: update 
branch_comm to python3" to b2_15 to avoid this issue in the future.  This 
script is for developers and does not affect functionality of the filesystem at 
all.

Cheers, Andreas

On Jan 30, 2024, at 06:32, BALVERS Martin 
mailto:martin.balv...@danone.com>> wrote:

I found the file that still references python2.
The file contrib/scripts/branch_comm contains ‘#!/usr/bin/env python2’
After changing that to python3 and building the dkms-rpm I can install the 
generated lustre-client-dkms-2.15.4-1.el9.noarch.rpm on AlmaLinux 9.3

I have no idea what that script does, or if it functions with python3 instead 
of python2 as env.
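
For anyone hitting the same thing before a fix lands, the one-line edit and rebuild
would look something like this (a sketch; the configure/rpm step depends on how you
normally build the client packages):

cd lustre-release
sed -i '1s/python2/python3/' contrib/scripts/branch_comm
sh autogen.sh && ./configure --disable-server
make rpms    # then install the regenerated lustre-client-dkms rpm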

Gr,
Martin Balvers

From: Andreas Dilger mailto:adil...@whamcloud.com>>
Sent: Tuesday, January 23, 2024 23:32
To: BALVERS Martin mailto:martin.balv...@danone.com>>
Subject: Re: [lustre-discuss] lustre-client-dkms-2.15.4 is still checking for 
python2

** Caution - this is an external email **
Installing the DKMS is fine.  You can ignore the python2 dependency.

If you can debug *why* it is depending on python2 then a patch would be 
welcome.  Please see:
https://wiki.whamcloud.com/display/PUB/Patch+Landing+Process+Summary


On Jan 23, 2024, at 01:37, BALVERS Martin 
mailto:martin.balv...@danone.com>> wrote:

I have always installed both… Hasn’t caused issues luckily.

The binary installs, but the dkms version insists it needs python2.
If I use the binary, it will break with every minor kernel version update 
right? I’ll have to wait with updating the kernel until the lustre client 
catches up?

Thanks,
Martin Balvers

From: Andreas Dilger mailto:adil...@whamcloud.com>>
Sent: Friday, January 19, 2024 20:00
To: BALVERS Martin mailto:martin.balv...@danone.com>>
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] lustre-client-dkms-2.15.4 is still checking for 
python2

** Caution - this is an external email **
It looks like there may be a couple of test tools that are referencing python2, 
but it definitely isn't needed
for normal operation.  Are you using the lustre-client binary or the 
lustre-client-dkms?  Only one is needed.

For the short term it would be possible to override this dependency, but it 
would be good to understand
why this dependency is actually being generated.

On Jan 19, 2024, at 04:06, BALVERS Martin via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

FYI
It seems that lustre-client-dkms-2.15.4 is still checking for python2 and does 
not install on AlmaLinux 9.3

# dnf --enablerepo=lustre-client install lustre-client lustre-client-dkms
Last metadata expiration check: 0:04:50 ago on Fri Jan 19 11:43:54 2024.
Error:
Problem: conflicting requests
  - nothing provides /usr/bin/python2 needed by 
lustre-client-dkms-2.15.4-1.el9.noarch from lustre-client
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use 
not only best candidate packages)

According to the changelog this should have been fixed 
(https://wiki.lustre.org/Lustre_2.15.4_Changelog).

Regards,
Martin Balvers

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







This e-mail and any files transmitted with it are confidential and intended 
solely for the use of the individual to whom it is addressed. If you have 
received this email in error please send it back to the person that sent it to 
you. Any views or opinions presented are solely those of its author and do not 
necessarily represent those of DANONE or any of its subsidiary companies. 
Unauthorized publication, use, dissemination, forwarding, printing or copying 
of this email and its associated attachments is strictly prohibited.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Re: [lustre-discuss] ldiskfs / mdt size limits

2024-02-03 Thread Andreas Dilger via lustre-discuss
Thomas,
You are exactly correct that large MDTs can be useful for DoM if you have HDD 
OSTs. The benefit is relatively small if you have NVMe OSTs. 

If the MDT is larger than 16TB it must be formatted with the extents feature to 
address block numbers over 2^32. Unfortunately, this is _slightly_ less 
efficient than the (in)direct block addressing for very fragmented allocations, 
like those of directories, so this feature is not used for MDTs below 16TiB. 
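
If you want to confirm how a given MDT was formatted, the feature list is visible in
the superblock; a small sketch (device path is a placeholder):

dumpe2fs -h /dev/<mdt_device> 2>/dev/null | grep -i 'features'
# a >16TiB MDT should list "extent" (and normally "64bit") among the filesystem features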

Cheers, Andreas

> On Feb 2, 2024, at 06:35, Thomas Roth via lustre-discuss 
>  wrote:
> 
> Hi all,
> 
> confused  about size limits:
> 
> I distinctly remember trying to format a ~19 TB disk / LV for use as an MDT, 
> with ldiskfs, and failing to do so: the max size for the underlying ext4 is 
> 16 TB.
> Knew that, had ignoed that, but not a problem back then - just adapted the 
> logical volume's size.
> 
> Now I have a 24T disk, and neither mkfs.lustre nor Lustre itself have show 
> any issues with it.
> 'df -h' does show the 24T, 'df -ih' shows the expected 4G of inodes.
> I suppose this MDS has a lot of space for directories and stuff, or for DOM.
> But why does it work in the first place? ldiskfs extends beyond all limits 
> these days?
> 
> Regards,
> Thomas
> 
> --
> 
> Thomas Roth
> Department: Informationstechnologie
> Location: SB3 2.291
> 
> 
> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
> 
> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
> Managing Directors / Geschäftsführung:
> Professor Dr. Paolo Giubellino, Jörg Blaurock
> Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
> Ministerialdirigent Dr. Volkmar Dietz
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre github mirror out of sync

2024-01-26 Thread Andreas Dilger via lustre-discuss
No particular reason.  I normally sync the github tree manually after Oleg
lands patches to master, but forgot to do it the last couple of times.
It's been updated now.  Thanks for pointing it out.

On Jan 26, 2024, at 00:55, Tommi Tervo  wrote:
> 
> Is sync between https://git.whamcloud.com/fs/lustre-release.git and 
> https://github.com/lustre/lustre-release off on purpose?
> 
> BR,
> Tommi

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Odd behavior with tunefs.lustre and device index

2024-01-24 Thread Andreas Dilger via lustre-discuss
This is more like a bug report and should be filed in Jira.
That said, no guarantee that someone would be able to
work on this in a timely manner.

On Jan 24, 2024, at 09:47, Backer via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Just pushing it to the top of the inbox :)  Or is there another distribution 
list that is more appropriate for this type of question? I am also trying the 
devel mailing list.

On Sun, 21 Jan 2024 at 18:34, Backer 
mailto:backer.k...@gmail.com>> wrote:
Just to clarify: OSS-2 is completely powered off (hard power off without any 
graceful shutdown) before starting work on OSS-3.

On Sun, 21 Jan 2024 at 12:12, Backer 
mailto:backer.k...@gmail.com>> wrote:
Hi All,

I am seeing a behavior with tunefs.lustre. After changing the failover node and 
trying to mount an OST, I am getting the following error:

The target service's index is already in use. (/dev/sdd)

After the above error, and after performing --writeconf once, I can repeat these 
steps (see below) any number of times on any OSS without --writeconf.

This is an effort to mount an OST to a new OSS. I reproduced this issue after 
simplifying some steps and reproducing the behavior (see below) consistently. I 
was wondering if anyone could help me to understand this?

[root@OSS-2 opc]# lctl list_nids
10.99.101.18@tcp1
[root@OSS-2 opc]#

[root@OSS-2 opc]# mkfs.lustre --reformat  --ost --fsname="testfs" --index="64"  
--mgsnode "10.99.101.6@tcp1" --mgsnode "10.99.101.7@tcp1" --servicenode 
"10.99.101.18@tcp1" "/dev/sdd"

   Permanent disk data:
Target: testfs:OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1062
  (OST first_time update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1

device size = 51200MB
formatting backing filesystem ldiskfs on /dev/sdd
target name   testfs:OST0040
kilobytes 52428800
options-J size=1024 -I 512 -i 69905 -q -O 
extents,uninit_bg,mmp,dir_nlink,quota,project,huge_file,^fast_commit,flex_bg -G 
256 -E resize="4290772992",lazy_journal_init="0",lazy_itable_init="0" -F
mkfs_cmd = mke2fs -j -b 4096 -L testfs:OST0040  -J size=1024 -I 512 -i 69905 -q 
-O extents,uninit_bg,mmp,dir_nlink,quota,project,huge_file,^fast_commit,flex_bg 
-G 256 -E resize="4290772992",lazy_journal_init="0",lazy_itable_init="0" -F 
/dev/sdd 52428800k
Writing CONFIGS/mountdata

[root@OSS-2 opc]# tunefs.lustre --dryrun /dev/sdd
checking for existing Lustre data: found

   Read previous values:
Target: testfs-OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1062
  (OST first_time update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1


   Permanent disk data:
Target: testfs:OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1062
  (OST first_time update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1

exiting before disk write.
[root@OSS-2 opc]#

[root@OSS-2 opc]# tunefs.lustre --erase-param failover.node --servicenode 
10.99.101.18@tcp1 /dev/sdd
checking for existing Lustre data: found

   Read previous values:
Target: testfs-OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1062
  (OST first_time update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1


   Permanent disk data:
Target: testfs:OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1062
  (OST first_time update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1

Writing CONFIGS/mountdata

[root@OSS-2 opc]# mkdir /testfs-OST0040
[root@OSS-2 opc]# mount -t lustre /dev/sdd  /testfs-OST0040
mount.lustre: increased 
'/sys/devices/platform/host5/session3/target5:0:0/5:0:0:1/block/sdd/queue/max_sectors_kb'
 from 1024 to 16384
[root@OSS-2 opc]#

[root@OSS-2 opc]# tunefs.lustre --dryrun /dev/sdd
checking for existing Lustre data: found

   Read previous values:
Target: testfs-OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1002
  (OST no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1


   Permanent disk data:
Target: testfs-OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1002
  (OST no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1

exiting before disk write.
[root@OSS-2 opc]#



Going over to OSS-3 and tryin

Re: [lustre-discuss] OST still has inodes and size after deleting all files

2024-01-19 Thread Andreas Dilger via lustre-discuss


On Jan 19, 2024, at 13:48, Pavlo Khmel via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Hi,

I'm trying to remove 4 OSTs.

# lfs osts
OBDS:
0: cluster-OST0000_UUID ACTIVE
1: cluster-OST0001_UUID ACTIVE
2: cluster-OST0002_UUID ACTIVE
3: cluster-OST0003_UUID ACTIVE
. . .

I moved all files to other OSTs. "lfs find" cannot find any files on these 4 
OSTs.

# time lfs find --ost 0 --ost 1 --ost 2 --ost 3 /cluster

real 936m8.528s
user 13m48.298s
sys 210m1.245s

But still: 2624 inodes are in use and 14.5G total size.

# lfs df -i | grep -e OST0000 -e OST0001 -e OST0002 -e OST0003
cluster-OST0000_UUID  4293438576   644  4293437932   1% /cluster[OST:0]
cluster-OST0001_UUID  4293438576   640  4293437936   1% /cluster[OST:1]
cluster-OST0002_UUID  4293438576   671  4293437905   1% /cluster[OST:2]
cluster-OST0003_UUID  4293438576   669  4293437907   1% /cluster[OST:3]

# lfs df -h | grep -e OST0000 -e OST0001 -e OST0002 -e OST0003
cluster-OST0000_UUID   29.2T   3.8G   27.6T   1% /cluster[OST:0]
cluster-OST0001_UUID   29.2T   3.7G   27.6T   1% /cluster[OST:1]
cluster-OST0002_UUID   29.2T   3.3G   27.6T   1% /cluster[OST:2]
cluster-OST0003_UUID   29.2T   3.7G   27.6T   1% /cluster[OST:3]

I tried to check the file-system for errors:

# umount /lustre/ost01
# e2fsck -fy /dev/mapper/ost01

and

# lctl lfsck_start --device cluster-OST0001
# lctl get_param -n osd-ldiskfs.cluster-OST0001.oi_scrub
. . .
status: completed

I tried to mount OST as ldiskfs and there are several files in /O/0/d*/

# umount /lustre/ost01
# mount -t ldiskfs /dev/mapper/ost01 /mnt/
# ls -Rhl /mnt/O/0/d*/
. . .
/mnt/O/0/d11/:
-rw-rw-rw- 1 user1 group1 603K Nov  8 21:37 450605003
/mnt/O/0/d12/:
-rw-rw-rw- 1 user1 group1 110K Jun 16  2023 450322028
-rw-rw-rw- 1 user1 group1  21M Nov  8 22:17 450605484
. . .

Is this expected behavior? Is it safe to delete the OST even with those files?

You can run the debugfs "stat" command to print the "fid" xattr and it will 
print the MDT
parent FID for use with "lfs fid2path" on the client to see if there are any 
files related
to these objects.  You could also run "ll_decode_filter_fid" to do the same 
thing on the
mounted ldiskfs filesystem.
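
A rough sketch of both approaches against one of the objects listed above (the FID
value in the fid2path line is just an example of the output format):

# read-only, against the device:
debugfs -c -R 'stat O/0/d11/450605003' /dev/mapper/ost01   # look for the "fid" xattr

# or on the ldiskfs mount:
ll_decode_filter_fid /mnt/O/0/d11/450605003

# then, from a Lustre client, map the reported parent FID back to a path:
lfs fid2path /cluster '[0x200000401:0x1234:0x0]'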

It is likely that there are a few stray objects from deleted files, but hard to 
say for sure.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre-client-dkms-2.15.4 is still checking for python2

2024-01-19 Thread Andreas Dilger via lustre-discuss
It looks like there may be a couple of test tools that are referencing python2, 
but it definitely isn't needed
for normal operation.  Are you using the lustre-client binary or the 
lustre-client-dkms?  Only one is needed.

For the short term it would be possible to override this dependency, but it 
would be good to understand
why this dependency is actually being generated.
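
A hedged example of such an override, if you just need the package installed right
away (note that --nodeps skips all dependency checking, and "dnf download" needs the
dnf-plugins-core download plugin):

dnf download --enablerepo=lustre-client lustre-client-dkms
rpm -ivh --nodeps lustre-client-dkms-2.15.4-1.el9.noarch.rpm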

On Jan 19, 2024, at 04:06, BALVERS Martin via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

FYI
It seems that lustre-client-dkms-2.15.4 is still checking for python2 and does 
not install on AlmaLinux 9.3

# dnf --enablerepo=lustre-client install lustre-client lustre-client-dkms
Last metadata expiration check: 0:04:50 ago on Fri Jan 19 11:43:54 2024.
Error:
Problem: conflicting requests
  - nothing provides /usr/bin/python2 needed by 
lustre-client-dkms-2.15.4-1.el9.noarch from lustre-client
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use 
not only best candidate packages)

According to the changelog this should have been fixed 
(https://wiki.lustre.org/Lustre_2.15.4_Changelog).

Regards,
Martin Balvers

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre errors asking for help

2024-01-17 Thread Andreas Dilger via lustre-discuss
Roman,
have you tried running e2fsck on the underlying device ("-fn" to start)?  It is 
usually best
to run with the latest version of e2fsprogs as it has most fixes.  
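
A minimal sketch of that first pass (device name is a placeholder; -fn is read-only
and makes no changes):

# with the OST unmounted / failed over:
e2fsck -fn /dev/<ost0004_device> 2>&1 | tee /tmp/e2fsck-OST0004.log

# only if the read-only pass looks sane, follow up with a repair pass:
e2fsck -fp /dev/<ost0004_device>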

It is definitely strange that all OSTs are reporting errors at the same time, 
which makes me
wonder how the underlying hardware is holding up?  Can you log in to the 
controller and check
the RAID status?

The error might be coming from the Object Index on those OSTs.  However, this 
version is old
enough that I'm not sure if OI Scrub is even existed in that version.  
Otherwise, it would be
possible to just remove the OI files and they would be recreated on the next 
mount.

The filesystem currently isn't able to create any new files on those OSTs, so 
that may also
be why the performance is lower.

After 12+ years, it might be time to update to newer storage?  In particular, 
such old HDDs
often fail after a significant power failure, so you might be running on the 
last legs, and
it's a good time to make a backup.  Given the age of the storage, I expect a 
modern HDD or
two would have enough capacity to backup the whole filesystem (even if not 
performing as
well), in case you don't have a chance to upgrade before it finally gives out.

Cheers, Andreas

> On Jan 17, 2024, at 17:55, Baranowski, Roman wrote:
> 
> 
> Dear All,
> 
> We have a legacy version of Lustre installed as part of a DDN storage 
> solution:
> 
> lustre: 2.4.3 (circa 2011)
> 
> kernel: patchless_client
> 
> Build Version: 
> EXAScaler-ddn1.0--PRISTINE-2.6.32-358.23.2.el6_lustre.es279.devel.x86_64
> 
> 
> 
> It has been running fine for years but after a particularly bad power 
> failure, it started producing the following messages:
> 
> Jan 15 10:03:07 mds2 kernel: : LustreError: 
> 3394:0:(osp_precreate.c:989:osp_precreate_thread()) 
> scratch-OST0014-osc-MDT: cannot precreate objects: rc = -116
> Jan 15 10:03:07 mds2 kernel: : LustreError: 
> 3394:0:(osp_precreate.c:989:osp_precreate_thread()) Skipped 210 previous 
> similar messages
> Jan 15 10:07:51 mds2 kernel: : Lustre: scratch-OST000f-osc-MDT: slow 
> creates, last=[0x1000f:0x1217571a:0x0], 
> next=[0x1000f:0x1217571a:0x0], reserved=0, syn_changes=0, 
> syn_rpc_in_progress=0, status=0
> Jan 15 10:07:51 mds2 kernel: : Lustre: Skipped 3 previous similar messages
> Jan 15 10:08:32 oss5 kernel: : LustreError: 
> 26943:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: 
> rc = -116
> Jan 15 10:08:32 oss5 kernel: : LustreError: 
> 26943:0:(ofd_obd.c:1348:ofd_create()) Skipped 66 previous similar messages
> Jan 15 10:09:26 oss4 kernel: : LustreError: 
> 18223:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: 
> rc = -116
> Jan 15 10:09:26 oss4 kernel: : LustreError: 
> 18223:0:(ofd_obd.c:1348:ofd_create()) Skipped 70 previous similar messages
> Jan 15 10:09:37 oss3 kernel: : LustreError: 
> 16621:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: 
> rc = -116
> Jan 15 10:09:37 oss3 kernel: : LustreError: 
> 16621:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
> Jan 15 10:09:38 mds2 kernel: : Lustre: scratch-OST0014-osc-MDT: slow 
> creates, last=[0x10014:0x11dd257a:0x0], 
> next=[0x10014:0x11dd257a:0x0], reserved=0, syn_changes=0, 
> syn_rpc_in_progress=0, status=-116
> Jan 15 10:13:12 mds2 kernel: : LustreError: 
> 3404:0:(osp_precreate.c:484:osp_precreate_send()) 
> scratch-OST0004-osc-MDT: can't precreate: rc = -116
> Jan 15 10:13:12 mds2 kernel: : LustreError: 
> 3404:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 226 previous 
> similar messages
> Jan 15 10:13:12 mds2 kernel: : LustreError: 
> 3404:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 226 previous 
> similar messages
> Jan 15 10:13:12 mds2 kernel: : LustreError: 
> 3404:0:(osp_precreate.c:989:osp_precreate_thread()) 
> scratch-OST0004-osc-MDT: cannot precreate objects: rc = -116
> Jan 15 10:13:12 mds2 kernel: : LustreError: 
> 3404:0:(osp_precreate.c:989:osp_precreate_thread()) Skipped 226 previous 
> similar messages
> Jan 15 10:18:37 oss5 kernel: : LustreError: 
> 1791:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc 
> = -116
> Jan 15 10:18:37 oss5 kernel: : LustreError: 
> 1791:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
> Jan 15 10:19:36 oss4 kernel: : LustreError: 
> 1687:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc 
> = -116
> Jan 15 10:19:36 oss4 kernel: : LustreError: 
> 1687:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
> Jan 15 10:19:42 oss3 kernel: : LustreError: 
> 1196:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc 
> = -116
> Jan 15 10:19:42 oss3 kernel: : LustreError: 
> 1196:0:(ofd_obd.c:1348:ofd_create()) Skipped 75 previous similar messages
> Jan 15 10:23:16 mds2 kernel: : LustreError: 
> 3400:0:(osp_precreate.c:484:osp_precreate_send()) 
> scratch-OST000f-osc-MDT: can't precreate:

Re: [lustre-discuss] LNet Multi-Rail config - with BODY!

2024-01-16 Thread Andreas Dilger via lustre-discuss
Hello Gwen,
I'm not a networking expert, but it seems entirely possible that the MR 
discovery in 2.12.9
isn't doing as well as what is in 2.15.3 (or 2.15.4 for that matter).  It would 
make more sense
to have both nodes running the same (newer) version before digging too deeply 
into this.

We have definitely seen performance > 1 IB interface from a single node in our 
testing,
though I can't say if that was done with lnet_selftest or with something else.
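
In case it helps, a bare-bones lnet_selftest run between two nodes looks roughly like
the following (NIDs are placeholders for one client and one server; the lnet_selftest
module must be loaded on both ends, and the syntax is from memory, so double-check it
against the manual):

modprobe lnet_selftest            # on both nodes
export LST_SESSION=$$
lst new_session mr_bw
lst add_group clients 10.0.50.27@o2ib
lst add_group servers 10.0.50.28@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --concurrency 8 --from clients --to servers \
    brw write check=simple size=1M
lst run bulk_rw
lst stat clients servers          # let it run for a while, Ctrl-C to stop
lst end_session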

Cheers, Andreas

On Jan 16, 2024, at 08:14, Gwen Dawes via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Hi folks,

Let's try that again.

I'm in the luxury position of having four IB cards that I'm trying to
squeeze the most Lustre performance out of.

I have a small test setup - two machines - a client (2.12.9) and a
server (2.15.3) with four IB cards each. I'm able to set them up as
Multi-Rail and each one can discover the other as such. However, I
can't seem to get lnet_selftest to give me more speed than a single
interface, as reported by ib_send_bw.

Am I missing some config here? Is LNet just not capable of doing more
than one connection per NID?

Gwen
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Mixing ZFS and LDISKFS

2024-01-12 Thread Andreas Dilger via lustre-discuss
All of the OSTs and MDTs are "independently managed" (have their own connection 
state between each client and target) so this should be possible, though I 
don't know of sites that are doing this.  Possibly this makes sense to put NVMe 
flash OSTs on ldiskfs, and HDD OSTs on ZFS, and then put them in OST pools so 
that they are managed separately.
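
A sketch of that kind of split with OST pools (filesystem name, pool names, and OST
index ranges are placeholders):

# on the MGS:
lctl pool_new testfs.flash
lctl pool_add testfs.flash testfs-OST[0000-0003]
lctl pool_new testfs.hdd
lctl pool_add testfs.hdd testfs-OST[0004-000f]

# on a client, direct new files in a directory to one pool or the other:
lfs setstripe --pool flash /mnt/testfs/scratch
lfs setstripe --pool hdd   /mnt/testfs/archive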

On Jan 12, 2024, at 10:38, Backer 
mailto:backer.k...@gmail.com>> wrote:

Thank you Andreas! How about mixing OSTs?  The requirement is to do RAID with 
small volumes using ZFS and have a large OST. This is to reduce the number of 
OSTs overall as the cluster being extended.

On Fri, 12 Jan 2024 at 11:26, Andreas Dilger 
mailto:adil...@whamcloud.com>> wrote:
Yes, some systems use ldiskfs for the MDT (for performance) and ZFS for the 
OSTs (for low-cost RAID).  The IOPS performance of ZFS is low vs. ldiskfs, but 
the streaming bandwidth is fine.

Cheers, Andreas

> On Jan 12, 2024, at 08:40, Backer via lustre-discuss 
> mailto:lustre-discuss@lists.lustre.org>> 
> wrote:
>
> 
> Hi,
>
> Could we mix ZFS and LDISKFS together in a cluster?
>
> Thank you,
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Recommendation on number of OSTs

2024-01-12 Thread Andreas Dilger via lustre-discuss
I would recommend *not* to use too many OSTs as this causes fragmentation of 
the free space, and excess overhead in managing the connections.  Today, single 
OSTs can be up to 500TiB in size (or larger, though not necessarily optimal for 
performance). Depending on your cluster size and total capacity, it is typical 
for large systems to have a couple hundred OSTs, 2-4 per OSS balancing the 
storage and network bandwidth.

On Jan 12, 2024, at 07:37, Backer via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Hi All,

What is the recommendation on the total number of OSTs?

In order to maximize throughput, one option is to go for a larger number of OSSes 
with small OSTs. This means ending up with 1000s of OSTs. Any suggestions or 
recommendations?

Thank you,

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Mixing ZFS and LDISKFS

2024-01-12 Thread Andreas Dilger via lustre-discuss
Yes, some systems use ldiskfs for the MDT (for performance) and ZFS for the 
OSTs (for low-cost RAID).  The IOPS performance of ZFS is low vs. ldiskfs, but 
the streaming bandwidth is fine. 

Cheers, Andreas

> On Jan 12, 2024, at 08:40, Backer via lustre-discuss 
>  wrote:
> 
> 
> Hi,
> 
> Could we mix ZFS and LDISKFS together in a cluster? 
> 
> Thank you,
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Symbols not found in newly built lustre?

2024-01-11 Thread Andreas Dilger via lustre-discuss
On Jan 10, 2024, at 02:37, Jan Andersen mailto:j...@comind.io>> 
wrote:

I am running Rocky 8.9 (uname -r: 4.18.0-513.9.1.el8_9.x86_64) and have, 
apparently successfully, built the lustre rpms:

[root@mds lustre-release]# ll *2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root  4640828 Jan 10 09:19 
kmod-lustre-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root 42524976 Jan 10 09:19 
kmod-lustre-debuginfo-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root   555320 Jan 10 09:19 
kmod-lustre-osd-ldiskfs-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root  4877728 Jan 10 09:19 
kmod-lustre-osd-ldiskfs-debuginfo-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root50936 Jan 10 09:19 
kmod-lustre-tests-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root   312812 Jan 10 09:19 
kmod-lustre-tests-debuginfo-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root   868948 Jan 10 09:19 lustre-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root  1550612 Jan 10 09:19 
lustre-debuginfo-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root  3982392 Jan 10 09:19 
lustre-debugsource-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root   216548 Jan 10 09:19 
lustre-devel-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root46220 Jan 10 09:19 
lustre-iokit-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root20028 Jan 10 09:19 
lustre-osd-ldiskfs-mount-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root36788 Jan 10 09:19 
lustre-osd-ldiskfs-mount-debuginfo-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root12136 Jan 10 09:19 
lustre-resource-agents-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root 16264600 Jan 10 09:19 
lustre-tests-2.15.4-1.el8.x86_64.rpm
-rw-r--r--. 1 root root   519492 Jan 10 09:19 
lustre-tests-debuginfo-2.15.4-1.el8.x86_64.rpm

I installed them with 'dnf install *2.15.4-1.el8.x86_64.rpm' and that didn't 
return an error, and they show up as installed. However, depmod won't load them 
because they can't find a number of symbols - these symbols are all defined in 
the source code, as far as I can tell, so what is going on?

[root@mds lustre-release]# depmod -v | grep lustre

I'm not sure where there is a problem?

   -v, --verbose
   In verbose mode, depmod will print (to stdout) all the symbols each
   module depends on and the module's file name which provides that
   symbol.

This is printing out *all* of the needed symbols, and it appears they are being 
met by the specified modules.

I used to use "depmod -ae" to check all modules and only print errors.
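
Something along these lines, run against the kernel the modules were built for, should
only print unresolved symbols (the kernel version shown is just the one from this thread):

depmod -ae -F /boot/System.map-4.18.0-513.9.1.el8_9.x86_64 4.18.0-513.9.1.el8_9.x86_64
modprobe -v lustre     # then check dmesg for any "Unknown symbol" messages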

Cheers, Andreas

/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc_gss.ko needs 
"sptlrpc_unpack_user_desc": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc_gss.ko needs 
"obd_timeout": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/obdclass.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc_gss.ko needs 
"cfs_crypto_hash_final": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/libcfs.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc_gss.ko needs 
"LNetPrimaryNID": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/lnet.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc_gss.ko needs 
"sunrpc_cache_lookup_rcu": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/kernel/net/sunrpc/sunrpc.ko.xz
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/ko2iblnd.ko needs 
"lnet_cpt_of_nid": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/lnet.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/ko2iblnd.ko needs 
"cfs_cpt_bind": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/libcfs.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/ko2iblnd.ko needs 
"__ib_alloc_pd": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/kernel/drivers/infiniband/core/ib_core.ko.xz
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/ko2iblnd.ko needs 
"rdma_resolve_addr": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/kernel/drivers/infiniband/core/rdma_cm.ko.xz
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/fld.ko needs 
"class_export_put": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/obdclass.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/fld.ko needs 
"RQF_FLD_QUERY": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/fld.ko needs 
"cfs_fail_loc": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/libcfs.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/lfsck.ko needs 
"ptlrpc_set_destroy": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/lfsck.ko needs 
"class_export_put": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/obdclass.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/lfsck.ko needs 
"seq_client_fini": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/fid.ko

Re: [lustre-discuss] 2.15.4 o2iblnd on RoCEv2?

2024-01-10 Thread Andreas Dilger via lustre-discuss
Granted that I'm not an LNet expert, but "errno: -1 descr: cannot parse net 
'<255:65535>' " doesn't immediately lead me to the same conclusion as if 
"unknown interface 'ib0' " were printed for the error message.  Also "errno: 
-1" is "-EPERM = Operation not permitted", and doesn't give the same 
information as "-ENXIO = No such device or address" or even "-EINVAL = Invalid 
argument" would.

That said, I can't even offer a patch for this myself, since that exact error 
message is used in a few different places, though I suspect it is coming from 
lustre_lnet_config_ni().

Looking further into this, now that I've found where (I think) the error 
message is generated, it seems that "errno: -1" is not "-EPERM" but rather 
"LUSTRE_CFG_RC_BAD_PARAM", which is IMHO a travesty to use different error 
numbers (and then print them after "errno:") instead of existing POSIX error 
codes that could fill the same role (with some creative mapping):

#define LUSTRE_CFG_RC_NO_ERR 0  => fine
#define LUSTRE_CFG_RC_BAD_PARAM -1  => -EINVAL
#define LUSTRE_CFG_RC_MISSING_PARAM -2  => -EFAULT
#define LUSTRE_CFG_RC_OUT_OF_RANGE_PARAM-3  => -ERANGE
#define LUSTRE_CFG_RC_OUT_OF_MEM-4  => -ENOMEM
#define LUSTRE_CFG_RC_GENERIC_ERR   -5  => -ENODATA
#define LUSTRE_CFG_RC_NO_MATCH  -6  => -ENOMSG
#define LUSTRE_CFG_RC_MATCH -7  => -EXFULL
#define LUSTRE_CFG_RC_SKIP  -8  => -EBADSLT
#define LUSTRE_CFG_RC_LAST_ELEM -9  => -ECHRNG
#define LUSTRE_CFG_RC_MARSHAL_FAIL  -10 => -ENOSTR

I don't think "overloading" the POSIX error codes to mean something similar is 
worse than using random numbers to report errors.  Also, in some cases (even in 
lustre_lnet_config_ni()) it is using "rc = -errno" so the LUSTRE_CFG_RC_* 
errors are *already* conflicting with POSIX error numbers, and it is impossible to 
distinguish between them...

The main question is whether changing these numbers will break a user->kernel 
interface, or if these definitions are only in userspace?  It looks like 
lnetctl.c is only ever checking "!= LUSTRE_CFG_RC_NO_ERR", so maybe it is fine? 
 None of the values currently overlap, so it would be possible to start 
accepting either of the values for the return in the user tools, and then at 
some point in the future start actually returning them...  Something for the 
LNet folks to figure out.

Cheers, Andreas

On Jan 10, 2024, at 13:29, Jeff Johnson 
mailto:jeff.john...@aeoncomputing.com>> wrote:

A LU ticket and patch for lnetctl or for me being an under-caffeinated
idiot? ;-)

On Wed, Jan 10, 2024 at 12:06 PM Andreas Dilger 
mailto:adil...@whamcloud.com>> wrote:

It would seem that the error message could be improved in this case?  Could you 
file an LU ticket for that with the reproducer below, and ideally along with a 
patch?

Cheers, Andreas
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.15.4 o2iblnd on RoCEv2?

2024-01-10 Thread Andreas Dilger via lustre-discuss
It would seem that the error message could be improved in this case?  Could you 
file an LU ticket for that with the reproducer below, and ideally along with a 
patch?

Cheers, Andreas

> On Jan 10, 2024, at 11:37, Jeff Johnson  
> wrote:
> 
> Man am I an idiot. Been up all night too many nights in a row and not
> enough coffee. It helps if you use the correct --net designation. I
> was typing ib0 instead of o2ib0. Declaring as o2ib0 works fine.
> 
> (cleanup from previous)
> lctl net down && lustre_rmmod
> 
> (new attempt)
> modprobe lnet -v
> lnetctl lnet configure
> lnetctl net add --if enp1s0np0 --net o2ib0
> lnetctl net show
> net:
>- net type: lo
>  local NI(s):
>- nid: 0@lo
>  status: up
>- net type: o2ib
>  local NI(s):
>- nid: 10.0.50.27@o2ib
>  status: up
>  interfaces:
>  0: enp1s0np0
> 
> Lots more to test and verify but the original mailing list submission
> was total pilot error on my part. Apologies to all who spent cycles
> pondering this nothingburger.
> 
> 
> 
> 
>> On Tue, Jan 9, 2024 at 7:45 PM Jeff Johnson
>>  wrote:
>> 
>> Howdy intrepid Lustrefarians,
>> 
>> While starting down the debug rabbit hole I thought I'd raise my hand
>> and see if anyone has a few magic beans to spare.
>> 
>> I cannot get lnet (via lnetctl) to init a o2iblnd interface on a
>> RoCEv2 interface.
>> 
>> Running `lnetctl net add --net ib0 --if enp1s0np0` results in
>> net:
>>  errno: -1
>>  descr: cannot parse net '<255:65535>'
>> 
>> Nothing in dmesg to indicate why. Search engines aren't coughing up
>> much here either.
>> 
>> Env: Rocky 8.9 x86_64, MOFED 5.8-4.1.5.0, Lustre 2.15.4
>> 
>> I'm able to run mpi over the RoCEv2 interface. Utils like ibstatus and
>> ibdev2netdev report it correctly. ibv_rc_pingpong works fine between
>> nodes.
>> 
>> Configuring as socklnd works fine. `lnetctl net add --net tcp0 --if
>> enp1s0np0 && lnetctl net show`
>> [root@r2u11n3 ~]# lnetctl net show
>> net:
>>- net type: lo
>>  local NI(s):
>>- nid: 0@lo
>>  status: up
>>- net type: tcp
>>  local NI(s):
>>- nid: 10.0.50.27@tcp
>>  status: up
>>  interfaces:
>>  0: enp1s0np0
>> 
>> I verified the RoCEv2 interface using nVidia's `cma_roce_mode` as well
>> as sysfs references
>> 
>> [root@r2u11n3 ~]# cma_roce_mode -d mlx5_0 -p 1
>> RoCE v2
>> 
>> Ideas? Suggestions? Incense?
>> 
>> Thanks,
>> 
>> --Jeff
> 
> 
> 
> --
> --
> Jeff Johnson
> Co-Founder
> Aeon Computing
> 
> jeff.john...@aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001   f: 858-412-3845
> m: 619-204-9061
> 
> 4170 Morena Boulevard, Suite C - San Diego, CA 92117
> 
> High-Performance Computing / Lustre Filesystems / Scale-out Storage
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Extending Lustre file system

2024-01-08 Thread Andreas Dilger via lustre-discuss
I would recommend *against* mounting all 175 OSTs at the same time.  There are 
(or at least were*) some issues with the MGS registration RPCs timing out when 
too many config changes happen at once.  Your "mount and wait 2 sec" is more 
robust and doesn't take very much time (a few minutes) vs. having to restart if 
some of the OSTs have problems registering.  Also, the config logs will have 
the OSTs in a nice order, which doesn't affect any functionality, but makes it 
easier for the admin to see if some device is connected in "lctl dl" output.
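
The mount-and-wait approach is also easy to script; a sketch along these lines
(device naming and mount points are site-specific):

for dev in /dev/mapper/ost*; do
    mp=/lustre/$(basename $dev)
    mkdir -p $mp
    mount -t lustre $dev $mp && echo "mounted $dev" || { echo "FAILED: $dev"; break; }
    sleep 2
done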

Cheers, Andreas


[*] some fixes have landed over time to improve registration RPC resend.

On Jan 8, 2024, at 11:57, Thomas Roth via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Yes, sorry, I meant the actual procedure of mounting the OSTs for the first 
time.

Last year I did that with 175 OSTs - replacements for EOL hardware. All OSTs 
had been formatted with a specific index, so probably creating a suitable 
/etc/fstab everywhere and sending a 'mount -a -t lustre' to all OSTs 
simultaneously would have worked.

But why the hurry? Instead, I logged in to my new OSS, mounted the OSTs with 2 
sec between each mount command, watched the OSS log, watched the MDS log, saw 
the expected log messages, proceeded to the new OSS - all fine ;-)  Such a 
leisurely approach takes its time, of course.

Once all OSTs were happily incorporated, we put the max_create_count (set to 0 
before) to some finite value and started file migration. As long as the 
migration is more effective (faster) than the users' file creations, the 
result should be evenly filled OSTs with a good mixture of files (file sizes, 
ages, types).
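
For reference, the knobs involved look roughly like this (filesystem name, OST name
patterns and the final create count are placeholders; max_create_count is set on the
MDS, the migration runs on a client):

# on the MDS: block object creation on the new OSTs until they are balanced
lctl set_param osp.testfs-OST007*.max_create_count=0

# on a client: migrate files off the fuller OSTs
lfs find /mnt/testfs -obd testfs-OST0000 -size +100M | lfs_migrate -y

# when fill levels look reasonable, allow creates again
lctl set_param osp.testfs-OST007*.max_create_count=20000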


Cheers
Thomas

On 1/8/24 19:07, Andreas Dilger wrote:
The need to rebalance depends on how full the existing OSTs are.  My 
recommendation if you know that the data will continue to grow is to add new 
OSTs when the existing ones are at 60-70% full, and add them in larger groups 
rather than one at a time.
Cheers, Andreas
On Jan 8, 2024, at 09:29, Thomas Roth via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Just mount the OSTs, one by one and perhaps not if your system is heavily 
loaded. Follow what happens in the MDS log and the OSS log.
And try to rebalance the OSTs' fill levels afterwards - very empty OSTs will 
attract all new files, which might be hot and direct your users' fire to your 
new OSS only.

Regards,
Thomas

On 1/8/24 15:38, Backer via lustre-discuss wrote:
Hi,
Good morning and happy new year!
I have a quick question on extending a lustre file system. The extension is 
performed online. I am looking for any best practices or anything to watchout 
while doing the file system extension. The file system extension is done adding 
new OSS and many OSTs within these servers.
Really appreciate your help on this.
Regards,
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Extending Lustre file system

2024-01-08 Thread Andreas Dilger via lustre-discuss
The need to rebalance depends on how full the existing OSTs are.  My 
recommendation if you know that the data will continue to grow is to add new 
OSTs when the existing ones are at 60-70% full, and add them in larger groups 
rather than one at a time.

Cheers, Andreas

> On Jan 8, 2024, at 09:29, Thomas Roth via lustre-discuss 
>  wrote:
> 
> Just mount the OSTs, one by one and perhaps not if your system is heavily 
> loaded. Follow what happens in the MDS log and the OSS log.
> And try to rebalance the OSTs' fill levels afterwards - very empty OSTs will 
> attract all new files, which might be hot and direct your users' fire to 
> your new OSS only.
> 
> Regards,
> Thomas
> 
>> On 1/8/24 15:38, Backer via lustre-discuss wrote:
>> Hi,
>> Good morning and happy new year!
>> I have a quick question on extending a lustre file system. The extension is 
>> performed online. I am looking for any best practices or anything to 
>> watchout while doing the file system extension. The file system extension is 
>> done adding new OSS and many OSTs within these servers.
>> Really appreciate your help on this.
>> Regards,
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Building lustre on rocky 8.8 fails?

2024-01-06 Thread Andreas Dilger via lustre-discuss
Why not download the matching kernel and Lustre RPMs together?  I would 
recommend RHEL8 servers as the most stable, RHEL9 hasn't been run for very long 
as a Lustre server.
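
If prebuilt packages are an option, the published builds can be pulled straight from
the Whamcloud download area; a sketch (the directory path below is from memory and
should be matched to your distro/kernel by browsing downloads.whamcloud.com/public/lustre/):

dnf config-manager --add-repo \
    https://downloads.whamcloud.com/public/lustre/lustre-2.15.4/el8.9/server/
dnf install --nogpgcheck kmod-lustre kmod-lustre-osd-ldiskfs lustre lustre-osd-ldiskfs-mount
# a matching patched kernel rpm is published in the same directory tree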

On Jan 5, 2024, at 02:41, Jan Andersen mailto:j...@comind.io>> 
wrote:

Hi Xinliang and Andreas,

Thanks for helping with this!

I tried out your suggestions, and it compiled fine; however, things had become 
quite messy on the server, so I decided to reinstall the Rocky 8.8 and start 
over. Again, lustre built successfully, but for some reason, when you download 
the source code package from the repository, you get a slightly different 
version from the kernel's, with the result that lustre places its modules in a 
place the running kernel can't find.

So I built the kernel from source with the correct version, rebooted, cloned 
lustre again, ./configured etc, and now:

[root@mds lustre-release]# make
make  all-recursive
make[1]: Entering directory '/root/lustre-release'
Making all in ldiskfs
make[2]: Entering directory '/root/lustre-release/ldiskfs'
make[2]: *** No rule to make target 
'../ldiskfs/kernel_patches/series/ldiskfs-', needed by 'sources'.  Stop.
make[2]: Leaving directory '/root/lustre-release/ldiskfs'
make[1]: *** [autoMakefile:680: all-recursive] Error 1
make[1]: Leaving directory '/root/lustre-release'
make: *** [autoMakefile:546: all] Error 2

Which I don't quite understand, because I still have all the necessary packages 
and tools from before.

Before I barge ahead and try yet another permutation, do you have any advice so 
I might avoid problems? I can reinstall the OS, which will be a bit of a pain, 
but not that bad - but then which version; 8.8 or 8.9, or even 9? Or is there a 
simplish thing I can do to avoid all that?

/jan

On 03/01/2024 02:17, Xinliang Liu wrote:
On Wed, 3 Jan 2024 at 10:08, Xinliang Liu 
mailto:xinliang@linaro.org> 
> wrote:
   Hi Jan,
   On Tue, 2 Jan 2024 at 22:29, Jan Andersen 
mailto:j...@comind.io> > wrote:
   I have installed Rocky 8.8 on a new server (Dell PowerEdge R640):
   [root@mds 4.18.0-513.9.1.el8_9.x86_64]# cat /etc/*release*
   Rocky Linux release 8.8 (Green Obsidian)
   NAME="Rocky Linux"
   VERSION="8.8 (Green Obsidian)"
   ID="rocky"
   ID_LIKE="rhel centos fedora"
   VERSION_ID="8.8"
   PLATFORM_ID="platform:el8"
   PRETTY_NAME="Rocky Linux 8.8 (Green Obsidian)"
   ANSI_COLOR="0;32"
   LOGO="fedora-logo-icon"
   CPE_NAME="cpe:/o:rocky:rocky:8:GA"
    HOME_URL="https://rockylinux.org/"
    BUG_REPORT_URL="https://bugs.rockylinux.org/"
   SUPPORT_END="2029-05-31"
   ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
   ROCKY_SUPPORT_PRODUCT_VERSION="8.8"
   REDHAT_SUPPORT_PRODUCT="Rocky Linux"
   REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
   Rocky Linux release 8.8 (Green Obsidian)
   Rocky Linux release 8.8 (Green Obsidian)
   Derived from Red Hat Enterprise Linux 8.8
   Rocky Linux release 8.8 (Green Obsidian)
   cpe:/o:rocky:rocky:8:GA
   I downloaded the kernel source (I don't remember the exact command):
   [root@mds 4.18.0-513.9.1.el8_9.x86_64]# ll /usr/src/kernels
   total 8
   drwxr-xr-x. 24 root root 4096 Jan  2 13:49 4.18.0-513.9.1.el8_9.x86_64/
   drwxr-xr-x. 23 root root 4096 Jan  2 11:41 
4.18.0-513.9.1.el8_9.x86_64+debug/
   Copied the config from /boot and ran:
   yes "" | make oldconfig
   After that I cloned the Lustre source and configured (according to my 
notes):
   git clone git://git.whamcloud.com/fs/lustre-release.git 

   cd lustre-release
   git checkout 2.15.3
   dnf install libtool
   dnf install flex
   dnf install bison
   dnf install openmpi-devel
   dnf install python3-devel
   dnf install python3
   dnf install kernel-devel kernel-headers
   dnf install elfutils-libelf-devel
   dnf install keyutils keyutils-libs-devel
   dnf install libmount
   dnf --enablerepo=powertools install libmount-devel
   dnf install libnl3 libnl3-devel
   dnf config-manager --set-enabled powertools
   dnf install libyaml-devel
   dnf install patch
   dnf install e2fsprogs-devel
   dnf install kernel-core
   dnf install kernel-modules
   dnf install rpm-build
   dnf config-manager --enable devel
   dnf config-manager --enable powertools
   dnf config-manager --set-enabled ha
   dnf install kernel-debuginfo
   sh autogen.sh
   ./configure
   This appeared to finish without errors:
   ...
   config.status: executing libtool commands
   CC:gcc
   LD:/usr/bin/ld -m elf_x86_64
   CPPFLAGS:  -include /root/lustre-release/undef.h -include 
/root/lustre-release/config.h -I/root/lustre-release/lnet/include/uapi 
-I/root/lustre-release/lustre/

Re: [lustre-discuss] Error: GPG check FAILED when trying to install e2fsprogs

2024-01-03 Thread Andreas Dilger via lustre-discuss
Sorry, those packages are not signed, you'll just have to install them without 
a signature. 
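
If dnf still refuses because of the repository gpgcheck setting, one option (a 
sketch, assuming the RPMs are already downloaded into the current directory) is 
to skip signature verification for just that transaction:

   dnf install --nogpgcheck ./e2fsprogs-*.rpm ./e2fsprogs-libs-*.rpm \
       ./libcom_err-*.rpm ./libss-*.rpm

or set "gpgcheck=0" in the .repo file that was added for the e2fsprogs repo.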

Cheers, Andreas

> On Jan 3, 2024, at 09:10, Jan Andersen  wrote:
> 
> I have finally managed to build the lustre rpms, but when I try to install 
> them with:
> 
> dnf install ./*.rpm
> 
> I get a list of errors like
> 
> ... nothing provides ldiskfsprogs >= 1.44.3.wc1 ...
> 
> In a previous communication I was advised that:
> 
> You may need to add ldiskfsprogs rpm repo and enable ha and powertools repo
> first.
> 
> sudo dnf config-manager --add-repo 
> https://downloads.whamcloud.com/public/e2fsprogs/latest/el8/
> sudo dnf config-manager --set-enabled ha
> sudo dnf config-manager --set-enabled powertools
> 
> However, when I try to install e2fsprogs:
> 
> Package e2fsprogs-1.47.0-wc6.el8.x86_64.rpm is not signed
> Package e2fsprogs-devel-1.47.0-wc6.el8.x86_64.rpm is not signed
> Package e2fsprogs-libs-1.47.0-wc6.el8.x86_64.rpm is not signed
> Package libcom_err-1.47.0-wc6.el8.x86_64.rpm is not signed
> Package libcom_err-devel-1.47.0-wc6.el8.x86_64.rpm is not signed
> Package libss-1.47.0-wc6.el8.x86_64.rpm is not signed
> The downloaded packages were saved in cache until the next successful 
> transaction.
> You can remove cached packages by executing 'dnf clean packages'.
> Error: GPG check FAILED
> 
> And now I'm stuck with that - I imagine I need to add some appropriate GPG 
> key; where can I find that?
> 
> /jan
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Building lustre on rocky 8.8 fails?

2024-01-02 Thread Andreas Dilger via lustre-discuss
Try 2.15.4, as it may fix the EL8.8 build issue. 

Cheers, Andreas

> On Jan 2, 2024, at 07:30, Jan Andersen  wrote:
> 
> I have installed Rocky 8.8 on a new server (Dell PowerEdge R640):
> 
> [root@mds 4.18.0-513.9.1.el8_9.x86_64]# cat /etc/*release*
> Rocky Linux release 8.8 (Green Obsidian)
> NAME="Rocky Linux"
> VERSION="8.8 (Green Obsidian)"
> ID="rocky"
> ID_LIKE="rhel centos fedora"
> VERSION_ID="8.8"
> PLATFORM_ID="platform:el8"
> PRETTY_NAME="Rocky Linux 8.8 (Green Obsidian)"
> ANSI_COLOR="0;32"
> LOGO="fedora-logo-icon"
> CPE_NAME="cpe:/o:rocky:rocky:8:GA"
> HOME_URL="https://rockylinux.org/";
> BUG_REPORT_URL="https://bugs.rockylinux.org/";
> SUPPORT_END="2029-05-31"
> ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
> ROCKY_SUPPORT_PRODUCT_VERSION="8.8"
> REDHAT_SUPPORT_PRODUCT="Rocky Linux"
> REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
> Rocky Linux release 8.8 (Green Obsidian)
> Rocky Linux release 8.8 (Green Obsidian)
> Derived from Red Hat Enterprise Linux 8.8
> Rocky Linux release 8.8 (Green Obsidian)
> cpe:/o:rocky:rocky:8:GA
> 
> I downloaded the kernel source (I don't remember the exact command):
> 
> [root@mds 4.18.0-513.9.1.el8_9.x86_64]# ll /usr/src/kernels
> total 8
> drwxr-xr-x. 24 root root 4096 Jan  2 13:49 4.18.0-513.9.1.el8_9.x86_64/
> drwxr-xr-x. 23 root root 4096 Jan  2 11:41 4.18.0-513.9.1.el8_9.x86_64+debug/
> 
> Copied the config from /boot and ran:
> 
> yes "" | make oldconfig
> 
> After that I cloned the Lustre source and configured (according to my notes):
> 
> git clone git://git.whamcloud.com/fs/lustre-release.git
> cd lustre-release
> git checkout 2.15.3
> 
> dnf install libtool
> dnf install flex
> dnf install bison
> dnf install openmpi-devel
> dnf install python3-devel
> dnf install python3
> dnf install kernel-devel kernel-headers
> dnf install elfutils-libelf-devel
> dnf install keyutils keyutils-libs-devel
> dnf install libmount
> dnf --enablerepo=powertools install libmount-devel
> dnf install libnl3 libnl3-devel
> dnf config-manager --set-enabled powertools
> dnf install libyaml-devel
> dnf install patch
> dnf install e2fsprogs-devel
> dnf install kernel-core
> dnf install kernel-modules
> dnf install rpm-build
> dnf config-manager --enable devel
> dnf config-manager --enable powertools
> dnf config-manager --set-enabled ha
> dnf install kernel-debuginfo
> 
> sh autogen.sh
> ./configure
> 
> This appeared to finish without errors:
> 
> ...
> config.status: executing libtool commands
> 
> CC:gcc
> LD:/usr/bin/ld -m elf_x86_64
> CPPFLAGS:  -include /root/lustre-release/undef.h -include 
> /root/lustre-release/config.h -I/root/lustre-release/lnet/include/uapi 
> -I/root/lustre-release/lustre/include/uapi 
> -I/root/lustre-release/libcfs/include -I/root/lustre-release/lnet/utils/ 
> -I/root/lustre-release/lustre/include
> CFLAGS:-g -O2 -Wall -Werror
> EXTRA_KCFLAGS: -include /root/lustre-release/undef.h -include 
> /root/lustre-release/config.h  -g -I/root/lustre-release/libcfs/include 
> -I/root/lustre-release/libcfs/include/libcfs 
> -I/root/lustre-release/lnet/include/uapi -I/root/lustre-release/lnet/include 
> -I/root/lustre-release/lustre/include/uapi 
> -I/root/lustre-release/lustre/include -Wno-format-truncation 
> -Wno-stringop-truncation -Wno-stringop-overflow
> 
> Type 'make' to build Lustre.
> 
> However, when I run make:
> 
> [root@mds lustre-release]# make
> make  all-recursive
> make[1]: Entering directory '/root/lustre-release'
> Making all in ldiskfs
> make[2]: Entering directory '/root/lustre-release/ldiskfs'
> make[2]: *** No rule to make target 
> '../ldiskfs/kernel_patches/series/ldiskfs-', needed by 'sources'.  Stop.
> make[2]: Leaving directory '/root/lustre-release/ldiskfs'
> make[1]: *** [autoMakefile:649: all-recursive] Error 1
> make[1]: Leaving directory '/root/lustre-release'
> make: *** [autoMakefile:521: all] Error 2
> 
> Alternatively, I tried make rpms which results in:
> 
> ...
> rpmbuilddir=`mktemp -t -d rpmbuild-lustre-$USER-`; \
> make  \
>rpmbuilddir="$rpmbuilddir" rpm-local || exit 1; \
> cp ./rpm/* .; \
> /usr/bin/rpmbuild \
>--define "_tmppath $rpmbuilddir/TMP" \
>--define "_topdir $rpmbuilddir" \
>--define "dist %{nil}" \
>-ts lustre-2.15.3.tar.gz || exit 1; \
> cp $rpmbuilddir/SRPMS/lustre-2.15.3-*.src.rpm . || exit 1; \
> rm -rf $rpmbuilddir
> make[1]: Entering directory '/root/lustre-release'
> make[1]: Leaving directory '/root/lustre-release'
> error: line 239: Dependency tokens must begin with alpha-numeric, '_' or '/': 
> BuildRequires: %kernel_module_package_buildreqs
> make: *** [autoMakefile:1237: srpm] Error 1
> 
> 
> So, I'm stuck - it seems this is something I do a lot; how do I move forward 
> from here?
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Re: [lustre-discuss] Lustre server still try to recover the lnet reply to the depreciated clients

2023-12-08 Thread Andreas Dilger via lustre-discuss
If you are evicting a client by NID, then use the "nid:" keyword:

lctl set_param mdt.*.evict_client=nid:10.68.178.25@tcp

Otherwise it is expecting the input to be in the form of a client UUID (to allow
evicting a single export from a client mounting the filesystem multiple times).

That said, the client *should* be evicted by the server automatically, so it 
isn't
clear why this isn't happening.  Possibly this is something at the LNet level
(which unfortunately I don't know much about)? 
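
As a sketch of how to check and clean this up by hand (the fsname "data" and 
parameter paths below are assumptions based on the paths in your message):

   # list the client exports the MDT still knows about
   lctl get_param mdt.data-MDT*.exports.*.uuid

   # evict by NID (note the "nid:" prefix), or by a UUID printed above
   lctl set_param mdt.data-MDT0000.evict_client=nid:10.68.178.25@tcp
   lctl set_param mdt.data-MDT0000.evict_client=<client_uuid>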

Cheers, Andreas

> On Dec 6, 2023, at 13:23, Huang, Qiulan via lustre-discuss 
>  wrote:
> 
> 
> 
> Hello all,
> 
> 
> We removed some clients two weeks ago but we see the Lustre server is still 
> trying to handle the lnet recovery reply to those clients (the error log is 
> posted as below). And they are still listed in the exports dir.
> 
> 
> I tried to run  to evict the clients but failed with  the error "no exports 
> found"
> 
> lctl set_param mdt.*.evict_client=10.68.178.25@tcp
> 
> 
> Do you know how to clean up the removed the depreciated clients? Any 
> suggestions would be greatly appreciated.
> 
> 
> 
> For example:
> 
> [root@mds2 ~]# ll /proc/fs/lustre/mdt/data-MDT/exports/10.67.178.25@tcp/
> total 0
> -r--r--r-- 1 root root 0 Dec  5 15:41 export
> -r--r--r-- 1 root root 0 Dec  5 15:41 fmd_count
> -r--r--r-- 1 root root 0 Dec  5 15:41 hash
> -rw-r--r-- 1 root root 0 Dec  5 15:41 ldlm_stats
> -r--r--r-- 1 root root 0 Dec  5 15:41 nodemap
> -r--r--r-- 1 root root 0 Dec  5 15:41 open_files
> -r--r--r-- 1 root root 0 Dec  5 15:41 reply_data
> -rw-r--r-- 1 root root 0 Aug 14 10:58 stats
> -r--r--r-- 1 root root 0 Dec  5 15:41 uuid
> 
> 
> 
> 
> 
> /var/log/messages:Dec  6 12:50:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:20:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:20:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:35:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:35:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 
> 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.176.25@tcp) recovery failed with -111
> /var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 
> 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 3 previous 
> similar messages
> /var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 
> 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.176.25@tcp) recovery failed with -111
> /var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 
> 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 8 previous 
> similar messages
> /var/log/messages:Dec  6 15:02:14 mds2 kernel: LNetError: 
> 3817248:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.176.25@tcp) recovery failed with -111
> 
> 
> Regards,
> Qiulan
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud








Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

2023-12-07 Thread Andreas Dilger via lustre-discuss
Aurelien,
there have beeen a number of questions about this message.

> Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This is not marked LustreError, so it is just an advisory message.

This can sometimes be useful for debugging issues related to MDT->OST 
connections.
It is already printed with D_INFO level, so the lowest printk level available.
Would rewording the message make it more clear that this is a normal situation
when the MDT and OST are establishing connections?

Cheers, Andreas

On Dec 5, 2023, at 02:13, Aurelien Degremont  wrote:
> 
> > Now what is the messages about "deleting orphaned objects" ? Is it normal 
> > also ?
> 
> Yeah, this is kind of normal, and I'm even thinking we should lower the 
> message verbosity...
> Andreas, do you agree that could become a simple CDEBUG(D_HA, ...) instead of 
> LCONSOLE(D_INFO, ...)?
> 
> 
> Aurélien
> 
> Audet, Martin wrote on lundi 4 décembre 2023 20:26:
>> Hello Andreas,
>> 
>> Thanks for your response. Happy to learn that the "errors" I was reporting 
>> aren't really errors.
>> 
>> I now understand that the 3 messages about LDISKFS were only normal messages 
>> resulting from mounting the file systems (I was fooled by vim showing this 
>> message in red, like important error messages, but this is simply a false 
>> positive result of its syntax highlight rules probably triggered by the 
>> "errors=" string which is only a mount option...).
>> 
>> Now what is the messages about "deleting orphaned objects" ? Is it normal 
>> also ? We boot the clients VMs always after the server is ready and we 
>> shutdown clients cleanly well before the vlmf Lustre server is (also 
>> cleanly) shutdown. It is a sign of corruption ? How come this happen if 
>> shutdowns are clean ?
>> 
>> Thanks (and sorry for the beginners questions),
>> 
>> Martin
>> 
>> Andreas Dilger  wrote on December 4, 2023 5:25 AM:
>>> It wasn't clear from your rail which message(s) are you concerned about?  
>>> These look like normal mount message(s) to me. 
>>> 
>>> The "error" is pretty normal, it just means there were multiple services 
>>> starting at once and one wasn't yet ready for the other. 
>>> 
>>>  LustreError: 137-5: lustrevm-MDT_UUID: not available for 
>>> connect
>>>  from 0@lo (no target). If you are running an HA pair check that 
>>> the target
>>> is mounted on the other server.
>>> 
>>> It probably makes sense to quiet this message right at mount time to avoid 
>>> this. 
>>> 
>>> Cheers, Andreas
>>> 
 On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss 
  wrote:
 
 
 Hello Lustre community,
 
 Have someone ever seen messages like these on in "/var/log/messages" on a 
 Lustre server ?
 
 Dec  1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
 Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with 
 ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
 Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with 
 ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
 Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with 
 ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
 Dec  1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: 
 not available for connect from 0@lo (no target). If you are running an HA 
 pair check that the target is mounted on the other server.
 Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery 
 not enabled, recovery window 300-900
 Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan 
 objects from 0x0:227 to 0x0:513
 
 This happens on every boot on a Lustre server named vlfs (a AlmaLinux 8.9 
 VM hosted on a VMware) playing the role of both MGS and OSS (it hosts an 
 MDT two OST using "virtual" disks). We chose LDISKFS and not ZFS. Note 
 that this happens at every boot, well before the clients (AlmaLinux 9.3 or 
 8.9 VMs) connect and even when the clients are powered off. The network 
 connecting the clients and the server is a "virtual" 10GbE network (of 
 course there is no virtual IB). Also we had the same messages previously 
 with Lustre 2.15.3 using an AlmaLinux 8.8 server and AlmaLinux 8.8 / 9.2 
 clients (also using VMs). Note also that we compile ourselves the Lustre 
 RPMs from the sources from the git repository. We also chose to use a 
 patched kernel. Our build procedure for RPMs seems to work well because 
 our real cluster run fine on CentOS 7.9 with Lustre 2.12.9 and IB (MOFED) 
 networking.
 
 So has anyone seen these messages ?
 
 Are they problematic ? If yes, how do we avoid them ?
 
 We would like to make sure our small test system using VMs works well 
 before we upgrade our real cluster.
 
 Thanks in advance !
 

Re: [lustre-discuss] Lustre caching and NUMA nodes

2023-12-05 Thread Andreas Dilger via lustre-discuss

On Dec 4, 2023, at 15:06, John Bauer 
mailto:bau...@iodoctors.com>> wrote:

I have an OSC caching question.  I am running a dd process which writes an 
8GB file.  The file is on lustre, striped 8x1M. This is run on a system that 
has 2 NUMA nodes (cpu sockets). All the data is apparently stored on one NUMA 
node (node1 in the plot below) until node1 runs out of free memory.  Then it 
appears that dd comes to a stop (no more writes complete) until lustre dumps 
the data from the node1.  Then dd continues writing, but now the data is stored 
on the second NUMA node, node0.  Why does lustre go to the trouble of dumping 
node1 and then not use node1's memory, when there was always plenty of free 
memory on node0?

I'll forego the explanation of the plot.  Hopefully it is clear enough.  If 
someone has questions about what the plot is depicting, please ask.

https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x&dl=0

Hi John,
thanks for your detailed analysis.  It would be good to include the client 
kernel and Lustre version in this case, as the page cache behaviour can vary 
dramatically between different versions.

The allocation of the page cache pages may actually be out of the control of 
Lustre, since they are typically being allocated by the kernel VM affine to the 
core where the process that is doing the IO is running.  It may be that the 
"dd" is rescheduled to run on node0 during the IO, since the ptlrpcd threads 
will be busy processing all of the RPCs during this time, and then dd will 
start allocating pages from node0.

That said, it isn't clear why the client doesn't start flushing the dirty data 
from cache earlier?  Is it actually sending the data to the OSTs, but then 
waiting for the OSTs to reply that the data has been committed to the storage 
before dropping the cache?

It would be interesting to plot the osc.*.rpc_stats::write_rpcs_in_flight and 
::pending_write_pages to see if the data is already in flight.  The 
osd-ldiskfs.*.brw_stats on the server would also be useful to graph over the same 
period, if possible.
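
A simple way to capture that on the client while the dd is running would be 
something like this (a sketch; the parameter names are as in recent releases, 
adjust the interval and output file to taste):

   # sample write RPCs in flight, pending pages, and dirty bytes once a second
   while sleep 1; do
       echo "=== $(date +%s) ==="
       lctl get_param -n osc.*.rpc_stats | grep -E 'in flight|pending'
       lctl get_param osc.*.cur_dirty_bytes
   done > osc_sampling.log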

It *does* look like the "node1 dirty" is kept at a low value for the entire 
run, so it at least appears that RPCs are being sent, but there is no page 
reclaim triggered until memory is getting low.  Doing page reclaim is really 
the kernel's job, but it seems possible that the Lustre client may not be 
suitably notifying the kernel about the dirty pages and kicking it in the butt 
earlier to clean up the pages.

PS: my preference would be to just attach the image to the email instead of 
hosting it externally, since it is only 55 KB.  Is this blocked by the list 
server?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Debian 11: configure fails

2023-12-04 Thread Andreas Dilger via lustre-discuss
Which version of Lustre are you trying to build?

On Dec 4, 2023, at 05:48, Jan Andersen mailto:j...@comind.io>> 
wrote:

My system:

root@debian11:~/lustre-release# uname -r
5.10.0-26-amd64

Lustre:

git clone git://git.whamcloud.com/fs/lustre-release.git

I'm building the client with:

./configure --config-cache --disable-server --enable-client 
--with-linux=/usr/src/linux-headers-5.10.0-26-amd64 --without-zfs 
--disable-ldiskfs --disable-gss --disable-gss-keyring --disable-snmp 
--enable-modules
...
checking for /usr/src/nvidia-fs-2.18.3/config-host.h... no
./configure: line 97149: syntax error near unexpected token `LIBNL3,'
./configure: line 97149: `  PKG_CHECK_MODULES(LIBNL3, libnl-genl-3.0 >= 
3.1)'

root@debian11:~/lustre-release# dpkg -l | grep libnl-genl
ii  libnl-genl-3-200:amd64   3.4.0-1+b1 amd64   
 library for dealing with netlink sockets - generic netlink
ii  libnl-genl-3-dev:amd64   3.4.0-1+b1 amd64   
 development library and headers for libnl-genl-3

What is going wrong here - does configure require libnl-genl-3.0?
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

2023-12-04 Thread Andreas Dilger via lustre-discuss
It wasn't clear from your mail which message(s) you are concerned about.  These 
look like normal mount messages to me.

The "error" is pretty normal, it just means there were multiple services 
starting at once and one wasn't yet ready for the other.

    LustreError: 137-5: lustrevm-MDT_UUID: not available for connect
    from 0@lo (no target). If you are running an HA pair check that the target
    is mounted on the other server.

It probably makes sense to quiet this message right at mount time to avoid this.

Cheers, Andreas

On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss 
 wrote:



Hello Lustre community,


Has anyone ever seen messages like these in "/var/log/messages" on a 
Lustre server?


Dec  1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with ordered 
data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with ordered 
data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with ordered 
data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: not 
available for connect from 0@lo (no target). If you are running an HA pair 
check that the target is mounted on the other server.
Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery not 
enabled, recovery window 300-900
Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan objects 
from 0x0:227 to 0x0:513


This happens on every boot on a Lustre server named vlfs (an AlmaLinux 8.9 VM 
hosted on VMware) playing the role of both MGS and OSS (it hosts an MDT and two 
OSTs using "virtual" disks). We chose LDISKFS and not ZFS. Note that this 
happens at every boot, well before the clients (AlmaLinux 9.3 or 8.9 VMs) 
connect and even when the clients are powered off. The network connecting the 
clients and the server is a "virtual" 10GbE network (of course there is no 
virtual IB). Also we had the same messages previously with Lustre 2.15.3 using 
an AlmaLinux 8.8 server and AlmaLinux 8.8 / 9.2 clients (also using VMs). Note 
also that we compile ourselves the Lustre RPMs from the sources from the git 
repository. We also chose to use a patched kernel. Our build procedure for RPMs 
seems to work well because our real cluster run fine on CentOS 7.9 with Lustre 
2.12.9 and IB (MOFED) networking.

So has anyone seen these messages ?

Are they problematic ? If yes, how do we avoid them ?

We would like to make sure our small test system using VMs works well before we 
upgrade our real cluster.

Thanks in advance !

Martin Audet

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OST is not mounting

2023-11-07 Thread Andreas Dilger via lustre-discuss
The OST went read-only because that is what happens when the block device 
disappears underneath it. That is a behavior of ext4 and other local 
filesystems as well. 

If you look in the console logs you would see SCSI errors and the filesystem 
being remounted read-only. 

To have reliability in the face of such storage issues you need to use 
dm-multipath. 

Cheers, Andreas

> On Nov 5, 2023, at 09:13, Backer via lustre-discuss 
>  wrote:
> 
> - Why did OST become in this state after the write failure and was mounted 
> RO.  The write error was due to iSCSI target going offline and coming back 
> after a few seconds later. 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Possible change to "lfs find -size" default units?

2023-11-04 Thread Andreas Dilger via lustre-discuss
I've recently realized that "lfs find -size N" looks for files of N *bytes* by 
default, unlike regular find(1), which assumes 512-byte blocks by default if no 
units are given.

I'm wondering if it would be disruptive to users if the default unit for -size 
was changed to 512-byte blocks instead of bytes, or if it is most common to 
specify a unit suffix like "lfs find -size +1M" and the change would mostly be 
unnoticed?  I would add a 'c' suffix for compatibility with find(1) to allow 
specifying an exact number of chars (bytes).

On the other hand, possibly this would be *less* confusing for users that are 
already used to the behavior of regular "find"?
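
For illustration of the current behaviour (the mount point is an assumption):

   # today a bare number means bytes
   lfs find /mnt/testfs -size +1048576

   # with an explicit unit suffix the meaning is unambiguous under either default
   lfs find /mnt/testfs -size +1M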

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre-Manual on lfsck - non-existing entries?

2023-10-31 Thread Andreas Dilger via lustre-discuss
On Oct 31, 2023, at 13:12, Thomas Roth via lustre-discuss 
 wrote:
> 
> Hi all,
> 
> after starting an `lctl lfsck_start -A  -C -o` and the oi_scrub having 
> completed, I would check the layout scan as described in the Lustre manual, 
> "36.4.3.3. LFSCK status of layout via procfs", by
> 
> > lctl get_param -n mdd.FSNAME-MDT_target.lfsck_layout
> 
> Doesn't work, and inspection of 'ls /sys/fs/lustre/mdd/FSNAME-MDT/' shows:
> > ...
> > lfsck_async_windows
> > lfsck_speed_limit
> ...
> 
> as the only entries showing the string "lfsck".
> 
> > lctl lfsck_query -M FSNAME-MDT -t layout
> 
> does show some info, although it is not what the manual describes as output 
> of the `lctl get_param` command.
> 
> 
> Issue with the manual or issue with our Lustre?

Are you perhaps running "lctl get_param" as a non-root user?  One of the 
wonderful quirks of the kernel is that new parameters are not wanted in 
procfs, and "complex" parameters (more than one value) are not wanted 
in sysfs, so by necessity anything "complex" needs to go into 
debugfs (/sys/kernel/debug), which was changed at some point to be 
accessible only by root.

As such, you need to be root to access any of the "complex" parameters/stats:

  $ lctl get_param mdd.*.lfsck_layout
  error: get_param: param_path 'mdd/*/lfsck_layout': No such file or directory

  $ sudo lctl get_param mdd.*.lfsck_layout
  mdd.myth-MDT.lfsck_layout=
  name: lfsck_layout
  magic: 0xb1732fed
  version: 2
  status: completed
  flags:
  param: all_targets
  last_completed_time: 1694676243
  time_since_last_completed: 4111337 seconds
  latest_start_time: 1694675639
  time_since_latest_start: 4111941 seconds
  last_checkpoint_time: 1694676243
  time_since_last_checkpoint: 4111337 seconds
  latest_start_position: 12
  last_checkpoint_position: 4194304
  first_failure_position: 0
  success_count: 6
  repaired_dangling: 0
  repaired_unmatched_pair: 0
  repaired_multiple_referenced: 0
  repaired_orphan: 0
  repaired_inconsistent_owner: 0
  repaired_others: 0
  skipped: 0
  failed_phase1: 0
  failed_phase2: 0
  checked_phase1: 3791402
  checked_phase2: 0
  run_time_phase1: 595 seconds
  run_time_phase2: 8 seconds
  average_speed_phase1: 6372 items/sec
  average_speed_phase2: 0 objs/sec
  real_time_speed_phase1: N/A
  real_time_speed_phase2: N/A
  current_position: N/A

  $ sudo ls /sys/kernel/debug/lustre/mdd/myth-MDT/
  total 0
  0 changelog_current_mask  0 changelog_users  0 lfsck_namespace
  0 changelog_mask  0 lfsck_layout

Getting an update to the manual to clarify this requirement would be welcome.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] very slow mounts with OSS node down and peer discovery enabled

2023-10-26 Thread Andreas Dilger via lustre-discuss
I can't comment on the LNet peer discovery part, but I would definitely not 
recommend leaving the lnet_transaction_timeout that low for normal usage. This 
can cause messages to be dropped while the server is processing them and 
introduce failures needlessly.

Cheers, Andreas

> On Oct 26, 2023, at 09:48, Bertschinger, Thomas Andrew Hjorth via 
> lustre-discuss  wrote:
> 
> Hello,
> 
> Recently we had an OSS node down for an extended period with hardware 
> problems. While the node was down, mounting lustre on a client took an 
> extremely long time to complete (20-30 minutes). Once the fs is mounted, all 
> operations are normal and there isn't any noticeable impact from the absent 
> node.
> 
> While the client is mounting, the client's debug log shows entries like this 
> slowly going by:
> 
> 0020:0080:87.0:1698333195.993098:0:3801046:0:(obd_config.c:1384:class_process_config())
>  processing cmd: cf005
> 0020:0080:87.0:1698333195.993099:0:3801046:0:(obd_config.c:1396:class_process_config())
>  adding mapping from uuid 10.1.2.3@o2ib to nid 0x50abcd123 (10.1.2.4@o2ib)
> 
> and there is a "llog_process_th" kernel thread hanging in 
> lnet_discover_peer_locked().
> 
> We have peer discovery enabled on our clients, but disabling peer discovery 
> on a client causes the mount to complete quickly. Also, once the down OSS was 
> fixed and powered back on, mounting completed normally again.
> 
> We also found that reducing the following timeout sped up the mount by a 
> factor of ~10:
> 
> $ lnetctl set transaction_timeout 5# was 50 originally
> 
> Is such a dramatic slowdown normal in this situation? Is there any fix (aside 
> from disabling peer discovery or tuning down the timeout) that could speed up 
> mounts in case we have another OSS down in the future?
> 
> Lustre version (server and client): 2.15.3
> 
> Thanks, 
> Thomas Bertschinger
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] re-registration of MDTs and OSTs

2023-10-23 Thread Andreas Dilger via lustre-discuss
On Oct 18, 2023, at 13:04, Peter Grandi via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

So I have been upgrading my one and only MDT to a larger ZFS
pool, by the classic route of creating a new pool, new MDT, and
then 'zfs send'/zfs receive' for the copy over (BTW for those
who may not be aware 'zfs send' output can be put into a file to
do offline backups of a Lustre ZFS filesystem instance).

At first I just created an empty MGT on the new devices (on a
server with the same NID as the old MGS), with the assumption
that given that MDTs and OSTs have unique (filesystem instance,
target type, index number) triples, with NIDs being just the
address to find the MGS, or where they can be found, they would
just register themselves with the MGT on startup.

But I found that there was a complaint that they were in a
registered state, and the MGT did not have their registration
entries. I am not sure that is the purpose of that check. So
I just copied over the old MGT where they were registered, and
all was fine.

* Is there a way to re-register MDTs and OSTs belonging to a
 given filesystem instance into a new different MGT?

If you run the "writeconf" process documented in the Lustre Manual, the MDT(s)
and OST(s) will re-register themselves with the MGT.
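
Roughly, the sequence is (a sketch only, device names are assumptions; see the 
"Regenerating Lustre Configuration Logs" section of the manual for the full, 
authoritative steps):

   # with the filesystem stopped:
   mds# tunefs.lustre --writeconf /dev/mdtdev
   oss# tunefs.lustre --writeconf /dev/ostdev    # on each OST

   # then mount the MGT first, followed by the MDT(s) and then the OST(s),
   # and the targets will re-register with the (new) MGS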

* Is there a purpose to check whether MDTs and OSTs are or not
 registered in a given MGT?

Yes, this prevents the MDTs/OSTs from accidentally becoming part of a
different filesystem that might have been incorrectly formatted with the
same fsname (e.g. "lustre" has been used as the fsname more than
once).

* Is there a downside to register MDTs and OSTs in a different
 MGT from that which they were registered with initially?

Not too much.  The new MGT will not have any of the old configuration 
parameters, but running "writeconf" would also reset any "conf_param" 
parameters, so there is not much difference (writeconf will not reset 
"set_param -P" parameters, though).

My guess is that the MGT does not just contain the identities
and addresses of MDTs and OSTs of one or more filesystem
instance, but also a parameter list

If so, is there are way to dump the parameter for a filesystem
instance so it can be restored to a different MGT?

Yes, the "lctl --device MGS llog_print CONFIG_LOG" command
will dump all of the config commands for a particular MDT/OST
or the "params" log for "set_param -P".

The parameters can be restored from a file with "lctl set_param -F".

See the lctl-set_param.8 and lctl-llog_print.8 man pages for details.
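
For example (a sketch; the fsname "testfs" and file name are assumptions):

   # on the old MGS, dump a client config log and the persistent parameters
   oldmgs# lctl --device MGS llog_print testfs-client
   oldmgs# lctl --device MGS llog_print params > params.yaml

   # on the new MGS, re-apply the saved parameters
   newmgs# lctl set_param -F params.yaml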

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] setting quotas from within a container

2023-10-21 Thread Andreas Dilger via lustre-discuss
Hi Lisa,
The first question to ask is which Lustre version you are using?

Second, are you using subdirectory mounts or other UID/GID mapping for the 
container? That could happen either at the Lustre level or in the kernel itself.  
If you aren't sure, you could try creating a new file as root inside the 
container, then "ls -l" the file from outside the container to see if it is 
owned by root.
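
A quick check along those lines (the mount point is an assumption):

   # inside the container
   container# touch /lustre/rootcheck

   # outside the container, on the host
   host# ls -ln /lustre/rootcheck    # uid/gid 0 means root is not being remapped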

You could try running "strace lfs setquota" to see what operation the -EPERM = 
-1 error is coming from. 
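
Something like the following would show which system call is returning EPERM (a 
sketch; the user name, limits, and mount point are assumptions):

   strace -f -e trace=ioctl,quotactl \
       lfs setquota -u someuser -b 0 -B 10G -i 0 -I 100000 /lustre 2>&1 | tail -20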

The other important question is whether you really want to allow root inside 
the container to be able to set the quota, or whether this should be reserved 
for root outside the container?

Cheers, Andreas

> On Oct 21, 2023, at 09:18, Lisa Gerhardt via lustre-discuss 
>  wrote:
> 
> 
> Hello,
> I'm trying to set user quotas from within a container run as root. I can 
> successfully do things like "lfs setstripe", but "lfs setquota" fails with 
> 
> lfs setquota: quotactl failed: Operation not permitted
> setquota failed: Operation not permitted
> 
> I suspect it might have something to do with how the file system is mounted 
> in the container. I'm wondering if anyone has any experience with this or if 
> someone could point me to some documentation to help me understand what 
> "setquota" is doing differently from "setstripe" to see where things are 
> going off the rails.
> 
> Thanks,
> Lisa
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] mount not possible: "no server support"

2023-10-19 Thread Andreas Dilger via lustre-discuss


On Oct 19, 2023, at 19:58, Benedikt Alexander Braunger via lustre-discuss 
mailto:lustre-discuss@lists.Lustre.org>> wrote:

Hi Lustrers,

I'm currently struggling with an unmountable Lustre filesystem. The client only 
says "no server support", no further logs on client or server.
I first thought this might be related to the usage of fscrypt but I already 
recreated the whole filesystem from scratch and the error still persists.
Now I have no more idea what to look for.

Here the full CLI log:

[root@dstorsec01vl]# uname -a
Linux dstorsec01vl 5.14.0-284.11.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 
12 10:45:03 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

[root@dstorsec01vl]# modprobe lnet
[root@dstorsec01vl]# modprobe lustre
[root@dstorsec01vl]# lnetctl ping pstormgs01@tcp
ping:
- primary nid: 10.106.104.160@tcp
  Multi-Rail: False
  peer ni:
- nid: 10.106.104.160@tcp

[root@dstorsec01vl]# mount -t lustre pstormgs01@tcp:sif0 /mnt/
mount.lustre: cannot mount pstormgs01@tcp:sif0: no server support

It looks like this is failing because the mount device is missing ":/" in it, 
which mount.lustre uses to decide whether this is a client or server 
mountpoint.  You should be using:

client# mount -t lustre pstormgs01@tcp:/sif0 /mnt/sif0

and this should work.  It probably makes sense to improve the error message to 
be more clear, like:

   mount.lustre: cannot mount block device 'pstormgs01@tcp:sif0': no server 
support

or similar

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] backup restore docs not quite accurate?

2023-10-18 Thread Andreas Dilger via lustre-discuss
Removing the OI files is for ldiskfs backup/restore (e.g. after tar/untar), when 
the inode numbers are changed. That is not needed for ZFS send/recv because the 
inode numbers stay the same after such an operation.

If that isn't clear in the manual it should be fixed. 
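
In other words, the steps you quoted only apply when the MDT is mounted as 
ldiskfs after a file-level (tar-style) restore, roughly like this (a sketch, 
device and mount point names are assumptions):

   mds# mount -t ldiskfs /dev/mdtdev /mnt/mdt
   mds# cd /mnt/mdt
   mds# rm -rf oi.16* lfsck_* LFSCK
   mds# rm -f CATALOGS

After a ZFS send/receive the OI files are still valid and should be left alone.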

Cheers, Andreas

> On Oct 18, 2023, at 23:42, Peter Grandi via lustre-discuss 
>  wrote:
> 
> I was asking this on the Slack channel but as I was typing it
> looks too complicated for chat, so here:
> 
> Lustre 2.15.2, EL8, MGT, MDT and OSTs on ZFS.
> 
> I am trying to copy the MGT and the MDT to larger zpools, so I
> created them on another server, and used 'zfs send' to copy them
> while the Lustre instance was frozen (for the last incremental).
> 
> The I have put the new zpool drives in the old server with same
> NID, and I following the "Operations Manual" here:
> 
> https://doc.lustre.org/lustre_manual.xhtml#backup_fs_level.restore
> "Remove old OI and LFSCK files.[oss]# rm -rf oi.16* lfsck_* LFSCK
> Remove old CATALOGS. [oss]# rm -f CATALOGS"
> 
> But I am getting a lot of error when removing "oi.16*", of the
> "directory not empty" sort. For example "cannot remove
> 'oi.16/0x200011b90:0xabe1:0x0': Directory not empty"
> 
> Please suggest some options.
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] [BULK] Re: Ongoing issues with quota

2023-10-18 Thread Andreas Dilger via lustre-discuss
The zero-length objects are created for the file stripes. If the MDT inodes 
were deleted, but something went wrong with the MDT before the OST objects were 
deleted, then the objects would be left behind.

If the objects are in lost+found with the FID as the filename, then the file 
itself is almost certainly already deleted, so fid2path would just return the 
file in lost+found. 

I don't think there would be any problem with deleting them.

Cheers, Andreas

> On Oct 18, 2023, at 08:30, Daniel Szkola  wrote:
> 
> In this case almost all, if not all, of the files look a lot like this:
> 
> -r 1 someuser   somegroup 0 Dec 31  1969 
> '[0x200012392:0xe0ad:0x0]-R-0’
> 
> stat shows:
> 
> # stat [0x200012392:0xe0ad:0x0]-R-0
>  File: [0x200012392:0xe0ad:0x0]-R-0
>  Size: 0 Blocks: 1  IO Block: 4194304 regular empty file
> Device: a75b4da0h/2807778720dInode: 144116440360870061  Links: 1
> Access: (0400/-r)  Uid: (43667/  someuser)   Gid: ( 9349/somegroup)
> Access: 1969-12-31 18:00:00.0 -0600
> Modify: 1969-12-31 18:00:00.0 -0600
> Change: 1969-12-31 18:00:00.0 -0600
> Birth: 2023-01-11 13:01:40.0 -0600
> 
> Not sure what these were or how they ended up in lost+found. I took this 
> lustre fs over from folks who have moved on and I’m still trying to wrap my 
> head around some of the finer details. In a normal linux fs, usually, not 
> always, the blocks will have data in them. These are all zero-length. My 
> inclination is to see if I can delete them and be done with it, but I’m a bit 
> paranoid.
> 
> —
> Dan Szkola
> FNAL
> 
> 
> 
> 
> 
>> On Oct 17, 2023, at 4:23 PM, Andreas Dilger  wrote:
>> 
>> The files reported in .lustre/lost+found *ARE* the objects on the OSTs (at 
>> least when accessed through a Lustre mountpoint, not if accessed directly on 
>> the MDT mounted as ldiskfs), so when they are deleted the space on the OSTs 
>> will be freed.
>> 
>> As for identification, the OST objects do not have any name information, but 
>> they should have UID/GID/PROJID and timestamps that might help 
>> identification.
>> 
>> Cheers, Andreas
>> 
 On Oct 18, 2023, at 03:42, Daniel Szkola  wrote:
>>> 
>>> OK, so I did find the hidden .lustre directory (thanks Darby) and there are 
>>> many, many files in the lost+found directory. I can run ’stat’ on them and 
>>> get some info. Is there anything else I can do to tell what these were? Is 
>>> it safe to delete them? Is there anyway to tell if there are matching files 
>>> on the OST(s) that also need to be deleted?
>>> 
>>> —
>>> Dan Szkola
>>> FNAL 
>>> 
 On Oct 10, 2023, at 3:44 PM, Vicker, Darby J. (JSC-EG111)[Jacobs 
 Technology, Inc.]  wrote:
 
> I don’t have a .lustre directory at the filesystem root.
 
 It's there, but doesn't show up even with 'ls -a'.  If you cd into it or 
 ls it, it's there.  Lustre magic.  :)
 
 -Original Message-
 From: lustre-discuss >>> > on behalf of Daniel 
 Szkola via lustre-discuss >>> >
 Reply-To: Daniel Szkola mailto:dszk...@fnal.gov>>
 Date: Tuesday, October 10, 2023 at 2:30 PM
 To: Andreas Dilger mailto:adil...@whamcloud.com>>
 Cc: lustre >>> >
 Subject: [EXTERNAL] [BULK] Re: [lustre-discuss] Ongoing issues with quota
 
 
 CAUTION: This email originated from outside of NASA. Please take care when 
 clicking links or opening attachments. Use the "Report Message" button to 
 report suspicious messages to the NASA SOC.
 
 
 
 
 
 
 
 
 Hello Andreas,
 
 
 lfs df -i reports 19,204,412 inodes used. When I did the full robinhood 
 scan, it reported scanning 18,673,874 entries, so fairly close.
 
 
 I don’t have a .lustre directory at the filesystem root.
 
 
 Another interesting aspect of this particular issue is I can run lctl 
 lfsck and every time I get:
 
 
 layout_repaired: 1468299
 
 
 But it doesn’t seem to be actually repairing anything because if I run it 
 again, I’ll get the same or a similar number.
 
 
 I run it like this:
 lctl lfsck_start -t layout -t namespace -o -M lfsc-MDT
 
 
 —
 Dan Szkola
 FNAL
 
 
 
 
> On Oct 10, 2023, at 10:47 AM, Andreas Dilger  > wrote:
> 
> There is a $ROOT/.lustre/lost+found that you could check.
> 
> What does "lfs df -i" report for the used inode count? Maybe it is RBH 
> that is reporting the wrong count?
> 
> The other alternative would be to mount the MDT filesystem directly as 
> type ZFS and see what df -i and find report?
> 
> Cheers, Andreas
> 
>> On Oct 10, 2023, at 22:16, Daniel Szkola via lustre-discuss 
>> > 

Re: [lustre-discuss] [EXTERNAL] [BULK] Re: Ongoing issues with quota

2023-10-17 Thread Andreas Dilger via lustre-discuss
The files reported in .lustre/lost+found *ARE* the objects on the OSTs (at 
least when accessed through a Lustre mountpoint, not if accessed directly on 
the MDT mounted as ldiskfs), so when they are deleted the space on the OSTs 
will be freed.

As for identification, the OST objects do not have any name information, but 
they should have UID/GID/PROJID and timestamps that might help identification.
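
For example (a sketch, client mount point assumed; the FID is the one from your 
earlier message, and the brackets need quoting to avoid shell expansion):

   client# ls -ln /mnt/lustre/.lustre/lost+found/MDT0000/
   client# lfs fid2path /mnt/lustre '[0x200012392:0xe0ad:0x0]'

For an orphan object, fid2path will typically just point back into lost+found, 
confirming that the original file is already gone.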

Cheers, Andreas

On Oct 18, 2023, at 03:42, Daniel Szkola 
mailto:dszk...@fnal.gov>> wrote:

OK, so I did find the hidden .lustre directory (thanks Darby) and there are 
many, many files in the lost+found directory. I can run ’stat’ on them and get 
some info. Is there anything else I can do to tell what these were? Is it safe 
to delete them? Is there anyway to tell if there are matching files on the 
OST(s) that also need to be deleted?

—
Dan Szkola
FNAL

On Oct 10, 2023, at 3:44 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] mailto:darby.vicke...@nasa.gov>> wrote:

I don’t have a .lustre directory at the filesystem root.

It's there, but doesn't show up even with 'ls -a'.  If you cd into it or ls it, 
it's there.  Lustre magic.  :)

-Original Message-
From: lustre-discuss 
mailto:lustre-discuss-boun...@lists.lustre.org>
 > on behalf of Daniel Szkola 
via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org> 
>
Reply-To: Daniel Szkola mailto:dszk...@fnal.gov> 
>
Date: Tuesday, October 10, 2023 at 2:30 PM
To: Andreas Dilger mailto:adil...@whamcloud.com> 
>
Cc: lustre 
mailto:lustre-discuss@lists.lustre.org> 
>
Subject: [EXTERNAL] [BULK] Re: [lustre-discuss] Ongoing issues with quota


CAUTION: This email originated from outside of NASA. Please take care when 
clicking links or opening attachments. Use the "Report Message" button to 
report suspicious messages to the NASA SOC.








Hello Andreas,


lfs df -i reports 19,204,412 inodes used. When I did the full robinhood scan, 
it reported scanning 18,673,874 entries, so fairly close.


I don’t have a .lustre directory at the filesystem root.


Another interesting aspect of this particular issue is I can run lctl lfsck and 
every time I get:


layout_repaired: 1468299


But it doesn’t seem to be actually repairing anything because if I run it 
again, I’ll get the same or a similar number.


I run it like this:
lctl lfsck_start -t layout -t namespace -o -M lfsc-MDT


—
Dan Szkola
FNAL




On Oct 10, 2023, at 10:47 AM, Andreas Dilger 
mailto:adil...@whamcloud.com> 
> wrote:

There is a $ROOT/.lustre/lost+found that you could check.

What does "lfs df -i" report for the used inode count? Maybe it is RBH that is 
reporting the wrong count?

The other alternative would be to mount the MDT filesystem directly as type ZFS 
and see what df -i and find report?

Cheers, Andreas

On Oct 10, 2023, at 22:16, Daniel Szkola via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org> 
> wrote:

OK, I disabled, waited for a while, then reenabled. I still get the same 
numbers. The only thing I can think is somehow the count is correct, despite 
the huge difference. Robinhood and find show about 1.7M files, dirs, and links. 
The quota is showing a bit over 3.1M inodes used. We only have one MDS and MGS. 
Any ideas where the discrepancy may lie? Orphans? Is there a lost+found area in 
lustre?

—
Dan Szkola
FNAL


On Oct 10, 2023, at 8:24 AM, Daniel Szkola 
mailto:dszk...@fnal.gov> > wrote:

Hi Robert,

Thanks for the response. Do you remember exactly how you did it? Did you bring 
everything down at any point? I know you can do this:

lctl conf_param fsname.quota.mdt=none

but is that all you did? Did you wait or bring everything down before 
reenabling? I’m worried because that allegedly just enables/disables 
enforcement and space accounting is always on. Andreas stated that quotas are 
controlled by ZFS, but there has been no quota support enabled on any of the 
ZFS volumes in our lustre filesystem.

—
Dan Szkola
FNAL

On Oct 10, 2023, at 2:17 AM, Redl, Robert 
mailto:robert.r...@lmu.de> > 
wrote:

Dear Dan,

I had a similar problem some time ago. We are also using ZFS for MDT and OSTs. 
For us, the used disk space was reported wrong. The problem was fixed by 
switching quota support off on the MGS and then on again.

Cheers,
Robert

Am 09.10.2023 um 17:55 schrieb Daniel Szkola via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org> 
>:

Thanks, I will look into the ZFS quota since we are using ZFS for all storage, 
MDT and OSTs.

In our case, there is a single MDS/MDT. I have used Robinhood and lfs find (by 
group) commands to verify what the numbers should apparently be.

—
Dan Szkola
FNAL

On Oct 9, 2

Re: [lustre-discuss] OSS on compute node

2023-10-13 Thread Andreas Dilger via lustre-discuss
On Oct 13, 2023, at 20:58, Fedele Stabile 
mailto:fedele.stab...@fis.unical.it>> wrote:

Hello everyone,
We are in the process of integrating Lustre into our little HPC cluster, and we 
would like to know if it is possible to use the same node in the cluster both as 
an OSS with disks and as a compute node with the Lustre client installed.
I know that the OSS server requires a modified kernel, so I suppose it could be 
installed in a virtual machine using KVM on a compute node.

There isn't really a problem with running a client + OSS on the same node 
anymore, nor is there a problem with an OSS running inside a VM (if you have 
SR-IOV and enough CPU+RAM to run the server).

*HOWEVER*, I don't think it would be good to have the client mounted on the *VM 
host* and then run the OSS in a *VM guest*.  That could lead to deadlocks and 
priority inversion if the client becomes busy but depends on the local OSS to 
flush dirty data from RAM, while the OSS cannot run in the VM because it doesn't 
have any RAM...

If the client and OSS are BOTH run in VMs, or neither run in VMs, or only the 
client run in a VM, then that should be OK, but may have reduced performance 
due to the server contending with the client application.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Ongoing issues with quota

2023-10-10 Thread Andreas Dilger via lustre-discuss
There is a $ROOT/.lustre/lost+found that you could check. 

What does "lfs df -i" report for the used inode count?  Maybe it is RBH that is 
reporting the wrong count?

The other alternative would be to mount the MDT filesystem directly as type ZFS 
and see what df -i and find report?  

Cheers, Andreas

> On Oct 10, 2023, at 22:16, Daniel Szkola via lustre-discuss 
>  wrote:
> 
> OK, I disabled, waited for a while, then reenabled. I still get the same 
> numbers. The only thing I can think is somehow the count is correct, despite 
> the huge difference. Robinhood and find show about 1.7M files, dirs, and 
> links. The quota is showing a bit over 3.1M inodes used. We only have one MDS 
> and MGS. Any ideas where the discrepancy may lie? Orphans? Is there a 
> lost+found area in lustre?
> 
> —
> Dan Szkola
> FNAL
> 
> 
>> On Oct 10, 2023, at 8:24 AM, Daniel Szkola  wrote:
>> 
>> Hi Robert,
>> 
>> Thanks for the response. Do you remember exactly how you did it? Did you 
>> bring everything down at any point? I know you can do this:
>> 
>> lctl conf_param fsname.quota.mdt=none
>> 
>> but is that all you did? Did you wait or bring everything down before 
>> reenabling? I’m worried because that allegedly just enables/disables 
>> enforcement and space accounting is always on. Andreas stated that quotas 
>> are controlled by ZFS, but there has been no quota support enabled on any of 
>> the ZFS volumes in our lustre filesystem.
>> 
>> —
>> Dan Szkola
>> FNAL
>> 
 On Oct 10, 2023, at 2:17 AM, Redl, Robert  wrote:
>>> 
>>> Dear Dan,
>>> 
>>> I had a similar problem some time ago. We are also using ZFS for MDT and 
>>> OSTs. For us, the used disk space was reported wrong. The problem was fixed 
>>> by switching quota support off on the MGS and then on again. 
>>> 
>>> Cheers,
>>> Robert
>>> 
 Am 09.10.2023 um 17:55 schrieb Daniel Szkola via lustre-discuss 
 :
 
 Thanks, I will look into the ZFS quota since we are using ZFS for all 
 storage, MDT and OSTs.
 
 In our case, there is a single MDS/MDT. I have used Robinhood and lfs find 
 (by group) commands to verify what the numbers should apparently be.
 
 —
 Dan Szkola
 FNAL
 
> On Oct 9, 2023, at 10:13 AM, Andreas Dilger  wrote:
> 
> The quota accounting is controlled by the backing filesystem of the OSTs 
> and MDTs.
> 
> For ldiskfs/ext4 you could run e2fsck to re-count all of the inode and 
> block usage. 
> 
> For ZFS you would have to ask on the ZFS list to see if there is some way 
> to re-count the quota usage. 
> 
> The "inode" quota is accounted from the MDTs, while the "block" quota is 
> accounted from the OSTs. You might be able to see with "lfs quota -v -g 
> group" to see if there is one particular MDT that is returning too many 
> inodes. 
> 
> Possibly if you have directories that are striped across many MDTs it 
> would inflate the used inode count. For example, if every one of the 426k 
> directories reported by RBH was striped across 4 MDTs then you would see 
> the inode count add up to 3.6M. 
> 
> If that was the case, then I would really, really advise against striping 
> every directory in the filesystem.  That will cause problems far worse 
> than just inflating the inode quota accounting. 
> 
> Cheers, Andreas
> 
>> On Oct 9, 2023, at 22:33, Daniel Szkola via lustre-discuss 
>>  wrote:
>> 
>> Is there really no way to force a recount of files used by the quota? 
>> All indications are we have accounts where files were removed and this 
>> is not reflected in the used file count in the quota. The space used 
>> seems correct but the inodes used numbers are way high. There must be a 
>> way to clear these numbers and have a fresh count done.
>> 
>> —
>> Dan Szkola
>> FNAL
>> 
>>> On Oct 4, 2023, at 11:37 AM, Daniel Szkola via lustre-discuss 
>>>  wrote:
>>> 
>>> Also, quotas on the OSTS don’t add up to near 3 million files either:
>>> 
>>> [root@lustreclient scratch]# ssh ossnode0 lfs quota -g somegroup -I 0 
>>> /lustre1
>>> Disk quotas for grp somegroup (gid 9544):
>>> Filesystem  kbytes   quota   limit   grace   files   quota   limit   
>>> grace
>>>  1394853459   0 1913344192   -  132863   0   0  
>>>  -
>>> [root@lustreclient scratch]# ssh ossnode0 lfs quota -g somegroup -I 1 
>>> /lustre1
>>> Disk quotas for grp somegroup (gid 9544):
>>> Filesystem  kbytes   quota   limit   grace   files   quota   limit   
>>> grace
>>>  1411579601   0 1963246413   -  120643   0   0  
>>>  -
>>> [root@lustreclient scratch]# ssh ossnode1 lfs quota -g somegroup -I 2 
>>> /lustre1
>>> Disk quotas for grp somegroup (gid 9544):
>>> Filesystem  kbytes   quota   limit   grace   files   quota   limit  

Re: [lustre-discuss] Ongoing issues with quota

2023-10-09 Thread Andreas Dilger via lustre-discuss
The quota accounting is controlled by the backing filesystem of the OSTs and 
MDTs.

For ldiskfs/ext4 you could run e2fsck to re-count all of the inode and block 
usage. 

For ZFS you would have to ask on the ZFS list to see if there is some way to 
re-count the quota usage. 

The "inode" quota is accounted from the MDTs, while the "block" quota is 
accounted from the OSTs. You might be able to see with "lfs quota -v -g group" 
to see if there is one particular MDT that is returning too many inodes. 
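
For example (group name and mount point taken from earlier in this thread):

   client# lfs quota -v -g somegroup /lustre1

prints a per-MDT and per-OST breakdown, so an MDT with an obviously inflated 
"files" count should stand out.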

Possibly if you have directories that are striped across many MDTs it would 
inflate the used inode count. For example, if every one of the 426k directories 
reported by RBH was striped across 4 MDTs then you would see the inode count 
add up to 3.6M. 

If that was the case, then I would really, really advise against striping every 
directory in the filesystem.  That will cause problems far worse than just 
inflating the inode quota accounting. 
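
One way to check whether directories are striped over multiple MDTs (paths here
are illustrative):

  lfs getdirstripe -D /lustre1          # default directory layout at the root
  lfs getdirstripe /lustre1/some/dir    # layout of one particular directory

A stripe count greater than 1 on ordinary directories would point to this.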

Cheers, Andreas

> On Oct 9, 2023, at 22:33, Daniel Szkola via lustre-discuss 
>  wrote:
> 
> Is there really no way to force a recount of files used by the quota? All 
> indications are we have accounts where files were removed and this is not 
> reflected in the used file count in the quota. The space used seems correct 
> but the inodes used numbers are way high. There must be a way to clear these 
> numbers and have a fresh count done.
> 
> —
> Dan Szkola
> FNAL
> 
>> On Oct 4, 2023, at 11:37 AM, Daniel Szkola via lustre-discuss 
>>  wrote:
>> 
>> Also, quotas on the OSTS don’t add up to near 3 million files either:
>> 
>> [root@lustreclient scratch]# ssh ossnode0 lfs quota -g somegroup -I 0 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   1394853459   0 1913344192   -  132863   0   0  
>>  -
>> [root@lustreclient scratch]# ssh ossnode0 lfs quota -g somegroup -I 1 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   1411579601   0 1963246413   -  120643   0   0  
>>  -
>> [root@lustreclient scratch]# ssh ossnode1 lfs quota -g somegroup -I 2 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   1416507527   0 1789950778   -  190687   0   0  
>>  -
>> [root@lustreclient scratch]# ssh ossnode1 lfs quota -g somegroup -I 3 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   1636465724   0 1926578117   -  195034   0   0  
>>  -
>> [root@lustreclient scratch]# ssh ossnode2 lfs quota -g somegroup -I 4 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   2202272244   0 3020159313   -  185097   0   0  
>>  -
>> [root@lustreclient scratch]# ssh ossnode2 lfs quota -g somegroup -I 5 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   1324770165   0 1371244768   -  145347   0   0  
>>  -
>> [root@lustreclient scratch]# ssh ossnode3 lfs quota -g somegroup -I 6 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   2892027349   0 3221225472   -  169386   0   0  
>>  -
>> [root@lustreclient scratch]# ssh ossnode3 lfs quota -g somegroup -I 7 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   2076201636   0 2474853207   -  171552   0   0  
>>  -
>> 
>> 
>> —
>> Dan Szkola
>> FNAL
>> 
 On Oct 4, 2023, at 8:45 AM, Daniel Szkola via lustre-discuss 
  wrote:
>>> 
>>> No combination of ossnodek runs has helped with this.
>>> 
>>> Again, robinhood shows 1796104 files for the group, an 'lfs find -G gid' 
>>> found 1796104 files as well.
>>> 
>>> So why is the quota command showing over 3 million inodes used?
>>> 
>>> There must be a way to force it to recount or clear all stale quota data 
>>> and have it regenerate it?
>>> 
>>> Anyone?
>>> 
>>> —
>>> Dan Szkola
>>> FNAL
>>> 
>>> 
 On Sep 27, 2023, at 9:42 AM, Daniel Szkola via lustre-discuss 
  wrote:
 
 We have a lustre filesystem that we just upgraded to 2.15.3, however this 
 problem has been going on for some time.
 
 The quota command shows this:
 
 Disk quotas for grp somegroup (gid 9544):
  Filesystemused   quota   limit   grace   files   quota   limit   grace
/lustre1  13.38T 40T 45T   - 

Re: [lustre-discuss] OST went back in time: no(?) hardware issue

2023-10-04 Thread Andreas Dilger via lustre-discuss
On Oct 3, 2023, at 16:22, Thomas Roth via lustre-discuss 
 wrote:
> 
> Hi all,
> 
> in our Lustre 2.12.5 system, we have "OST went back in time" after OST 
> hardware replacement:
> - hardware had reached EOL
> - we set `max_create_count=0` for these OSTs, searched for and migrated off 
> the files of these OSTs
> - formatted the new OSTs with `--replace` and the old indices
> - all OSTs are on ZFS
> - set the OSTs `active=0` on our 3 MDTs
> - moved in the new hardware, reused the old NIDs, old OST indices, mounted 
> the OSTs
> - set the OSTs `active=1`
> - ran `lfsck` on all servers
> - set `max_create_count=200` for these OSTs
> 
> Now the "OST went back in time" messages appeard in the MDS logs.
> 
> This doesn't quite fit the description in the manual. There were no crashes 
> or power losses. I cannot understand how which cache might have been lost.
> The transaction numbers quoted in the error are both large, eg. `transno 
> 55841088879 was previously committed, server now claims 4294992012`
> 
> What should we do? Give `lfsck` another try?

Nothing really to see here I think?

Did you delete LAST_RCVD during the replacement and the OST didn't know what 
transno was assigned to the last RPCs it sent?  The still-mounted clients have 
a record of this transno and are surprised that it was reset.  If you unmount 
and remount the clients the error would go away.

I'm not sure if the clients might try to preserve the next 55B RPCs in memory 
until the committed transno on the OST catches up, or if they just accept the 
new transno and get on with life?
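
If the messages are a nuisance, a client remount along these lines should clear
them (mount point, fsname and MGS NID here are illustrative):

  umount /mnt/lustre
  mount -t lustre mgsnode@tcp:/fsname /mnt/lustre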

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Failing build of lustre client on Debian 12

2023-10-04 Thread Andreas Dilger via lustre-discuss
On Oct 4, 2023, at 16:26, Jan Andersen mailto:j...@comind.io>> 
wrote:

Hi,

I've just successfully built the lustre 2.15.3 client on Debian 11 and need to 
do the same on Debian 12; however, configure fails with:

checking if Linux kernel was built with CONFIG_FHANDLE in or as module... no
configure: error:

Lustre fid handling requires that CONFIG_FHANDLE is enabled in your kernel.



As far as I can see, CONFIG_FHANDLE is in fact enabled - eg:

root@debian12:~/lustre-release# grep CONFIG_FHANDLE /boot/config-6.1.38
CONFIG_FHANDLE=y

I've tried to figure out how configure checks for this, but the script is 
rather dense and I haven't penetrated it (yet). It seems to me there is an 
error in the way it checks. What is the best way forward, considering that I've 
already invested a lot of time and effort in setting up a slurm cluster with 
Debian 12?

You could change the AC_MSG_ERROR() to AC_MSG_WARN() or similar, if you
think the check is wrong.  It would be worthwhile to check if a patch has 
already been submitted to fix this on the master branch.  Otherwise, getting a 
proper patch submitted to fix the check would be better than just ignoring the 
error and leaving it for the next person to fix.
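
As a starting point for such a patch (a sketch only — the directory layout and
the macro holding the CONFIG_FHANDLE test may differ between branches):

  cd lustre-release
  grep -rn CONFIG_FHANDLE config/ lustre/autoconf/
  # change the failing branch from AC_MSG_ERROR to AC_MSG_WARN in the matching
  # .m4 file (or fix how the kernel .config is located), then regenerate:
  sh autogen.sh && ./configure --disable-server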

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 2.12.6 to 2.15.3

2023-10-01 Thread Andreas Dilger via lustre-discuss
On Oct 1, 2023, at 00:36, Tung-Han Hsieh via lustre-discuss 
 wrote:
> I should apologize for replying late. Here I would like to clarify why in my 
> opinion the Lustre ldiskfs code is not self-contained.
> 
> In the past, to compile lustre with ldiskfs, we needed to patch Linux kernel 
> using the patches provided by Lustre source code. And YES, for the recent 
> Lustre versions, the necessary patches are fewer, and even OK without 
> applying any patches to Linux kernel.  However, there are another patches to 
> Lustre ldiskfs code, namely copying Linux kernel ext4fs code into Lustre 
> source tree, applying patches provided by Lustre, and make it becomes the 
> ldiskfs code, during the compile time.

Hello T.H.,
it is true that Lustre needs to patch the ldiskfs code in order to properly 
integrate with the ext4/jbd2 transaction handling, and to add some features 
that ext4 is lacking that Lustre depends on.  Ideally we could get these 
features integrated into the upstream ext4 code to avoid the need to patch it, 
or at least minimize the number of patches, but that is often difficult and 
time consuming and doesn't get done as often as anyone would want it to.

Lustre is an open-source project, and if you are depending on the Linux-5.4 
stable kernel branch for your servers, it would be welcome for you to submit 
patches to update the kernel patch series if there are issues arising with 
these patches with a new stable kernel, as other developers maintain the kernel 
patch series for the distros that they are interested in (primarily RHEL 
derivatives, but more recently Ubuntu).  

> Here is the compilation log indicating the patching procedure. To compile 
> lustre-2.15.3 with Linux-5.4.135, after running
> 
> ./configure --prefix=/opt/lustre --with-linux=/usr/src/linux-5.4.135 
> --with-o2ib=no --with-ldiskfsprogs=/opt/e2fs --enable-mpitests=no 
> 
> and then run "make". The log reads:
> 
> 
> rm -rf linux-stage linux sources trace
> mkdir -p linux-stage/fs/ext4 linux-stage/include/linux \
>  linux-stage/include/trace/events
> cp /usr/src/linux-5.4.135/fs/ext4/file.c 
> /usr/src/linux-5.4.135/fs/ext4/ioctl.c /usr/src/linux-5.4.135/fs/ext4/dir.c 
>  linux-stage/fs/ext4
> if test -n "" ; then \
> cp  linux-stage/include/linux; \
> fi
> if test -n "/usr/src/linux-5.4.135/include/trace/events/ext4.h" ; then \
> cp /usr/src/linux-5.4.135/include/trace/events/ext4.h 
> linux-stage/include/trace/events; \
> fi
> ln -s ../../ldiskfs/kernel_patches/patches linux-stage/patches
> ln -s ../../ldiskfs/kernel_patches/series/ldiskfs-5.4.136-ml.series 
> linux-stage/
> series
> cd linux-stage && quilt push -a -q
> Applying patch patches/rhel8/ext4-inode-version.patch
> Applying patch patches/linux-5.4/ext4-lookup-dotdot.patch
> Applying patch patches/suse15/ext4-print-inum-in-htree-warning.patch
> Applying patch patches/rhel8/ext4-prealloc.patch
> Applying patch patches/ubuntu18/ext4-osd-iop-common.patch
> Applying patch patches/ubuntu19/ext4-misc.patch
> Applying patch patches/rhel8/ext4-mballoc-extra-checks.patch
> Applying patch patches/linux-5.4/ext4-hash-indexed-dir-dotdot-update.patch
> Applying patch patches/linux-5.4/ext4-kill-dx-root.patch
> Applying patch patches/rhel7.6/ext4-mballoc-pa-free-mismatch.patch
> Applying patch patches/linux-5.4/ext4-data-in-dirent.patch
> Applying patch patches/rhel8/ext4-nocmtime.patch
> Applying patch patches/base/ext4-htree-lock.patch
> Applying patch patches/linux-5.4/ext4-pdirop.patch
> Applying patch patches/rhel8/ext4-max-dir-size.patch
> Applying patch 
> patches/rhel8/ext4-corrupted-inode-block-bitmaps-handling-patches.patch
> Applying patch 
> patches/linux-5.4/ext4-give-warning-with-dir-htree-growing.patch
> Applying patch patches/ubuntu18/ext4-jcb-optimization.patch
> Applying patch patches/linux-5.4/ext4-attach-jinode-in-writepages.patch
> Applying patch patches/rhel8/ext4-dont-check-before-replay.patch
> Applying patch 
> patches/rhel7.6/ext4-use-GFP_NOFS-in-ext4_inode_attach_jinode.patch
> Applying patch patches/rhel7.6/ext4-export-orphan-add.patch
> Applying patch patches/rhel8/ext4-export-mb-stream-allocator-variables.patch
> Applying patch patches/ubuntu19/ext4-iget-with-flags.patch
> Applying patch patches/linux-5.4/export-ext4fs-dirhash-helper.patch
> Applying patch patches/linux-5.4/ext4-misc.patch
> Applying patch patches/linux-5.4/ext4-simple-blockalloc.patch
> Applying patch patches/linux-5.4/ext4-xattr-disable-credits-check.patch
> Applying patch patches/base/ext4-no-max-dir-size-limit-for-iam-objects.patch
> Applying patch patches/rhel8/ext4-ialloc-uid-gid-and-pass-owner-down.patch
> Applying patch patches/base/ext4-projid-xattrs.patch
> Applying patch patches/linux-5.4/ext4-enc-flag.patch
> Applying patch patches/base/ext4-delayed-iput.patch
> Now at patch patches/base/ext4-delayed-iput.patch
> =

Re: [lustre-discuss] Adding lustre clients into the Debian

2023-10-01 Thread Andreas Dilger via lustre-discuss
On Oct 1, 2023, at 05:54, Arman Khalatyan via lustre-discuss 
 wrote:
> 
> Hello everyone,
> 
> We are in the process of integrating the Lustre client into Debian. Are there 
> any legal concerns or significant obstacles to this? We're curious why it 
> hasn't been included in the official Debian repository so far. There used to 
> be an old, unmaintained Lustre client branch dating back to around 2013.

I don't think there is any particular barrier to add a Debian Lustre package 
today.  As you wrote, there was some effort put into that in the past, but I 
think there wasn't someone with the time and Debian-specific experience to
maintain it and get it included into a release.  Lustre is GPLv2 licensed, with 
some LGPL user library API code, so there aren't any legal issues.

The Lustre CI system builds Ubuntu clients, and I'd be happy to see patches to 
improve the Debian packaging in the Lustre tree.  Most Lustre users are using 
RHEL derivatives, and secondarily Ubuntu, so there hasn't been anyone to work 
on specific Debian packaging in some time.  Thomas (CC'd) was most active on 
the Debian front, and might be able to comment on this in more detail, and 
hopefully can also help review the patches.

> You can check our wishlist on Debian here: https://bugs.debian.org/1053214
> 
> At AIP, one of our colleagues is responsible for maintaining Astropy, so we 
> have some experience with Debian.
> 
> I've also set up a CI system in our GitLab, which includes a simple build and 
> push to a public S3 bucket. This is primarily for testing purposes to see if 
> it functions correctly for others...

There is an automated build farm for Lustre, with patches submitted via Gerrit 
(https://wiki.lustre.org/Using_Gerrit), and while there aren't Debian 
build/test nodes to test the correctness of changes for Debian, this will at 
least avoid regressions with Ubuntu or, though unlikely, with RHEL.  My 
assumption would be that any Debian-specific changes would continue to work 
with Ubuntu?

Depending on how actively you want to build/test Lustre, it is also possible to 
configure your CI system to asynchronously follow patches under development in 
Gerrit to provide build and/or test feedback on the patches before they land.  
The contrib/scripts/gerrit_checkpatch.py script follows new patch submissions 
and runs code style reviews on patches after they are pushed, and posts 
comments back to Gerrit.

A step beyond this, once your CI system is working reliably, it is possible to 
also post review comments directly into the patches in Gerrit, as the Gerrit 
Janitor does (https://github.com/verygreen/lustretester/).  The 
gerrit_build-and-test-new.py script is derived from gerrit_checkpatch.py, but 
implements a more complex set of operations on each patch - static code 
analysis, build, test.  It will add both general patch comments on 
success/failure of the build/test, or comments on specific lines in the patch.  
In some cases, Gerrit Janitor will also give negative code reviews in cases 
when a newly-added regression test added by a patch is failing regularly in its 
testing.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 2.12.6 to 2.15.3

2023-09-28 Thread Andreas Dilger via lustre-discuss
On Sep 26, 2023, at 13:44, Audet, Martin via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Hello all,

I would appreciate if the community would give more attention to this issue 
because upgrading from 2.12.x to 2.15.x, two LTS versions, is something that we 
can expect many cluster admin will try to do in the next few months...

Who, in particular, is "the community"?

That term implies a collective effort, and I'd welcome feedback from your 
testing of the upgrade process.  It is definitely possible for an individual to 
install Lustre 2.12.9 on one or more VMs (or better, use a clone of your 
current server OS image), format a small test filesystem with the current 
configuration, copy some data into it, and then follow your planned process to 
upgrade to 2.15.3 (which should mostly just be "unmount everything, install new 
RPMs, mount").  That is prudent system administration to test the process in 
advance of changing your production system.
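
As a concrete sketch of such a dry run (every name and device below is an
assumption for a throw-away VM pair, not a production procedure):

  # client
  umount /mnt/testfs
  # server, still on 2.12.9: stop the targets, install the 2.15.3 RPMs
  # (RPMs downloaded from downloads.whamcloud.com into the current directory)
  umount /mnt/ost0 && umount /mnt/mdt
  yum install lustre-2.15.3-*.rpm kmod-lustre-2.15.3-*.rpm \
      kmod-lustre-osd-ldiskfs-*.rpm lustre-osd-ldiskfs-mount-*.rpm
  mount -t lustre /dev/vdb /mnt/mdt
  mount -t lustre /dev/vdc /mnt/ost0
  # client again: remount and verify the data that was copied in earlier
  mount -t lustre mgs@tcp:/testfs /mnt/testfs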

We ourselves plan to upgrade a small Lustre (production) system from 2.12.9 to 
2.15.3 in the next couple of weeks...

After seeing problems reports like this we start feeling a bit nervous...

The documentation for doing this major update appears to me as not very 
specific...

Patches with improvements to the process described in the manual are welcome.  
Please see https://wiki.lustre.org/Lustre_Manual_Changes for details on how to 
submit your contributions.

In this document for example, 
https://doc.lustre.org/lustre_manual.xhtml#upgradinglustre , the update process 
appears not so difficult and there is no mention of using "tunefs.lustre 
--writeconf" for this kind of update.

Or am I missing something ?

I think you are answering your own question here...  The documented upgrade 
process has no mention of running "writeconf", but it was run for an unknown 
reason. This introduced an unknown problem with the configuration files that 
prevented the target from mounting.

Then, rather than re-running writeconf to fix the configuration files, the 
entire MDT was copied to a new storage device (a large no-op IMHO, since any 
issue with the MDT config files would be copied along with it) and writeconf 
was run again to regenerate the configs, which could have been done just as 
easily on the original MDT.

So the relatively straightforward upgrade process was turned into a 
complicated process for no apparent reason.

There have been 2.12->2.15 upgrades done already in production without issues, 
and this is also tested continuously during development.  Of course there are a 
wide variety of different configurations, features, and hardware on which 
Lustre is run, and it isn't possible to test even a fraction of all 
configurations.  I don't think one problem report on the mailing list is an 
indication that there are fundamental issues with the upgrade process.

Cheers, Andreas

Thanks in advance for providing more tips for this kind of update.

Martin Audet

From: lustre-discuss 
mailto:lustre-discuss-boun...@lists.lustre.org>>
 on behalf of Tung-Han Hsieh via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>>
Sent: September 23, 2023 2:20 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 2.12.6 
to 2.15.3

Dear All,

Today we tried to upgrade Lustre file system from version 2.12.6 to 2.15.3. But 
after the work, we cannot mount MDT successfully. Our MDT is ldiskfs backend. 
The procedure of upgrade is

1. Install the new version of e2fsprogs-1.47.0
2. Install Lustre-2.15.3
3. After reboot, run: tunefs.lustre --writeconf /dev/md0

Then when mounting MDT, we got the error message in dmesg:

===
[11662.434724] LDISKFS-fs (md0): mounted filesystem with ordered data mode. 
Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[11662.584593] Lustre: 3440:0:(scrub.c:189:scrub_file_load()) chome-MDT: 
reset scrub OI count for format change (LU-16655)
[11666.036253] Lustre: MGS: Logs for fs chome were removed by user request.  
All servers must be restarted in order to regenerate the logs: rc = 0
[11666.523144] Lustre: chome-MDT: Imperative Recovery not enabled, recovery 
window 300-900
[11666.594098] LustreError: 3440:0:(mdd_device.c:1355:mdd_prepare()) 
chome-MDD: get default LMV of root failed: rc = -2
[11666.594291] LustreError: 
3440:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -2
[11666.594951] Lustre: Failing over chome-MDT
[11672.868438] Lustre: 3440:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ 
Request sent has timed out for slow reply: [sent 1695492248/real 1695492248]  
req@5dfd9b53 x1777852464760768/t0(0) 
o251->MGC192.168.32.240@o2ib@0@lo:26/25 lens 224/224 e 0 

Re: [lustre-discuss] No port 988?

2023-09-26 Thread Andreas Dilger via lustre-discuss
On Sep 26, 2023, at 06:12, Jan Andersen mailto:j...@comind.io>> 
wrote:

Hi,

I've built and installed lustre on two VirtualBoxes running Rocky 8.8 and 
formatted one as the MGS/MDS and the other as OSS, following a presentation 
from Oak Ridge National Laboratory: "Creating a Lustre Test System from Source 
with Virtual Machines" (sorry, no link; it was a while ago I downloaded them).

There are a number of such resources linked from the https://wiki.lustre.org/ 
front page.

I can mount the filesystems on the MDS, but when I try from the OSS, it just 
times out - from dmesg:

[root@oss1 log]# dmesg | grep -i lustre
[  564.028680] Lustre: Lustre: Build Version: 2.15.58_42_ga54a206
[  625.567672] LustreError: 15f-b: lustre-OST: cannot register this server 
with the MGS: rc = -110. Is the MGS running?
[  625.567767] LustreError: 1789:0:(tgt_mount.c:2216:server_fill_super()) 
Unable to start targets: -110
[  625.567851] LustreError: 1789:0:(tgt_mount.c:1752:server_put_super()) no obd 
lustre-OST
[  625.567894] LustreError: 1789:0:(tgt_mount.c:132:server_deregister_mount()) 
lustre-OST not registered
[  625.588244] Lustre: server umount lustre-OST complete
[  625.588251] LustreError: 1789:0:(tgt_mount.c:2365:lustre_tgt_fill_super()) 
Unable to mount  (-110)

Both 'nmap' and 'netstat -nap' show that there is nothing listening on port 988:

[root@mds ~]# netstat -nap | grep -i listen
tcp0  0 0.0.0.0:111 0.0.0.0:* LISTEN  1/systemd
tcp0  0 0.0.0.0:22  0.0.0.0:* LISTEN  806/sshd
tcp6   0  0 :::111  :::* LISTEN  1/systemd
tcp6   0  0 :::22   :::* LISTEN  806/sshd

What should be listening on 988?

The  MGS should be listening on port 988, running on the "mgsnode" that was 
specified at format time for the OSTs and MDTs.

It is possible to have the MGS and MDS share the same storage device for simple 
configurations, but in production they are usually running on separate devices 
so they can be started/stopped independently, even if they are running on the 
same server.
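
A quick way to check that the MGS is actually up and reachable (NIDs below are
illustrative):

  # on the MGS node: the port 988 listener only appears once LNet is started,
  # e.g. after the MGS/MDT target is mounted
  netstat -tlnp | grep 988
  lctl list_nids
  # on the OSS: ping the MGS NID that was given as --mgsnode at format time
  lctl ping 192.168.56.10@tcp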

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [BULK] Re: [EXTERNAL] Re: Data recovery with lost MDT data

2023-09-25 Thread Andreas Dilger via lustre-discuss
Probably using "stat" on each file is slow, since this is getting the file size 
from each OST object. You could try the "xstat" utility in the lustre-tests RPM 
(or build it directly) as it will only query the MDS for the requested 
attributes (owner at minimum).

Then you could split into per-date directories in a separate phase, if needed, 
run in parallel.

I can't suggest anything about the 13M entry directory, but it _should_ be much 
faster than 1 file per 30s even at that size. I suspect that the script is 
still doing something bad, since shell and GNU utilities are terrible for doing 
extra stat/cd/etc five times on each file that is accessed, renamed, etc.

You would be better off to use "find -print" to generate the pathnames and then 
operate on those, maybe with "xargs -P" and/or run multiple scripts in parallel 
on chunks of the file?
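
Roughly along these lines (a sketch only — paths, batch size and parallelism
are placeholders):

  src=/scratch-lustre/.lustre/lost+found/MDT0000   # adjust to the MDT in use
  list=/tmp/lostfound.list
  # one pass to collect owner/date/name for every entry, many stats in flight
  find "$src" -maxdepth 1 -type f -print0 |
      xargs -0 -n 256 -P 16 stat --format '%U %y %n' > "$list"
  # then drive the mkdir/mv phase from $list in batches, instead of re-statting
  # and moving one file at a time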

Cheers, Andreas

On Sep 25, 2023, at 17:34, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.]  wrote:


I’ve been attempting to move these lost+found files into subdirectories by user 
and date but I’m running into issues.

My initial attempt was to loop through each file in .lustre/lost+found, stat 
the file and then move it to a subdirectory.  However, both the stat and the 
move command are each taking about 30 seconds.  With 13 million files, this 
isn’t going to work as that would be about 25 years to organize the files. :)  
If it makes a difference, the bash script is simple:



source=/scratch-lustre/.lustre/lost+found/MDT
dest=/scratch-lustre/work

# Cd to the source
cd $source

# Loop through files
for file in * ; do
echo "file: $file"

echo "   stat-ing file"
time read -r user date time <<< $( stat --format "%U %y" $file )
echo "   user='$user' date='$date' time='$time'"

# Build the new destination in the user's directory
newdir=$dest/$user/lost+found/$date
echo "   newdir=$newdir"

echo "   checking/making direcotry"
if [ ! -d $newdir ] ; then
time mkdir -p $newdir
time chown ${user}: $newdir
fi

echo "   moving file"
time mv $file $newdir
done


I’m pretty sure the time to operate on these files is due to the very large 
number of files in a single directory.  But confirmation of that would be good. 
 I say that because I know too many files in a single directory can cause 
filesystem performance issues.  But also, the few files I have moved out of 
lost+found, I can stat and otherwise operate on very quickly.

My next attempt was to move each file into pre-created subdirectories (this 
eliminates the stat).  But again, this was serial and each move is 
(unsurprisingly) taking 30 seconds.  “Only” 12.5 years to move the files.  :)

I’m currently attempting to speed this up by getting the entire file list (all 
13e6 files) and moving groups of (10,000) files into a subdirectory at once 
(i.e. mv file1 file2 … fileN subdir1).  I’m hoping a group move is faster and 
more efficient than moving each file individually.  Seems like it should be.  
That attempt is running now and I’m waiting for the first command to complete 
so I don’t know if this is faster yet or not.  (More accurately, I’ve written 
this script in perl and I think the output is buffered, so I’m waiting for
the first output.)

Other suggestions welcome if you have ideas how to move these files into 
subdirectories more efficiently.


From: lustre-discuss  on behalf of 
"Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss" 

Reply-To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" 

Date: Monday, September 25, 2023 at 8:56 AM
To: Andreas Dilger 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] [BULK] Re: [EXTERNAL] Re: Data recovery with lost 
MDT data

Our lfsck finished.  It repair a lot and we have over 13 million files in 
lost+found to go through.  I'll be writing a script to move these to somewhere 
accessible by the users and grouped by owner and probably date too (trying not 
to get too many files in a single directory).  Thanks again for the help with 
this.


For the benefit of others, this is how we started our lfsck:

[root@hpfs-fsl-mds1 hpfs3-eg3]# lctl set_param printk=+lfsck
[root@hpfs-fsl-mds1 hpfs3-eg3]# lctl lfsck_start -M scratch-MDT -o
Started LFSCK on the device scratch-MDT: scrub layout namespace
[root@hpfs-fsl-mds1 hpfs3-eg3]#



It took most of the weekend to run.  Here are the results.



[root@hpfs-fsl-mds1 ~]# lctl lfsck_query -M scratch-MDT

layout_mdts_init: 0

layout_mdts_scanning-phase1: 0

layout_mdts_scanning-phase2: 0

layout_mdts_completed: 1

layout_mdts_failed: 0

layout_mdts_stopped: 0

layout_mdts_paused: 0

layout_mdts_crashed: 0

layout_mdts_partial: 0

layout_mdts_co-failed: 0

layout_mdts_co-stopped: 0

layout_mdts_co-paused: 0

layout_mdts_unknown: 0

layout_osts_init: 0

layout_osts_scanning-phase1: 0

layout_osts_scanning-phase2: 0

layout_osts_completed: 22

layout_osts_failed: 0

layout_osts_stopped: 0

Re: [lustre-discuss] [EXTERNAL EMAIL] Re: Lustre 2.15.3: patching the kernel fails

2023-09-22 Thread Andreas Dilger via lustre-discuss
On Sep 22, 2023, at 01:45, Jan Andersen mailto:j...@comind.io>> 
wrote:

Hi Andreas,

Thank you for your insightful reply. I didn't know Rocky; I see there's a 
version 9 as well - is ver 8 better, since it is more mature?

There is an el9.2 ldiskfs series that would likely also apply to the Rocky9.2 
kernel of the same version.  We are currently using el8.8 servers in production 
and I'm not sure how many people are using 9.2 yet.  On the client side, 
Debian/Ubuntu are widely used.

You mention zfs, which I really liked when I worked on Solaris, but when I 
tried it on Linux it seemed to perform poorly, but that was in Ubuntu; is it 
better in Redhat et al.?

I would think Ubuntu/Debian is working with ZFS better (and may even have ZFS 
.deb packages available in the distro, which RHEL will likely never have).  
It's true the ZFS performance is worse than ldiskfs, but can make it easier to 
use.  That is up to you.

Cheers, Andreas


/jan

On 21/09/2023 18:40, Andreas Dilger wrote:
The first question to ask is: what is your end goal?  If you just want to build 
only a client that is mounting to an existing server, then you can disable the 
server functionality:
./configure --disable-server
and it should build fine.
If you want to also build a server, and *really* want it to run Debian instead 
of eg. Rocky 8, then you could disable ldiskfs and use ZFS:
./configure --disable-ldiskfs
You need to have installed ZFS first (either pre-packaged or built yourself), 
but it is less kernel-specific than ldiskfs.
Cheers, Andreas
On Sep 21, 2023, at 10:35, Jan Andersen mailto:j...@comind.io>> 
wrote:

My system: Debian 11, kernel version 5.10.0-13-amd64; I have the following 
source code:

# ll /usr/src/
total 117916
drwxr-xr-x  2 root root  4096 Aug 21 09:19 linux-config-5.10/
drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-12-amd64/
drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-12-common/
drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-13-amd64/
drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-13-common/
drwxr-xr-x  4 root root  4096 Aug 11 09:59 linux-headers-5.10.0-24-amd64/
drwxr-xr-x  4 root root  4096 Aug 11 09:59 linux-headers-5.10.0-24-common/
drwxr-xr-x  4 root root  4096 Aug 21 09:19 linux-headers-5.10.0-25-amd64/
drwxr-xr-x  4 root root  4096 Aug 21 09:19 linux-headers-5.10.0-25-common/
lrwxrwxrwx  1 root root24 Jun 30  2022 linux-kbuild-5.10 -> 
../lib/linux-kbuild-5.10
-rw-r--r--  1 root root161868 Aug 16 21:52 linux-patch-5.10-rt.patch.xz
drwxr-xr-x 25 root root  4096 Jul 14 21:24 linux-source-5.10/
-rw-r--r--  1 root root 120529768 Aug 16 21:52 linux-source-5.10.tar.xz
drwxr-xr-x  2 root root  4096 Jan 30  2023 percona-server/
lrwxrwxrwx  1 root root28 Jul 29  2022 vboxhost-6.1.36 -> 
/opt/VirtualBox/src/vboxhost
lrwxrwxrwx  1 root root32 Apr 17 19:32 vboxhost-7.0.8 -> 
../share/virtualbox/src/vboxhost


I have downloaded the source code of lustre 2.15.3:

# git checkout 2.15.3
# git clone git://git.whamcloud.com/fs/lustre-release.git

- and I'm trying to build it, following https://wiki.lustre.org/Compiling_Lustre

I've got through 'autogen.sh' and 'configure' and most of 'make debs', but when 
it comes to patching:

cd linux-stage && quilt push -a -q
Applying patch patches/rhel8/ext4-inode-version.patch
Applying patch patches/linux-5.4/ext4-lookup-dotdot.patch
Applying patch patches/suse15/ext4-print-inum-in-htree-warning.patch
Applying patch patches/linux-5.8/ext4-prealloc.patch
Applying patch patches/ubuntu18/ext4-osd-iop-common.patch
Applying patch patches/linux-5.10/ext4-misc.patch
1 out of 4 hunks FAILED
Patch patches/linux-5.10/ext4-misc.patch does not apply (enforce with -f)
make[2]: *** [autoMakefile:645: sources] Error 1
make[2]: Leaving directory '/root/repos/lustre-release/ldiskfs'
make[1]: *** [autoMakefile:652: all-recursive] Error 1
make[1]: Leaving directory '/root/repos/lustre-release'
make: *** [autoMakefile:524: all] Error 2


My best guess is that it is because the running kernel version doesn't exactly 
match the kernel source tree, but I can't seem to find that version. Am I right 
- and if so, where would I go to download the right kernel tree?

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] Re: Data recovery with lost MDT data

2023-09-22 Thread Andreas Dilger via lustre-discuss
On Sep 21, 2023, at 16:06, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] mailto:darby.vicke...@nasa.gov>> wrote:

I knew an lfsck would identify the orphaned objects.  That’s great that it will 
move those objects to an area we can triage.  With ownership still intact (and 
I assume time stamps too), I think this will be helpful for at least some of 
the users to recover some of their data.  Thanks Andreas.

I do have another question.  Even with the MDT loss, the top level user 
directories on the file system are still showing current modification times.  I 
was a little surprised to see this – my expectation was that the most current 
time would be from the snapshot that we accidentally reverted to, 6/20/2023 in 
this case.  Does this make sense?

The timestamps of the directories are only stored on the MDT (unlike regular 
files which keep of the timestamp on both the MDT and OST).  Is it possible 
that users (or possibly recovered clients with existing mountpoints) have 
started to access the filesystem in the past few days since it was recovered, 
or an admin was doing something that would have caused the directories to be 
modified?


Is it possible you have a newer copy of the MDT than you thought?

[dvicker@dvicker ~]$ ls -lrt /ephemeral/ | tail
  4 drwx-- 2 abjuarez   abjuarez 4096 Sep 12 
13:24 abjuarez/
  4 drwxr-x--- 2 ksmith29   ksmith29 4096 Sep 13 
15:37 ksmith29/
  4 drwxr-xr-x55 bjjohn10   bjjohn10 4096 Sep 13 
16:36 bjjohn10/
  4 drwxrwx--- 3 cbrownsc   ccp_fast 4096 Sep 14 
12:27 cbrownsc/
  4 drwx-- 3 fgholiza   fgholiza 4096 Sep 18 
06:41 fgholiza/
  4 drwx-- 5 mtfoste2   mtfoste2 4096 Sep 19 
11:35 mtfoste2/
  4 drwx-- 4 abeniniabenini  4096 Sep 19 
15:33 abenini/
  4 drwx-- 9 pdetremp   pdetremp 4096 Sep 19 
16:49 pdetremp/
[dvicker@dvicker ~]$



From: Andreas Dilger mailto:adil...@whamcloud.com>>
Date: Thursday, September 21, 2023 at 2:33 PM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" 
mailto:darby.vicke...@nasa.gov>>
Cc: "lustre-discuss@lists.lustre.org" 
mailto:lustre-discuss@lists.lustre.org>>
Subject: [EXTERNAL] Re: [lustre-discuss] Data recovery with lost MDT data



In the absence of backups, you could try LFSCK to link all of the orphan OST 
objects into .lustre/lost+found (see lctl-lfsck_start.8 man page for details).

The data is still in the objects, and they should have UID/GID/PRJID assigned 
(if used) but they have no filenames.  It would be up to you to make e.g. 
per-user lost+found directories in their home directories and move the files 
where they could access them and see if they want to keep or delete the files.

How easy/hard this is to do depends on whether the files have any content that 
can help identify them.

There was a Lustre hackathon project to save the Lustre JobID in a "user.job" 
xattr on every object, exactly to help identify the provenance of files after 
the fact (regardless of whether there is corruption), but it only just landed 
to master and will be in 2.16. That is cold comfort, but would help in the 
future.
Cheers, Andreas


On Sep 20, 2023, at 15:34, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:
Hello,

We have recently accidentally deleted some of our MDT data.  I think its gone 
for good but looking for advice to see if there is any way to recover.  
Thoughts appreciated.

We run two LFS’s on the same set of hardware.  We didn’t set out to do this, 
but it kind of evolved.  The original setup was only a single filesystem and 
was all ZFS – MDT and OST’s.  Eventually, we had some small file workflows that 
we wanted to get better performance on.  To address this, we stood up another 
filesystem on the same hardware and used a an ldiskfs MDT.  However, since were 
already using ZFS, under the hood the storage device we build the ldisk MDT on 
comes from ZFS.  That gets presented to the OS as /dev/zd0.  We do a nightly 
backup of the MDT by cloning the ZFS dataset (this creates /dev/zd16, for 
whatever reason), snapshot the clone, mount that as ldiskfs, tar up the data 
and then destroy the snapshot and clone.  Well, occasionally this process gets 
interrupted, leaving the ZFS snapshot and clone hanging around.  This is where 
things go south.  Something happens that swaps the clone with the primary 
dataset.  ZFS says you’re working with the primary but its really the clone, 
and via versa.  This happened about a year ago and we caught it, were able to 
“zfs pro

Re: [lustre-discuss] Data recovery with lost MDT data

2023-09-21 Thread Andreas Dilger via lustre-discuss
In the absence of backups, you could try LFSCK to link all of the orphan OST 
objects into .lustre/lost+found (see lctl-lfsck_start.8 man page for details).

The data is still in the objects, and they should have UID/GID/PRJID assigned 
(if used) but they have no filenames.  It would be up to you to make e.g. 
per-user lost+found directories in their home directories and move the files 
where they could access them and see if they want to keep or delete the files.

How easy/hard this is to do depends on whether the files have any content that 
can help identify them.

There was a Lustre hackathon project to save the Lustre JobID in a "user.job" 
xattr on every object, exactly to help identify the provenance of files after 
the fact (regardless of whether there is corruption), but it only just landed 
to master and will be in 2.16. That is cold comfort, but would help in the 
future.

Cheers, Andreas

On Sep 20, 2023, at 15:34, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] via lustre-discuss  wrote:


Hello,

We have recently accidentally deleted some of our MDT data.  I think its gone 
for good but looking for advice to see if there is any way to recover.  
Thoughts appreciated.

We run two LFS’s on the same set of hardware.  We didn’t set out to do this, 
but it kind of evolved.  The original setup was only a single filesystem and 
was all ZFS – MDT and OST’s.  Eventually, we had some small file workflows that 
we wanted to get better performance on.  To address this, we stood up another 
filesystem on the same hardware and used an ldiskfs MDT.  However, since we were 
already using ZFS, under the hood the storage device we built the ldiskfs MDT on 
comes from ZFS.  That gets presented to the OS as /dev/zd0.  We do a nightly 
backup of the MDT by cloning the ZFS dataset (this creates /dev/zd16, for 
whatever reason), snapshot the clone, mount that as ldiskfs, tar up the data 
and then destroy the snapshot and clone.  Well, occasionally this process gets 
interrupted, leaving the ZFS snapshot and clone hanging around.  This is where 
things go south.  Something happens that swaps the clone with the primary 
dataset.  ZFS says you’re working with the primary but it’s really the clone, 
and vice versa.  This happened about a year ago and we caught it, were able to 
“zfs promote” to swap them back and move on.  More details on the ZFS and this 
mailing list here.

https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tcb8a3ef663db0031-M5a79e71768b20b2389efc4a4

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-June/018154.html

It happened again earlier this week but we didn’t remember to check this and, 
in an effort to get the backups going again, destroyed what we thought were the 
snapshot and clone.  In reality, we destroyed the primary dataset.  Even more 
unfortunately, the stale “snapshot” was about 3 months old.  This stale 
snapshot was also preventing our MDT backups from running so we don’t have 
those to restore from either.  (I know, we need better monitoring and alerting 
on this, we learned that lesson the hard way.  We had it in place after the 
June 2022 incident, it just wasn’t working properly.)  So at the end of the 
day, the data lives on the OST’s we just can’t access it due to the lost 
metadata.  Is there any chance at data recovery.  I don’t think so but want to 
explore any options.

Darby

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.15.3: patching the kernel fails

2023-09-21 Thread Andreas Dilger via lustre-discuss
The first question to ask is: what is your end goal?  If you just want to build 
only a client that is mounting to an existing server, then you can disable the 
server functionality:

./configure --disable-server

and it should build fine.

If you want to also build a server, and *really* want it to run Debian instead 
of eg. Rocky 8, then you could disable ldiskfs and use ZFS:

./configure --disable-ldiskfs

You need to have installed ZFS first (either pre-packaged or built yourself), 
but it is less kernel-specific than ldiskfs. 

Cheers, Andreas

> On Sep 21, 2023, at 10:35, Jan Andersen  wrote:
> 
> My system: Debian 11, kernel version 5.10.0-13-amd64; I have the following 
> source code:
> 
> # ll /usr/src/
> total 117916
> drwxr-xr-x  2 root root  4096 Aug 21 09:19 linux-config-5.10/
> drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-12-amd64/
> drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-12-common/
> drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-13-amd64/
> drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-13-common/
> drwxr-xr-x  4 root root  4096 Aug 11 09:59 linux-headers-5.10.0-24-amd64/
> drwxr-xr-x  4 root root  4096 Aug 11 09:59 linux-headers-5.10.0-24-common/
> drwxr-xr-x  4 root root  4096 Aug 21 09:19 linux-headers-5.10.0-25-amd64/
> drwxr-xr-x  4 root root  4096 Aug 21 09:19 linux-headers-5.10.0-25-common/
> lrwxrwxrwx  1 root root24 Jun 30  2022 linux-kbuild-5.10 -> 
> ../lib/linux-kbuild-5.10
> -rw-r--r--  1 root root161868 Aug 16 21:52 linux-patch-5.10-rt.patch.xz
> drwxr-xr-x 25 root root  4096 Jul 14 21:24 linux-source-5.10/
> -rw-r--r--  1 root root 120529768 Aug 16 21:52 linux-source-5.10.tar.xz
> drwxr-xr-x  2 root root  4096 Jan 30  2023 percona-server/
> lrwxrwxrwx  1 root root28 Jul 29  2022 vboxhost-6.1.36 -> 
> /opt/VirtualBox/src/vboxhost
> lrwxrwxrwx  1 root root32 Apr 17 19:32 vboxhost-7.0.8 -> 
> ../share/virtualbox/src/vboxhost
> 
> 
> I have downloaded the source code of lustre 2.15.3:
> 
> # git checkout 2.15.3
> # git clone git://git.whamcloud.com/fs/lustre-release.git
> 
> - and I'm trying to build it, following 
> https://wiki.lustre.org/Compiling_Lustre
> 
> I've got through 'autogen.sh' and 'configure' and most of 'make debs', but 
> when it comes to patching:
> 
> cd linux-stage && quilt push -a -q
> Applying patch patches/rhel8/ext4-inode-version.patch
> Applying patch patches/linux-5.4/ext4-lookup-dotdot.patch
> Applying patch patches/suse15/ext4-print-inum-in-htree-warning.patch
> Applying patch patches/linux-5.8/ext4-prealloc.patch
> Applying patch patches/ubuntu18/ext4-osd-iop-common.patch
> Applying patch patches/linux-5.10/ext4-misc.patch
> 1 out of 4 hunks FAILED
> Patch patches/linux-5.10/ext4-misc.patch does not apply (enforce with -f)
> make[2]: *** [autoMakefile:645: sources] Error 1
> make[2]: Leaving directory '/root/repos/lustre-release/ldiskfs'
> make[1]: *** [autoMakefile:652: all-recursive] Error 1
> make[1]: Leaving directory '/root/repos/lustre-release'
> make: *** [autoMakefile:524: all] Error 2
> 
> 
> My best guess is that it is because the running kernel version doesn't 
> exactly match the kernel source tree, but I can't seem to find that version. 
> Am I right - and if so, where would I go to download the right kernel tree?
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] File size discrepancy on lustre

2023-09-15 Thread Andreas Dilger via lustre-discuss
Are you using any file mirroring (FLR, "lfs mirror extend") on the files, 
perhaps before the "lfs getstripe" was run?
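
One way to check a suspect file for leftover mirrors (a sketch; option
availability depends on the client release):

  f=szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
  lfs getstripe -v "$f"                 # a mirrored file shows multiple components
  lfs getstripe --component-count "$f"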

On Sep 15, 2023, at 08:12, Kurt Strosahl via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Good Morning,

We have encountered a very odd issue.  Where files are being created that 
show as double the size under du than they do using ls or du --apparent-size.

under ls we see 119G
~> ls -lh \
> szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
-rw-rw-r-- 1 edwards lattice 119G Sep 14 21:48 
szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b

which du --apparent-size agrees with
~> du -h --apparent-size \
> szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
119G
szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
under du we see 273G

However du itself shows more than double (so we are beyond "padding out a 
block" size).
~> du -h \
> szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
273G
szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b

There is nothing unusual going on via the file layout according to lfs 
getstripe:
~> lfs getstripe \
> szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:   raid0
lmm_layout_gen:0
lmm_stripe_offset: 0
lmm_pool:  production
obdidx   objid   objid   group
 0 7431775   0x71665f0

Client is running:
lustre-client-2.12.6-1.el7.centos.x86_64

lustre servers are:
lustre-osd-zfs-mount-2.12.9-1.el7.x86_64
kmod-lustre-osd-zfs-2.12.9-1.el7.x86_64
kernel-3.10.0-1127.8.2.el7_lustre.x86_64
lustre-2.12.9-1.el7.x86_64
kernel-devel-3.10.0-1127.8.2.el7_lustre.x86_64
kmod-lustre-2.12.9-1.el7.x86_64
kmod-zfs-0.7.13-1.el7.jlab.x86_64
libzfs2-0.7.13-1.el7.x86_64
zfs-0.7.13-1.el7.x86_64

w/r,
Kurt J. Strosahl (he/him)
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Getting started with Lustre on RHEL 8.8

2023-09-12 Thread Andreas Dilger via lustre-discuss
On Sep 12, 2023, at 22:31, Cyberxstudio cxs 
mailto:cyberxstudio.cl...@gmail.com>> wrote:

Hi I get this error while installing lustre and other packages

[root@localhost ~]# yum --nogpgcheck --enablerepo=lustre-server install \
> kmod-lustre-osd-ldiskfs \
> lustre-dkms \
> lustre-osd-ldiskfs-mount \
> lustre-osd-zfs-mount \
> lustre \
> lustre-resource-agents \
> zfs
Updating Subscription Management repositories.
Last metadata expiration check: 0:00:58 ago on Wed 13 Sep 2023 09:27:59 AM PKT.
Error:
 Problem: conflicting requests
  - nothing provides resource-agents needed by 
lustre-resource-agents-2.15.3-1.el8.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use 
not only best candidate packages)

You don't need this package to start. It is used for HA failover of storage 
between servers with Corosync/Pacemaker.

You also do not need the "lustre-dkms" package - that is for building Lustre 
clients from scratch.

You also only need one of ldiskfs or ZFS.  If you don't have RAID storage, then 
ZFS is probably more useful,
while ldiskfs is more of a "traditional" filesystem (based on ext4).
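
So a trimmed-down install for an ldiskfs-based test setup would look something
like this (package set is a sketch — add the matching lustre kernel packages
from the same repository):

  yum --nogpgcheck --enablerepo=lustre-server install \
      lustre \
      kmod-lustre \
      kmod-lustre-osd-ldiskfs \
      lustre-osd-ldiskfs-mount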

Cheers, Andreas

[root@localhost ~]# dnf install resource-agents
Updating Subscription Management repositories.
Last metadata expiration check: 0:02:01 ago on Wed 13 Sep 2023 09:27:59 AM PKT.
No match for argument: resource-agents
Error: Unable to find a match: resource-agents
[root@localhost ~]#

On Wed, Sep 13, 2023 at 9:10 AM Cyberxstudio cxs 
mailto:cyberxstudio.cl...@gmail.com>> wrote:
Thank you for the information.

On Tue, Sep 12, 2023 at 8:40 PM Andreas Dilger 
mailto:adil...@whamcloud.com>> wrote:
Hello,
The preferred path to set up Lustre depends on what you are planning to do with 
it?  If for regular usage it is easiest to start with RPMs built for the distro 
from 
https://downloads.whamcloud.com/public/lustre/latest-release/
 (you can also use the server RPMs for a client if you want).

The various "client" packages for RHEL, SLES, Ubuntu can install directly on 
the vendor kernels, but the provided server RPMs also need the matching kernel. 
 You only need one of the ldiskfs (ext4) or ZFS packages, not both.

It isn't *necessary* to build/patch your kernel for the server, though the 
pre-built server download packages have patched the kernel to add integrated 
T10-PI support (which many users do not need).  You can get unpatched el8 
server RPMs directly from the builders:
https://build.whamcloud.com/job/lustre-b2_15-patchless/48/arch=x86_64,build_type=server,distro=el8.7,ib_stack=inkernel/artifact/artifacts/

If you plan to run on non-standard kernels, then you can build RPMs for your 
particular kernel. The easiest way is to just rebuild the SPRM package:
 https://wiki.whamcloud.com/pages/viewpage.action?pageId=8556211

If you want to do Lustre development you should learn how to build from a Git 
checkout:
https://wiki.whamcloud.com/display/PUB/Building+Lustre+from+Source

Cheers, Andreas

On Sep 12, 2023, at 03:25, Cyberxstudio cxs via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:


Hi, I am setting up a lab environment for lustre. I have 3 VMs of RHEL 8.8, I 
have studied the documentation but it does not provide detail for el 8 rather 
el 7. Please guide me how to start
Thank You
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Getting started with Lustre on RHEL 8.8

2023-09-12 Thread Andreas Dilger via lustre-discuss
Hello,
The preferred path to set up Lustre depends on what you are planning to do with 
it?  If for regular usage it is easiest to start with RPMs built for the distro 
from 
https://downloads.whamcloud.com/public/lustre/latest-release/
 (you can also use the server RPMs for a client if you want).

The various "client" packages for RHEL, SLES, Ubuntu can install directly on 
the vendor kernels, but the provided server RPMs also need the matching kernel. 
 You only need one of the ldiskfs (ext4) or ZFS packages, not both.

It isn't *necessary* to build/patch your kernel for the server, though the 
pre-built server download packages have patched the kernel to add integrated 
T10-PI support (which many users do not need).  You can get unpatched el8 
server RPMs directly from the builders:
https://build.whamcloud.com/job/lustre-b2_15-patchless/48/arch=x86_64,build_type=server,distro=el8.7,ib_stack=inkernel/artifact/artifacts/

If you plan to run on non-standard kernels, then you can build RPMs for your 
particular kernel. The easiest way is to just rebuild the SRPM package:
 https://wiki.whamcloud.com/pages/viewpage.action?pageId=8556211
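
In rough terms (a sketch only — the exact source RPM name depends on the
release you download, and the kernel-devel package matching your running kernel
must be installed first):

  dnf install rpm-build kernel-devel-$(uname -r)
  rpmbuild --rebuild lustre-2.15.3-1.src.rpm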

If you want to do Lustre development you should learn how to build from a Git 
checkout:
https://wiki.whamcloud.com/display/PUB/Building+Lustre+from+Source

Cheers, Andreas

On Sep 12, 2023, at 03:25, Cyberxstudio cxs via lustre-discuss 
 wrote:


Hi, I am setting up a lab environment for lustre. I have 3 VMs of RHEL 8.8, I 
have studied the documentation but it does not provide detail for el 8 rather 
el 7. Please guide me how to start
Thank You
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] questions about group locks / LDLM_FL_NO_TIMEOUT flag

2023-08-30 Thread Andreas Dilger via lustre-discuss
You can't directly dump the holders of a particular lock, but it is possible to 
dump the list of FIDs that each client has open. 

  mds# lctl get_param mdt.*.exports.*.open_files | egrep "=|FID" | grep -B1 FID

That should list all client NIDs that have the FID open. 
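
To map such a FID back to a pathname, something like the following can be run 
on a client (the mountpoint and FID here are only placeholders):

  client# lfs fid2path /mnt/lustre "[0x200000401:0x1:0x0]"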

It shouldn't be possible for clients to "leak" a group lock, since they are 
tied to an open file handle and are dropped as soon as the file is closed, or 
by the kernel when it closes the open fds when the process is killed.

Cheers, Andreas

> On Aug 30, 2023, at 07:42, Bertschinger, Thomas Andrew Hjorth via 
> lustre-discuss  wrote:
> 
> Hello, 
> 
> We have a few files created by a particular application where reads to those 
> files consistently hang. The debug log on a client attempting a read() has 
> messages like:
> 
>> ldlm_completion_ast(): waiting indefinitely because of NO_TIMEOUT ...
> 
> This is printed when the flag LDLM_FL_NO_TIMEOUT is true, and code comments 
> above that flag imply that it is set for group locks. So, we've been trying 
> to identify if the application in question uses group locks. (I have reached 
> out to the app's developers but do not have a response yet.)
> 
> If I open the file with O_NONBLOCK, any reads immediately return with error 
> 11 / EWOULDBLOCK. This behavior is documented to occur for Lustre group locks.
> 
> However, I would like to clarify whether the LDLM_FL_NO_TIMEOUT flag is true 
> *only* when a group lock is held, or are there other circumstances where the 
> behavior described above could occur?
> 
> If this is caused by a group lock is there an easy way to tell from server 
> side logs or data what client(s) have the group lock and are blocking access? 
> The motivation is that we believe any jobs accessing these files have long 
> since been killed, and no nodes from the job are expected to be holding the 
> files open. We would like to confirm or rule out that possibility by easily 
> identifying any such clients.
> 
> Advice on how to effectively debug ldlm issues could be useful beyond just 
> this issue. In general, if there is a reliable way to start from a log entry 
> for a lock like 
> 
>> ... ns: lustre-OST-osc-9a0942c79800 lock: 
>> 3f3a5950/0xe54ca8d2d7b66d03 lrc: 4/1,0 mode: --/PR  ...
> 
> and get information about the client(s) holding that lock and any contending 
> locks, that would be helpful in debugging situations like this.
> 
> server: 2.15.2
> client that application ran on: 2.15.0.4_rc2_cray_172_ge66844d
> client that I tested file access from: 2.15.2
> 
> Thanks!
> 
> - Thomas Bertschinger
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] question about rename operation ?

2023-08-16 Thread Andreas Dilger via lustre-discuss
For any directory rename that is not just a simple name change (i.e. the parent 
directory is not the same for both the source and the target), the MDS thread 
doing the rename will take the LDLM "big filesystem lock" (BFL), which is a lock 
on a specific FID used for global rename serialization.

This ensures that there is only one thread in the whole filesystem doing a 
rename that
may create directory loops, and the parent/child relationship is checked under
this lock to ensure there are no loops.

For regular file renames, and directory renames within a single parent, it is 
possible
to do parallel renames, and the MDS only locks the parent, source, and target 
FIDs to
avoid multiple threads modifying the same file or directory at once.

The client will also take the VFS rename lock before sending the rename RPC, 
which serializes the changes on the client, but does not help anything for the 
rest of the filesystem.  This unfortunately also serializes regular renames on 
a single client, but they
can still be done in parallel on multiple clients.

Cheers, Andreas

On Aug 15, 2023, at 20:14, 宋慕晗 via lustre-discuss 
 wrote:


Dear lustre maintainers,
There seems to be a bug in lustre *ll_rename* function:
/* VFS has locked the inodes before calling this */
ll_set_inode_lock_owner(src);
ll_set_inode_lock_owner(tgt);
if (tgt_dchild->d_inode)
ll_set_inode_lock_owner(tgt_dchild->d_inode);

Here we lock the src directory, target directory, and lock the target child if 
exists. But we don't lock the src child, but it's possible to change the ".." 
pointer of src child.
see this in xfs: https://www.spinics.net/lists/linux-xfs/msg68693.html

And I am also wondering how lustre deal with concurrent rename ?  Specifically, 
my concern revolves around the potential for directory loops when two clients 
initiate renames simultaneously.
In the VFS, there's a filesystem-specific vfs_rename_mutex that serializes the 
rename operation. In Ceph, I noticed the presence of a global client lock. 
However, I'm uncertain if the MDS serializes rename requests.
Consider the following scenario:

        a
       / \
      b   c
     / \
    d   e
       / \
      f   g

If Client 1 attempts to rename "c" to "f" while Client 2 tries to rename "b" to 
"g" concurrently, and both succeed, we could end up with a loop in the 
directory structure.
Could you please provide clarity on how lustre handles such situations? Your 
insights would be invaluable.
Thank you in advance for your time and assistance.
Warm regards,
Muhan Song

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] getting without inodes

2023-08-11 Thread Andreas Dilger via lustre-discuss
The t0 filesystem OSTs are formatted for an average file size of 70TB / 300M 
inodes = 240KB/inode.

The t1 filesystem OSTs are formatted for an average file size of 500TB / 65M 
inodes = 7.7MB/inode.

So not only are the t1 OSTs larger, but they have fewer inodes (by a factor of 
32x). This must have been done with specific formatting options since the 
default inode ratio is 1MiB/inode for the OSTs.

There isn't any information about the actual space usage (eg. "lfs df"), so I 
can't calculate whether the default 1MiB/inode would be appropriate for your 
filesystem, but definitely it was formatted with the expectation that the 
average file size would become larger as they were copied to t1 (eg. combined 
in a tarfile or something).

Unfortunately, there is no way to "fix" this in place, since the inode ratio 
for ldiskfs/ext4 filesystems is decided at format time.

One option is to use "lfs find" to find files on an OST (eg. OST0003 which is 
the least used), disable creates on that OST, and use "lfs migrate" to migrate 
all of the files to other OSTs, and then reformat the OST with more inodes and 
repeat this process for all of the OSTs.
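
As a rough sketch of those steps (the target names and OST index below are 
placeholders; see the OST removal and migration sections of the manual for the 
full procedure):

  # on the MDS: stop allocating new objects on the OST being drained
  mds# lctl set_param osp.t1-OST0003-osc-MDT0000.max_create_count=0
  # on a client: move the existing files off that OST
  client# lfs find /lustre/t1 --ost 3 -type f | lfs_migrate -y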

Unfortunately, the t1 filesystem only has 8.5M free inodes and there are 27M 
inodes on OST0003, so it couldn't be drained completely to perform this 
process. You would need to delete enough files from t1 to free up inodes to do 
the migration, or eg. tar them up into larger files to reduce the inode count.

The OST migration/replacement process is described in the Lustre Operations 
Manual.

Cheers, Andreas

On Aug 11, 2023, at 01:17, Carlos Adean via lustre-discuss 
 wrote:


Hello experts,

We have a Lustre with two tiers T0(SSD) and T1(HDD), the first with 70TB and 
the second one with ~500TB.

I'm experiencing a problem that the T1 has much less inodes than T0 and that is 
getting without inodes in the OSTs, so I'd like to understand the source of 
this and how to fix that.


Thanks in advance.



=== T0

$ lfs df -i /lustre/t0
UUID  Inodes   IUsed   IFree IUse% Mounted on
t0-MDT0000_UUID    390627328   1499300   389128028   1% /lustre/t0[MDT:0]
t0-OST0000_UUID     14651392   1097442    13553950   8% /lustre/t0[OST:0]
t0-OST0001_UUID     14651392   1097492    13553900   8% /lustre/t0[OST:1]
t0-OST0002_UUID     14651392   1097331    13554061   8% /lustre/t0[OST:2]
t0-OST0003_UUID     14651392   1097563    13553829   8% /lustre/t0[OST:3]
t0-OST0004_UUID     14651392   1097576    13553816   8% /lustre/t0[OST:4]
t0-OST0005_UUID     14651392   1097505    13553887   8% /lustre/t0[OST:5]
t0-OST0006_UUID     14651392   1097524    13553868   8% /lustre/t0[OST:6]
t0-OST0007_UUID     14651392   1097596    13553796   8% /lustre/t0[OST:7]
t0-OST0008_UUID     14651392   1097442    13553950   8% /lustre/t0[OST:8]
t0-OST0009_UUID     14651392   1097563    13553829   8% /lustre/t0[OST:9]
t0-OST000a_UUID     14651392   1097515    13553877   8% /lustre/t0[OST:10]
t0-OST000b_UUID     14651392   1096524    13554868   8% /lustre/t0[OST:11]
t0-OST000c_UUID     14651392   1096608    13554784   8% /lustre/t0[OST:12]
t0-OST000d_UUID     14651392   1096524    13554868   8% /lustre/t0[OST:13]
t0-OST000e_UUID     14651392   1096641    13554751   8% /lustre/t0[OST:14]
t0-OST000f_UUID     14651392   1096647    13554745   8% /lustre/t0[OST:15]
t0-OST0010_UUID     14651392   1096705    13554687   8% /lustre/t0[OST:16]
t0-OST0011_UUID     14651392   1096616    13554776   8% /lustre/t0[OST:17]
t0-OST0012_UUID     14651392   1096520    13554872   8% /lustre/t0[OST:18]
t0-OST0013_UUID     14651392   1096598    13554794   8% /lustre/t0[OST:19]
t0-OST0014_UUID     14651392   1096669    13554723   8% /lustre/t0[OST:20]
t0-OST0015_UUID     14651392   1096570    13554822   8% /lustre/t0[OST:21]

filesystem_summary:299694753 1499300   298195453   1% /lustre/t0


=== T1

$  lfs df -i /lustre/t1
UUID  Inodes   IUsed   IFree IUse% Mounted on
t1-MDT0000_UUID   1478721536  56448788  1422272748   4% /lustre/t1[MDT:0]
t1-OST0000_UUID     30492032  30491899         133 100% /lustre/t1[OST:0]
t1-OST0001_UUID     30492032  30491990          42 100% /lustre/t1[OST:1]
t1-OST0002_UUID     30492032  30491916         116 100% /lustre/t1[OST:2]
t1-OST0003_UUID     30492032  27471050     3020982  91% /lustre/t1[OST:3]
t1-OST0004_UUID     30492032  30491989          43 100% /lustre/t1[OST:4]
t1-OST0005_UUID     30492032  30491960          72 100% /lustre/t1[OST:5]
t1-OST0006_UUID     30492032  30491948          84 100% /lustre/t1[OST:6]
t1-OST0007_UUID     30492032  30491939          93 100% /lustre/t1[OST:7]
t1-OST0008_UUID     30492032  29811803      680229  98% /lustre/t1[OST:8]
t1-OST0009_UUID     30492032  29808261      683771  98% /lustre/t1[OST:9]
t1

Re: [lustre-discuss] Pool_New Naming Error

2023-08-08 Thread Andreas Dilger via lustre-discuss


On Aug 8, 2023, at 18:41, Baucum, Rashun via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Hello,

I am running into an issue when attempting to setup pooling. The commands are 
being run on a server that hosts the MDS and MGS:

# lctl dl
  0 UP osd-ldiskfs lfs1-MDT-osd lfs1-MDT-osd_UUID 12
  1 UP osd-ldiskfs MGS-osd MGS-osd_UUID 4
  2 UP mgs MGS MGS 18
  3 UP mgc MGC@tcp 
6a356911-e772-c2ac-20a3-2dade59f93bb 4
  4 UP mds MDS MDS_uuid 2
  5 UP lod lfs1-MDT-mdtlov lfs1-MDT-mdtlov_UUID 3
  6 UP mdt lfs1-MDT lfs1-MDT_UUID 24
  7 UP mdd lfs1-MDD lfs1-MDD_UUID 3
  8 UP qmt lfs1-QMT lfs1-QMT_UUID 3
  9 UP osp lfs1-OST-osc-MDT lfs1-MDT-mdtlov_UUID 4
10 UP osp lfs1-OST0001-osc-MDT lfs1-MDT-mdtlov_UUID 4
11 UP osp lfs1-OST0004-osc-MDT lfs1-MDT-mdtlov_UUID 4
12 UP osp lfs1-OST0005-osc-MDT lfs1-MDT-mdtlov_UUID 4
13 UP osp lfs1-OST0002-osc-MDT lfs1-MDT-mdtlov_UUID 4
14 UP osp lfs1-OST0003-osc-MDT lfs1-MDT-mdtlov_UUID 4
15 UP lwp lfs1-MDT-lwp-MDT lfs1-MDT-lwp-MDT_UUID 4

# lctl pool_new lustre.pool1
error: pool_new can contain only alphanumeric characters, underscores, and 
dashes besides the required '.'
pool_new: Invalid argument

dmesg logs:
LustreError: 19441:0:(llog.c:416:llog_init_handle()) llog type is not specified!
LustreError: 19441:0:(mgs_llog.c:5513:mgs_pool_cmd()) lustre is not defined

Is this an error anyone has seen or knows a solution to?

The problem is that the pool name should be "fsname.pool_name", and your 
filesystem is named "lfs1" and not "lustre".  The
last error message above is trying to say this, but it could be more clear, 
like "filesystem name 'lustre' is not defined" or similar.  A patch to fix this 
would be welcome.

So your command should be:

lctl pool_new lfs1.pool1

though I would suggest a more descriptive name than "pool1" (e.g. "flash" or 
"new_osts" or whatever), but that is really up to you.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] how does lustre handle node failure

2023-07-22 Thread Andreas Dilger via lustre-discuss
Shawn,
Lustre handles the largest filesystems in the world, hundreds of PB in size, so 
there are definitely Lustre filesystems with hundreds of servers.

In large storage clusters the servers failover in pairs or quads, since the 
storage is typically not on a single global SAN for all nodes to access, so 
there is definitely not a single huge HA cluster for all of the servers in the 
filesystem.

Cheers, Andreas

On Jul 21, 2023, at 16:09, Shawn via lustre-discuss 
 wrote:


Hi Laura,  thanks for your reply.
It seems the OSSs will share the disks created from a shared SAN.  So the 
OSS-pairs can failover in a pre-defined manner if one node is down, coordinated 
by a HA manager.

This can certainly work on a limited scale.  I'm curious if this static schema 
can scale to a large cluster with 100s of OSSs servers?


regards,
Shawn




On Tue, Jul 18, 2023 at 1:25 PM Laura Hild 
mailto:l...@jlab.org>> wrote:
I'm not familiar with using FLR to tolerate OSS failures.  My site does the HA 
pairs with shared storage method.  It's sort of described in the manual

  https://doc.lustre.org/lustre_manual.xhtml#configuringfailover

but in more, Pacemaker-specific detail at

  
https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker

and

  
https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] File system global quota

2023-07-20 Thread Andreas Dilger via lustre-discuss
Probably the closest that could be achieved like this would be to set the 
ldiskfs reserved space on the OSTs like:

  tune2fs -m 10 /dev/sdX

That sets the root reserved space to 10% of the filesystem, and non-root users 
wouldn't be able to allocate blocks once the filesystem hits 90% full. This 
doesn't reserve specific blocks, just a percentage of the total capacity, so it 
should avoid fragmentation from the filesystem getting too full. 
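
To verify the setting afterwards (the device name is a placeholder):

  tune2fs -l /dev/sdX | grep -i "reserved block count"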

Cheers, Andreas

> On Jul 20, 2023, at 08:11, Sebastian Oeste via lustre-discuss 
>  wrote:
> 
> Hi there,
> 
> we operate multiple lustre file systems at our side and wondering if it is 
> possible to have a file system global quota in lustre?
> For example, that only 90 percent of the file system can be used. The Idea is 
> to avoid situations where a file system is 99% full.
> I was looking through mount options and lfs-quota but there just quota for 
> users, groups or projects.
> I there a way to achieve something like that?
> 
> Thanks,
> sebastian
> -- 
> Sebastian Oeste, M.Sc.
> Computer Scientist
> 
> Technische Universität Dresden
> Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
> Tel. +49 (0)351 463-32405
> E-Mail: sebastian.oe...@tu-dresden.de
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Old Lustre Filesystem migrate to newer servers

2023-07-19 Thread Andreas Dilger via lustre-discuss
Wow,  Lustre 1.6 is really old, released in 2009.  Even Lustre 2.6 would be 
pretty old, released in 2014.

While there haven't been a *lot* of on-disk format changes over the years, 
there was a fairly significant change in Lustre 2.0 that would probably make 
upgrading the filesystem directly to a more recent Lustre release (e.g. 2.12.9) 
difficult.  We've long since removed compatibility for such old versions from 
the code.

My recommendation would be to install a VM with an old version of RHEL and 
Lustre (e.g. RHEL5 and Lustre 1.8.9 from 
https://downloads.whamcloud.com/public/lustre/lustre-1.8.9-wc1/el5/server/RPMS/x86_64/)
 and attach the storage to the node.  You would probably need to do the 
"writeconf" process to change the server NIDs from their current IB addresses 
to TCP.  Other than that, if the node can see the storage, Lustre should be 
able to mount it.
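
As a rough sketch of the writeconf step (device paths and the NID below are 
placeholders; unmount all targets first, and see "Regenerating Lustre 
Configuration Logs" in the manual for the complete procedure):

  mds# tunefs.lustre --writeconf /dev/mdtdev
  oss# tunefs.lustre --writeconf --erase-params --mgsnode=192.168.1.10@tcp /dev/ostdev
  # then mount the MGS/MDT first, followed by the OSTs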

I would then strongly recommend to copy all of the data to new hardware, 
instead of running it like this.  *Even if* the storage is currently working, 
it is 10+ years old and likely to also fail soon.  Also, the storage is likely 
to be small and slow compared to any modern devices, and should be refreshed 
before the data is permanently lost.

Cheers, Andreas

On Jul 19, 2023, at 21:31, Richard Chang via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Hi,

I have an existing, old Lustre, don't remember the exact version, but most most 
likely 1.6 . The Lustre Servers had crashed and can't be fixed, HW wise.

The MDT/MGT and OSTs are housed in a backend FC based DAS.

How easy or difficult would it be to create a few new servers and attach this 
backend storage to them in order to get the data back?

I am not saying it is straightforward, but there is no harm in trying. We can 
even load the old version of the OS and Lustre software.

All the user is concerned about is the data, which I am sure is still safe in 
the backend storage box.

One thing though. The old servers had Infiniband as the LNET, but the newer 
ones will use TCP.

Any help and advice will be highly appreciated.

Thanks & regards,

Richard.





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] New client mounts fail after deactivating OSTs

2023-07-18 Thread Andreas Dilger via lustre-discuss
Brian,
Please file a ticket in LUDOC with details of how the manual should be updated. 
Ideally, including a patch. :-)

Cheers, Andreas

On Jul 11, 2023, at 15:39, Brad Merchant  
wrote:


We recreated the issue in a test cluster and it was definitely the llog_cancel 
steps that caused the issue. Clients couldn't process the llog properly on new 
mounts and would fail. We had to completely clear the llog and --writeconf 
every target to regenerate it from scratch.

The cluster is up and running now but I would certainly recommend at least 
revising that section of the manual.

On Mon, Jul 10, 2023 at 5:22 PM Brad Merchant 
mailto:bmerch...@cambridgecomputer.com>> wrote:
We deactivated half of 32 OSTs after draining them. We followed the steps in 
section 14.9.3 of the lustre manual

https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost

After running the steps in subhead "3. Deactivate the OST." on OST0010-OST001f, 
new client mounts fail with the below log messages. Existing client mounts seem 
to function correctly but are on a bit of a ticking timebomb because they are 
configured with autofs.

The llog_cancel steps are new to me and the issues seemed to appear after those 
commands were issued (can't say that 100% definitively however). Servers are 
running 2.12.5 and clients are on 2.14.x


Jul 10 15:22:40 adm-sup1 kernel: LustreError: 
26814:0:(obd_config.c:1514:class_process_config()) no device for: 
hydra-OST0010-osc-8be5340c2000
Jul 10 15:22:40 adm-sup1 kernel: LustreError: 
26814:0:(obd_config.c:2038:class_config_llog_handler()) MGC172.16.100.101@o2ib: 
cfg command failed: rc = -22
Jul 10 15:22:40 adm-sup1 kernel: Lustre:cmd=cf00f 0:hydra-OST0010-osc  
1:osc.active=0
Jul 10 15:22:40 adm-sup1 kernel: LustreError: 15b-f: MGC172.16.100.101@o2ib: 
Configuration from log hydra-client failed from MGS -22. Check client and MGS 
are on compatible version.
Jul 10 15:22:40 adm-sup1 kernel: Lustre: hydra: root_squash is set to 99:99
Jul 10 15:22:40 adm-sup1 systemd-udevd[26823]: Process '/usr/sbin/lctl 
set_param 'llite.hydra-8be5340c2000.nosquash_nids=192.168.80.84@tcp 
192.168.80.122@tcp 192.168.80.21@tcp 172.16.90.11@o2ib 172.16.100.211@o2ib 
172.16.100.212@o2ib 172.16.100.213@o2ib 172.16.100.214@o2ib 172.16.100.215@o2ib 
172.16.90.51@o2ib'' failed with exit code 2.
Jul 10 15:22:40 adm-sup1 kernel: Lustre: Unmounted hydra-client
Jul 10 15:22:40 adm-sup1 kernel: LustreError: 
26803:0:(obd_mount.c:1680:lustre_fill_super()) Unable to mount  (-22)



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Use of lazystatfs

2023-07-05 Thread Andreas Dilger via lustre-discuss
On Jul 5, 2023, at 07:14, Mike Mosley via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Hello everyone,

We have drained some of our OSS/OSTs and plan to deactivate them soon.  The 
process ahead leads us to a couple of questions that we hope somebody can 
advise us on.

Scenario
We have fully drained the target OSTs using  'lfs find' to identify all files 
located on the targets and then feeding the list to 'lfs migrate. ' A final 
scan shows there are no files left on the targets.

Questions
1) Running 'lfs df -h' still shows some space being used even though we have 
drained all of the data.   Is that normal?  i.e.

UUID   bytesUsed   Available Use% Mounted on
hydra-OST0010_UUID 84.7T  583.8M   80.5T   1% /dfs/hydra[OST:16]
hydra-OST0011_UUID 84.7T  581.4M   80.5T   1% /dfs/hydra[OST:17]
hydra-OST0012_UUID 84.7T  581.7M   80.5T   1% /dfs/hydra[OST:18]
hydra-OST0013_UUID 84.7T  582.4M   80.5T   1% /dfs/hydra[OST:19]
hydra-OST0014_UUID 84.7T  584.1M   80.5T   1% /dfs/hydra[OST:20]
hydra-OST0015_UUID 84.7T  583.4M   80.5T   1% /dfs/hydra[OST:21]
hydra-OST0016_UUID 84.7T  583.6M   80.5T   1% /dfs/hydra[OST:22]
hydra-OST0017_UUID 84.7T  581.8M   80.5T   1% /dfs/hydra[OST:23]
hydra-OST0018_UUID 84.7T  582.6M   80.5T   1% /dfs/hydra[OST:24]
hydra-OST0019_UUID 84.7T  582.7M   80.5T   1% /dfs/hydra[OST:25]
hydra-OST001a_UUID 84.7T  580.0M   80.5T   1% /dfs/hydra[OST:26]
hydra-OST001b_UUID 84.7T  580.4M   80.5T   1% /dfs/hydra[OST:27]
hydra-OST001c_UUID 84.7T  582.1M   80.5T   1% /dfs/hydra[OST:28]
hydra-OST001d_UUID 84.7T  583.2M   80.5T   1% /dfs/hydra[OST:29]
hydra-OST001e_UUID 84.7T  583.7M   80.5T   1% /dfs/hydra[OST:30]
hydra-OST001f_UUID 84.7T  587.7M   80.5T   1% /dfs/hydra[OST:31]

I would suggest to unmount the OSTs from Lustre and mount via ldiskfs, then run 
"find $MOUNT/O -type f -ls" to find if there are any in-use files left.  It is 
likely that the 580M used by all of the OSTs is just residual logs and large 
directories under O/*.  There might be some hundreds or thousands of 
zero-length object files that were precreated but never used, that will 
typically have an unusual file access mode 07666 and can be ignored.
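
A minimal sketch of that check, after the OST has been unmounted from Lustre 
(device and mountpoint names are placeholders):

  oss# mount -t ldiskfs -o ro /dev/ostdev /mnt/ost_ldiskfs
  oss# find /mnt/ost_ldiskfs/O -type f -size +0 -ls   # list non-empty objects
  oss# umount /mnt/ost_ldiskfs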

2) According to some comments, prior to deactivating the OSS/OSTs, we should 
add the 'lazystatfs' option to all of our client mounts so that they do not 
hang once we deactivate some of the OSTs.   Is that correct?  If so, why would 
you not just always have that option set?  What are the ramifications of 
doing it well in advance of the OST deactivations?

The lazystatfs feature has been enabled by default since Lustre 2.9 so I don't 
think you need to do anything with it anymore.  The "lfs df" command will 
automatically skip unconfigured OSTs.


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Rocky 9.2/lustre 2.15.3 client questions

2023-06-23 Thread Andreas Dilger via lustre-discuss
Applying the LU-16626 patch locally should fix the issue, and has no risk since 
it is only fixing a build issue that affects an obscure diagnostic tool.

That said, I've cherry-picked that patch back to b2_15, so it should be 
included into 2.15.4.

https://review.whamcloud.com/51426

Cheers, Andreas

On Jun 23, 2023, at 05:04, Mountford, Christopher J. (Dr.) via lustre-discuss 
 wrote:

Hi,

I'm building the lustre client/kernel modules for our new HPC cluster and have 
a couple of questions:

1) Are there any known issues running lustre 2.15.3 clients and lustre 2.12.9 
servers? I haven't seen anything showstopping on the mailing list or in JIRA 
but wondered if anyone had run into problems.

2) Is it possible to get the dkms kernel rpm to work with Rocky/RHEL 9.2? If I 
try to install the lustre-client-dkms rpm I get the following error:

error: Failed dependencies:
   /usr/bin/python2 is needed by lustre-client-dkms-2.15.3-1.el9.noarch

- Not surprisingly as I understand that python2 is not available for rocky/rhel 
9

I see there is a patch for 2.16 (from LU-16626). Not a major problem as I can 
build kmod-lustre-client rpms for our kernel/ofed, but I would prefer to use 
dkms if possible.

Kind Regards,
Christopher.


Dr. Christopher Mountford,
System Specialist,
RCS,
Digital Services.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] CentOS Stream 8/9 support?

2023-06-22 Thread Andreas Dilger via lustre-discuss
On Jun 22, 2023, at 06:58, Will Furnass via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Hi,

I imagine that many here might have seen RedHat's announcement
yesterday about ceasing to provide sources for EL8 and EL9 to those
who aren't paying customers (see [1] - CentOS 7 unaffected).  For many
HPC sites using or planning to adopt Alma/Rocky 8/9 this prompts a
change of tack:
- buy RHEL 8/9
- switch to CentOS 8/9 Stream for something EL-like
- switch to something else (SUSE or Ubuntu)

Those wanting to stick with EL-like will be interested in how well
Lustre works with Stream 8/9.  Seems it's not in the support matrix
[2].  Have others here used Lustre with Stream successfully?  If so,
anything folks would care to share about gotchas if encountered?  Did
you used patched or unpatched kernels?

For clients I don't think it will matter much, since users often have to build
their own client RPMs (possibly via DKMS), or they use weak updates to
avoid rebuilding the RPMs at all for client updates.  The Lustre client code
itself works with a wide range of kernel versions (3.10-6.0 currently), and
I suspect that relatively few production systems want to be on the bleeding
edge of Linux kernels either, so the lack of 6.1-6.3 kernel support is likely
not affecting anyone, and even then patches are already in flight for them.

Definitely servers will be more tricky, since the baseline will always be
moving, and more quickly than EL kernels.

[1] https://www.redhat.com/en/blog/furthering-evolution-centos-stream
[2] https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix

Cheers,

Will

--
Dr Will Furnass | Research Platforms Engineer
IT Services | University of Sheffield


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] No space left on device MDT DoM but not full nor run out of inodes

2023-06-22 Thread Andreas Dilger via lustre-discuss
There is a bug in the grant accounting that leaks under certain operations 
(maybe O_DIRECT?).  It is resolved by unmounting and remounting the clients, 
and/or upgrading.  There was a thread about it on lustre-discuss a couple of 
years ago.

Cheers, Andreas

On Jun 20, 2023, at 09:32, Jon Marshall via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Sorry, typo in the version number - the version we are actually running is 
2.12.6

From: Jon Marshall
Sent: 20 June 2023 16:18
To: lustre-discuss@lists.lustre.org 
mailto:lustre-discuss@lists.lustre.org>>
Subject: No space left on device MDT DoM but not full nor run out of inodes

Hi,

We've been running lustre 2.15.1 in production for over a year and recently 
decided to enable PFL with DoM on our filesystem. Things have been fine up 
until last week, when users started reporting issues copying files, 
specifically "No space left on device". The MDT is running ldiskfs as the 
backend.

I've searched through the mailing list and found a couple of people reporting 
similar problems, which prompted me to check the inode allocation, which is 
currently:

UUID  Inodes   IUsed   IFree IUse% Mounted on
scratchc-MDT0000_UUID  624492544  71144384  553348160  12% /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID   57712579  24489934   33222645  43% /mnt/scratchc[OST:0]
scratchc-OST0001_UUID   57114064  24505876   32608188  43% /mnt/scratchc[OST:1]

filesystem_summary:    136975217  71144384   65830833  52% /mnt/scratchc

So, nowhere near full - the disk usage is a little higher:

UUID   bytesUsed   Available Use% Mounted on
scratchc-MDT0000_UUID  882.1G  451.9G  355.8G  56% /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID   53.6T   22.7T   31.0T  43% /mnt/scratchc[OST:0]
scratchc-OST0001_UUID   53.6T   23.0T   30.6T  43% /mnt/scratchc[OST:1]

filesystem_summary:   107.3T   45.7T   61.6T  43% /mnt/scratchc

But not full either! The errors are accompanied in the logs by:

LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) scratchc-MDT: 
cli ba0195c7-1ab4-4f7c-9e28-8689478f5c17/9e331e231c00 left 82586337280 < 
tot_grant 82586681321 unstable 0 pending 0 dirty 1044480
LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 33050 
previous similar messages

For reference the DoM striping we're using is:

  lcm_layout_gen:0
  lcm_mirror_count:  1
  lcm_entry_count:   3
lcme_id: N/A
lcme_mirror_id:  N/A
lcme_flags:  0
lcme_extent.e_start: 0
lcme_extent.e_end:   1048576
  stripe_count:  0   stripe_size:   1048576   pattern:   mdt
   stripe_offset: -1

lcme_id: N/A
lcme_mirror_id:  N/A
lcme_flags:  0
lcme_extent.e_start: 1048576
lcme_extent.e_end:   1073741824
  stripe_count:  1   stripe_size:   1048576   pattern:   raid0  
 stripe_offset: -1

lcme_id: N/A
lcme_mirror_id:  N/A
lcme_flags:  0
lcme_extent.e_start: 1073741824
lcme_extent.e_end:   EOF
  stripe_count:  -1   stripe_size:   1048576   pattern:   raid0 
  stripe_offset: -1

So the first 1MB on the MDT.

My question is obviously what is causing these errors? I'm not massively 
familiar with Lustre internals, so any pointers on where to look would be 
greatly appreciated!

Cheers
Jon

Jon Marshall
High Performance Computing Specialist



IT and Scientific Computing Team



Cancer Research UK Cambridge Institute
Li Ka Shing Centre | Robinson Way | Cambridge | CB2 0RE
Web | 
Facebook | 
Twitter




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Data stored in OST

2023-05-22 Thread Andreas Dilger via lustre-discuss
Yes, the OSTs must provide internal redundancy - RAID-6 typically. 

There is File Level Redundancy (FLR = mirroring) possible in Lustre file 
layouts, but it is "unmanaged", so users or other system-level tools are 
required to resync FLR files if they are written after mirroring.
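
For illustration, a minimal FLR sequence (the file path is a placeholder):

  # create a file with two mirrors, then resync after it has been rewritten
  client# lfs mirror create -N2 /mnt/lustre/dir/file
  client# lfs mirror resync /mnt/lustre/dir/file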

Cheers, Andreas

> On May 22, 2023, at 09:39, Nick dan via lustre-discuss 
>  wrote:
> 
> 
> Hi
> 
> I had one doubt.
> In lustre, data is divided into stripes and stored in multiple OSTs. So each 
> OST will have some part of data. 
> My question is if one OST fails, will there be data loss?
> 
> Please advise for the same.
> 
> Thanks and regards
> Nick
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] mlx5 errors on oss

2023-05-18 Thread Andreas Dilger via lustre-discuss
I can't comment on the specific network issue, but in general it is far better 
to use the MOFED drivers than the in-kernel ones. 

Cheers, Andreas

> On May 18, 2023, at 09:08, Nehring, Shane R [LAS] via lustre-discuss 
>  wrote:
> 
> Hello all,
> 
> We recently added infiniband to our cluster and are in the process of testing 
> it
> with lustre. We're running the distro provided drivers for the mellanox cards
> with the latest firmware. Overnight we started seeing the following errors on 
> a
> few oss:
> 
> infiniband mlx5_0: dump_cqe:272:(pid 40058): dump error cqe
> : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0030: 00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2
> infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
> : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0030: 00 00 00 00 00 00 88 13 08 00 00 a1 00 c2 8e d2
> infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
> : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0030: 00 00 00 00 00 00 88 13 08 00 00 a2 00 1a 12 d2
> 
> I found a post suggesting this might be iommu related, disabling the iommu
> doesn't seem to help any.
> 
> We're running luster 2.15, more or less at the tip of b2_15
> (b74560d74a9f890838dbf2f0719e3d27c1e5eaf8)
> 
> Has anyone seen this before or have any pointers?
> 
> Thanks
> 
> Shane
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] Re: storing Lustre jobid in file xattrs: seeking feedback

2023-05-15 Thread Andreas Dilger via lustre-discuss
Note that there have been some requests to increase the jobid size (LU-16765) 
so any tools that are accessing the xattr shouldn't assume the jobid is only 32 
bytes in size.
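
If the feature lands with a default such as job_xattr=user.job (still a 
proposal, so the xattr name here is hypothetical), a tool would read it with 
something like:

  client# getfattr -n user.job --only-values /mnt/lustre/path/to/file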

On May 14, 2023, at 13:11, Bertschinger, Thomas Andrew Hjorth 
mailto:bertschin...@lanl.gov>> wrote:

Thanks for the responses.

I like the idea of allowing the xattr name to be a parameter, because while it 
increases the complexity, it seems safer.

The main difficulty I can think of is that user tools that query the jobid will 
need to get the value of the parameter first in order to query the correct 
xattr. Additionally, if the parameter is changed, jobids from old files may be 
missed. This doesn't seem like a big risk however, because I imagine this value 
would be changed rarely if ever.

As for limiting the name to 7 characters, I believe Andreas is referring to the 
xattr name itself, not the contents of the xattr, so there should be no problem 
with storing the full length of a jobid (32 characters) -- but let me know if I 
am not interpreting that correctly.

- Tom Bertschinger

From: Jeff Johnson 
mailto:jeff.john...@aeoncomputing.com>>
Sent: Friday, May 12, 2023 4:56 PM
To: Andreas Dilger
Cc: Bertschinger, Thomas Andrew Hjorth; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: [EXTERNAL] Re: [lustre-discuss] storing Lustre jobid in file xattrs: 
seeking feedback

Just a thought, instead of embedding the jobname itself, perhaps just a least 
significant 7 character sha-1 hash of the jobname. Small chance of collision, 
easy to decode/cross reference to jobid when needed. Just a thought.

--Jeff


On Fri, May 12, 2023 at 3:08 PM Andreas Dilger via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org><mailto:lustre-discuss@lists.lustre.org>>
 wrote:
Hi Thomas,
thanks for working on this functionality and raising this question.

As you know, I'm inclined toward the user.job xattr, but I think it is never a 
good idea to unilaterally make policy decisions in the kernel that cannot be 
changed.

As such, it probably makes sense to have a tunable parameter like 
"mdt.*.job_xattr=user.job" and then this could be changed in the future if 
there is some conflict (e.g. some site already uses the "user.job" xattr for 
some other purpose).

I don't think the job_xattr should allow totally arbitrary values (e.g. 
overwriting trusted.lov or trusted.lma or security.* would be bad). One option 
is to only allow a limited selection of valid xattr namespaces, and possibly 
names:

 *   NONE to turn this feature off
 *   user, or trusted or system (if admin wants to restrict the ability of 
regular users to change this value?), with ".job" added automatically
 *   user.* (or trusted.* or system.*) to also allow specifying the xattr name

If we allow the xattr name portion to be specified (which I'm not sure about, 
but putting it out for completeness), it should have some reasonable limits:

 *   <= 7 characters long to avoid wasting valuable xattr space in the inode
 *   should not conflict with other known xattrs, which is tricky if we allow 
the name to be arbitrary. Possibly if in trusted (and system?) it should only 
allow trusted.job to avoid future conflicts?
 *   maybe restrict it to contain "job" (or maybe "pbs", "slurm", ...) to 
reduce the chance of namespace clashes in user or system? However, I'm 
reluctant to restrict names in user since this shouldn't have any fatal side 
effects (e.g. data corruption like in trusted or system), and the admin is 
supposed to know what they are doing...

On May 4, 2023, at 15:53, Bertschinger, Thomas Andrew Hjorth via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org><mailto:lustre-discuss@lists.lustre.org>>
 wrote:

Hello Lustre Users,

There has been interest in a proposed feature 
https://jira.whamcloud.com/browse/LU-13031 to store the jobid with each Lustre 
file at create time, in an extended attribute. An open question is which xattr 
namespace to use: "user", the Lustre-specific namespace "lustre", "trusted", or 
even perhaps "system".

The correct namespace likely depends on how this xattr will be used. For 
example, will interoperability with other filesystems be important? Different 
namespaces have their own limitations so the correct choice depends on the use 
cases.

I'm looking for feedback on applications for this feature. If you have thoughts 
on how you could use this, please feel free to share them so that we design it 
in a way that meets your needs.

Thanks!

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] storing Lustre jobid in file xattrs: seeking feedback

2023-05-12 Thread Andreas Dilger via lustre-discuss
Hi Thomas,
thanks for working on this functionality and raising this question.

As you know, I'm inclined toward the user.job xattr, but I think it is never a 
good idea to unilaterally make policy decisions in the kernel that cannot be 
changed.

As such, it probably makes sense to have a tunable parameter like 
"mdt.*.job_xattr=user.job" and then this could be changed in the future if 
there is some conflict (e.g. some site already uses the "user.job" xattr for 
some other purpose).

I don't think the job_xattr should allow totally arbitrary values (e.g. 
overwriting trusted.lov or trusted.lma or security.* would be bad). One option 
is to only allow a limited selection of valid xattr namespaces, and possibly 
names:

  *   NONE to turn this feature off
  *   user, or trusted or system (if admin wants to restrict the ability of 
regular users to change this value?), with ".job" added automatically
  *   user.* (or trusted.* or system.*) to also allow specifying the xattr name

If we allow the xattr name portion to be specified (which I'm not sure about, 
but putting it out for completeness), it should have some reasonable limits:

  *   <= 7 characters long to avoid wasting valuable xattr space in the inode
  *   should not conflict with other known xattrs, which is tricky if we allow 
the name to be arbitrary. Possibly if in trusted (and system?) it should only 
allow trusted.job to avoid future conflicts?
  *   maybe restrict it to contain "job" (or maybe "pbs", "slurm", ...) to 
reduce the chance of namespace clashes in user or system? However, I'm 
reluctant to restrict names in user since this shouldn't have any fatal side 
effects (e.g. data corruption like in trusted or system), and the admin is 
supposed to know what they are doing...

On May 4, 2023, at 15:53, Bertschinger, Thomas Andrew Hjorth via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Hello Lustre Users,

There has been interest in a proposed feature 
https://jira.whamcloud.com/browse/LU-13031 to store the jobid with each Lustre 
file at create time, in an extended attribute. An open question is which xattr 
namespace to use: "user", the Lustre-specific namespace "lustre", "trusted", or 
even perhaps "system".

The correct namespace likely depends on how this xattr will be used. For 
example, will interoperability with other filesystems be important? Different 
namespaces have their own limitations so the correct choice depends on the use 
cases.

I'm looking for feedback on applications for this feature. If you have thoughts 
on how you could use this, please feel free to share them so that we design it 
in a way that meets your needs.

Thanks!

Tom Bertschinger
LANL
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Missing Files in /proc/fs/lustre after Upgrading to Lustre 2.15.X

2023-05-04 Thread Andreas Dilger via lustre-discuss


On May 4, 2023, at 16:43, Jane Liu via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:

Hi,

We previously had a monitoring tool in Lustre 2.12.X that relied on files 
located under /proc/fs/lustre for gathering metrics. However, after upgrading 
our system to version 2.15.2, we noticed that at least five files previously 
found under /proc/fs/lustre are no longer present. Here is a list of these 
files as an example:

/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/brw_stats
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/kbytestotal
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/kbytesfree
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/filestotal
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/filesfree

We have been unable to locate these files in the new version. We can still 
obtain size information using the following commands:

lctl get_param obdfilter.*.kbytestotal
lctl get_param obdfilter.*.kbytesfree
lctl get_param obdfilter.*.filestotal
lctl get_param obdfilter.*.filesfree

However, we are unsure how to access the information previously available in 
the brw_stats file. Any guidance or suggestions would be greatly appreciated.

You've already partially answered your own question - the parameters for "lctl 
get_param" are under "osd-ldiskfs.*.{brw_stats,kbytes*,files*}" and not 
"obdfilter.*.*", but they have (mostly) moved from /proc/fs/lustre/osd-ldiskfs 
to /sys/fs/lustre/osd-ldiskfs.  In the case of brw_stats they are under 
/sys/kernel/debug/lustre/osd-ldiskfs.

These stats actually moved from obdfilter to osd-ldiskfs back in Lustre 2.4 
when the ZFS backend was added, and a symlink has been kept until now for 
compatibility.  That means your monitoring tool should still work with any 
modern Lustre version if you change the path. The move of brw_stats to 
/sys/kernel/debug/lustre was mandated by the upstream kernel and only happened 
in 2.15.0.
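
For example, on 2.15 the same values can still be read by parameter name (lctl 
searches both the sysfs and debugfs locations), or directly from the new paths:

  lctl get_param osd-ldiskfs.*.kbytesfree osd-ldiskfs.*.filesfree
  lctl get_param osd-ldiskfs.*.brw_stats
  cat /sys/kernel/debug/lustre/osd-ldiskfs/fsname-OST0078/brw_stats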

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] question mark when listing file after the upgrade

2023-05-03 Thread Andreas Dilger via lustre-discuss
This looks like https://jira.whamcloud.com/browse/LU-16655 causing problems 
after the upgrade from 2.12.x to 2.15.[012] breaking the Object Index files.

A patch for this has already been landed to b2_15 and will be included in 
2.15.3. If you've hit this issue, then you need to backup/delete the OI files 
(off of Lustre) and run OI Scrub to rebuild them.

I believe the OI Scrub/rebuild is described in the Lustre Manual.
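
As a rough sketch, once the OI files have been backed up and removed, OI scrub 
can be started and monitored like this (the target name below assumes the MDT 
is experi01-MDT0000; see the manual and LU-16655 for the full recovery steps):

  mds# lctl lfsck_start -M experi01-MDT0000 -t scrub
  mds# lctl get_param osd-ldiskfs.experi01-MDT0000.oi_scrub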

Cheers, Andreas

On May 3, 2023, at 09:30, Colin Faber via lustre-discuss 
 wrote:


Hi,

What does your client log indicate? (dmesg / syslog)

On Wed, May 3, 2023, 7:32 AM Jane Liu via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:
Hello,

I'm writing to ask for your help on one issue we observed after a major
upgrade of a large Lustre system from RHEL7 + 2.12.9 to RHEL8 + 2.15.2.
Basically we preserved MDT disk (VDisk on a VM) and also all OST disk
(JBOD) in RHEL7 and then reinstalled RHEL8 OS and then attached those
preserved disks to RHEL8 OS. However, I met an issue after the OS
upgrade and lustre installation.

I believe the issue is related to metadata.

The old MDS was a virtual machine, and the MDT vdisk was preserved
during the upgrade. When a new VM was created with the same hostname and
IP, the preserved MDT vdisk was attached to it. Everything seemed fine
initially. However, after the client mount was completed, the file
listing displayed question marks, as shown below:

[root@experimds01 ~]# mount -t lustre 11.22.33.44@tcp:/experi01
/mntlustre/
[root@experimds01 ~]# cd /mntlustre/
[root@experimds01 mntlustre]# ls -l
ls: cannot access 'experipro': No such file or directory
ls: cannot access 'admin': No such file or directory
ls: cannot access 'test4': No such file or directory
ls: cannot access 'test3': No such file or directory
total 0
d? ? ? ? ?? admin
d? ? ? ? ?? experipro
-? ? ? ? ?? test3
-? ? ? ? ?? test4

I shut down the MDT and ran "e2fsck -p
/dev/mapper/experimds01-experimds01". It reported "primary superblock
features different from
  backup, check forced."

[root@experimds01 ~]# e2fsck -p /dev/mapper/experimds01-experimds01
experi01-MDT primary superblock features different from backup,
check forced.
experi01-MDT: 9493348/429444224 files (0.5% non-contiguous),
109369520/268428864 blocks

Running e2fsck again showed that the filesystem was clean.
[root@experimds01 /]# e2fsck -p /dev/mapper/experimds01-experimds01
experi01-MDT: clean, 9493378/429444224 files, 109369610/268428864
blocks

However, the issue persisted. The file listing continued to display
question marks.

Do you have any idea what could be causing this problem and how to fix
it? By the way, I have an e2image backup of the MDT from the
RHEL7 system just in case we need fix it using the backup. Also, after
the upgrade, the command "lfs df" shows that all OSTs and MDT
  are fine.

Thank you in advance for your assistance.

Best regards,
Jane
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Recovering MDT failure

2023-04-28 Thread Andreas Dilger via lustre-discuss
On Apr 27, 2023, at 02:12, Ramiro Alba Queipo 
mailto:ramiro.a...@upc.edu>> wrote:

Hi everybody,

I have Lustre 2.15.0 using Oracle on the servers and Ubuntu 20.04 on the 
clients. I have one MDT on a RAID-1 SSD pair and both disks have failed, so all 
the data is apparently lost.

- Is there any remote possibility of accessing the data on the OSTs without the MDT?
- When I started this system I tried to back up the MDT data without success. 
Is there a procedure to back up that data, and also to recover from it (which I 
did not manage to achieve)?

Any help/suggestion es very welcomed
Thanks in advance.

Regards

Depending on the file layout used, the files on the OSTs are "just files", if 
you can mount the OSTs as type ldiskfs (or ZFS if that is the way you 
configured it).  The default file layout is one OST stripe per file, so you 
could read those files, and the UID/GID/timestamps should be correct, but there 
will not be any filenames associated with the files.
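
A minimal sketch of pulling the object files off one OST this way (device and 
destination paths are placeholders):

  oss# mount -t ldiskfs -o ro /dev/ostdev /mnt/ost_ldiskfs
  oss# find /mnt/ost_ldiskfs/O -type f -size +0 \
         -exec cp -a --parents {} /recovery/area/ \;
  oss# umount /mnt/ost_ldiskfs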

Regards, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] Mounting lustre on block device

2023-04-05 Thread Andreas Dilger via lustre-discuss


On Mar 16, 2023, at 17:02, Jeff Johnson 
mailto:jeff.john...@aeoncomputing.com>> wrote:

If you *really* want a block device on a client that resides in Lustre you 
*could* create a file in Lustre and then make that file a loopback device with 
losetup. Of course, your mileage will vary *a lot* based on use case, access, 
underlying LFS configuration.

dd if=/dev/zero of=/my_lustre_mountpoint/some_subdir/big_raw_file bs=1048576 
count=10
losetup -f /my_lustre_mountpoint/some_subdir/big_raw_file
*assuming loop0 is created*
some_fun_command /dev/loop0

Note with ldiskfs backends you can use "fallocate -l 10M 
/my_lustre_mountpoint/some_subdir/big_raw_file" to reserve the space.

Alternatively, if you have flash-based OSTs you could truncate a sparse file to 
the full size ("truncate -s 10M ...") and format that, which will not consume 
as much space but will generate more random allocation on the OSTs.

Disclaimer: Just because you *can* do this, doesn't necessarily mean it is a 
good idea

We saw a good performance boost with ext4 images on Lustre holding many small 
files (CCI).  Also, I recall some customers in the past using ext2 or ext4 
images effectively to aggregate many small files for read-only use on compute 
nodes.
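
A minimal sketch of that read-only image pattern (paths and the size are 
placeholders):

  client# truncate -s 100G /mnt/lustre/images/dataset.img
  client# mkfs.ext4 -F /mnt/lustre/images/dataset.img
  client# mount -o loop /mnt/lustre/images/dataset.img /mnt/dataset
  # ... populate /mnt/dataset with the small files, then umount ...
  compute# mount -o loop,ro /mnt/lustre/images/dataset.img /mnt/dataset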

Cheers, Andreas


On Thu, Mar 16, 2023 at 3:29 PM Mohr, Rick via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>> wrote:
Are you asking if you can mount Lustre on a client so that it shows up as a 
block device?  If so, the answer to that is you can't.  Lustre does not appear 
as a block device to the clients.

-Rick



On 3/16/23, 3:44 PM, "lustre-discuss on behalf of Shambhu Raje via 
lustre-discuss" <lustre-discuss@lists.lustre.org> wrote:


When we mount a Lustre file system on a client, it does not use a block device 
on the client side; instead it uses a virtual file system namespace. The mount 
point is not shown by 'lsblk', only by 'df -hT'.


How can we mount a Lustre file system so that it appears as a block device, so 
that whatever we write through Lustre shows up on a block device?
Can you share the command?











___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Joining files

2023-03-30 Thread Andreas Dilger via lustre-discuss
Based on your use case, I don't think file join will be a suitable solution. 
There is a limit on the number of files that can be joined (about 2000) and 
this would make for an unusual file format (something like a tar file, but 
would need special tools to access). It would also be very Lustre-specific. 

Instead, my recommendation would be to use an ext4 filesystem image to hold the 
many small files (during create, if from a single client, or aggregated after 
they are created).  Later, this filesystem image could be mounted read-only on 
multiple clients for access. Also, the whole image file can be archived to tape 
efficiently (taking all small files with it, instead of keeping a stub in 
Lustre for each file).

The use of loopback mounting image files from Lustre already works today, but 
needs userspace help to create and mount/unmount them. There was some proposal 
"Client Container Image (CCI)" on how this could be integrated directly into 
Lustre.  Please see my LUG presentation for details (maybe 2019 or so?)

Cheers, Andreas

> On Mar 30, 2023, at 00:47, Sven Willner  wrote:
> 
> Dear Patrick and Anders,
> 
> Thank you very much for your quick and comprehensive replies.
> 
> My motivation behind this issues is the following:
> At my institute (research around a large earth system/climate model) we are 
> evaluating using zarr (https://zarr.readthedocs.io) for outputing large 
> multi-dimensional arrays. This currently results in a huge number of small 
> files as the responsibility of parallel writing is fully shifted to the file 
> system. However, after closing the respective datasets we could merge those 
> files again to reduce the metadata burden onto the file system and for easier 
> archival if needed at a later point. Ideally without copying the large amount 
> of data again. For read access I would simply create an appropriate 
> index/lookup table for the resulting large file - hence holes/gaps in the 
> file are not a problem as such.
> 
> As Patrick writes
>> Layout: 1 1 1 1 1 1 1 ... 20 MiB 2 2 2 2 2 2  35 MiB
>> 
>> With data from 0-10 MiB and 20 - 30 MiB.
> that would be the resulting layout (I guess, minimizing holes could be 
> achieved by appropriate striping of the original files and/or a layout 
> adjustment during the merge, if possible).
> 
>> My expectation is that "join" of two files would be handled at the file EOF 
>> and *not* at the layout boundary.  Based on the original description from 
>> Sven, I'd think that small gaps in the file (e.g. 4KB for page alignment, 
>> 64KB for minimum layout alignment, or 1MB for stripe alignment) would be OK, 
>> but tens or hundreds of MB holes would be inefficient for processing.
> (Andreas)
> 
> Apart from archival, the resulting file would only be accessed locally in the 
> boundaries of the orginial smaller files, so I would expect the performance 
> costs of the gaps to be not that critical.
> 
>> while I think it is possible to implement this in Lustre, I'd have to ask 
>> what requirements are driving your request?  Is this just something you want 
>> to test, or is there some real-world usage demand for this (e.g. specific 
>> application workload, usage in some popular library, etc)?
> (Andreas)
> 
> At this stage I am just looking into possibilites to handle this situation - 
> I am neither an expert in zarr nor in Lustre.
> 
> If such a merge on the file system level turns out to be route worth taking, 
> I would be happy to work on an implementation. However, yes, I would need 
> some guidance there. Also, at this point I cannot estimate the amount of work 
> needed even to test this approach.
> 
> Would the necessary layout manipulation be possible in userspace? (I will 
> have a look into the implementations of `lfs migrate` and `lfs mirror 
> extend`).
> 
> Thanks a lot!
> Best,
> Sven
> 
> On Wed, Mar 29, 2023 at 07:41:56PM +, Andreas Dilger wrote:
>> Patrick,
>> once upon a time there was "file join" functionality in Lustre that was 
>> ancient and complex, and was finally removed in 2009.  There are still a few 
>> remnants of this like "MDS_OPEN_JOIN_FILE" and "LOV_MAGIC_JOIN_V1" defined, 
>> but unused.   That functionality long predated composite file layouts (PFL, 
>> FLR), and used an external llog file *per file* to declare a series of other 
>> files that described the layout.  It was extremely fragile and complex and 
>> thankfully never got into widespread usage.
>> 
>> I think with the advent of composite file layout that it should be 
>> _possible_ to implement this kind of functionality purely with layout 
>> changes, similar to "lfs migrate" doing layout swap, or "lfs mirror extend" 
>> merging the layout of a victim file into another file to create a mirror.
>> 
>> My expectation is that "join" of two files would be handled at the file EOF 
>> and *not* at the layout boundary.  Based on the original description from 
>> Sven, I'd think that small gaps in the file (e.g. 4KB for page alignment, 
>> 64KB for minimum layout alignment, or 1MB for stripe alignment) would be OK, 
>> but tens or hundreds of MB holes would be inefficient for processing.

Re: [lustre-discuss] Joining files

2023-03-29 Thread Andreas Dilger via lustre-discuss
Patrick,
once upon a time there was "file join" functionality in Lustre that was ancient 
and complex, and was finally removed in 2009.  There are still a few remnants 
of this like "MDS_OPEN_JOIN_FILE" and "LOV_MAGIC_JOIN_V1" defined, but unused.  
 That functionality long predated composite file layouts (PFL, FLR), and used 
an external llog file *per file* to declare a series of other files that 
described the layout.  It was extremely fragile and complex and thankfully 
never got into widespread usage.

I think with the advent of composite file layout that it should be _possible_ 
to implement this kind of functionality purely with layout changes, similar to 
"lfs migrate" doing layout swap, or "lfs mirror extend" merging the layout of a 
victim file into another file to create a mirror.

My expectation is that "join" of two files would be handled at the file EOF and 
*not* at the layout boundary.  Based on the original description from Sven, I'd 
think that small gaps in the file (e.g. 4KB for page alignment, 64KB for 
minimum layout alignment, or 1MB for stripe alignment) would be OK, but tens or 
hundreds of MB holes would be inefficient for processing.

My guess, based on similar requests I've seen previously, and Sven's email 
address, is that this relates to merging video streams from different files 
into a single file?

Sven,
while I think it is possible to implement this in Lustre, I'd have to ask what 
requirements are driving your request?  Is this just something you want to 
test, or is there some real-world usage demand for this (e.g. specific 
application workload, usage in some popular library, etc)?

It seems possible to do this with layout manipulation similar to "lfs mirror 
extend -f" (i.e. a kind of "super file append" mechanism) but would be 
similarly destructive to the "victim" files appended to the original one, and 
would definitely not be something that could be done while the "original" file 
was actively in use.  Essentially, instead of "lfs mirror extend" just 
appending the victim layout to the existing file, it would need to also modify 
the original layout to truncate the layout at EOF, then offset the extent 
ranges in the victim layout by the current file size (rounded up to at least 
64KB multiples, but preferably 1MB multiples to maintain RAID alignment).
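
For reference, the existing merge mechanism that this would build on looks 
roughly like the following sketch (file names are placeholders; --no-verify is 
needed here because the two files do not hold the same data, and the victim 
file is consumed):

  # merge victim_file's layout into original_file as an additional mirror
  lfs mirror extend --no-verify -N -f victim_file original_file

  # inspect the resulting composite layout
  lfs getstripe -v original_file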

Is this something that you would be willing to work on with guidance for the 
implementation details, or a feature request that you hope someone else will 
implement?

Cheers, Andreas

On Mar 29, 2023, at 07:41, Patrick Farrell via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Sven,

The "combining layouts without any data movement" part isn't currently 
possible.  It's probably possible in theory, but it's never been implemented.  
(I'm curious what your use case is?)

Even allowing for data movement, there's no tool to do this for you.  Depending 
what you mean by combining, it's possible to do this with Linux tools (see the 
end of my note), but you're going to have data copying.

It's a bit of an odd requirement, with some inherent questions.  For example, 
file layouts generally go to infinity, because if they don't, you will get IO 
errors when you 'run off the end', i.e. go past the defined layout, so the last 
component is usually defined to go to infinity.

That poses obvious questions when combining files.

If you're looking to combine files with layouts that do not go to infinity, 
then it's at least straightforward to see how you'd concatenate them.  But 
presumably the data in each file doesn't go to the very end of the layout?  So 
do you want the empty parts of the layout included?

Say file 1 is 10 MiB in size but the layout goes to 20 MiB (again, layouts 
normally should go to infinity) and file 2 is also 10 MiB in size but the 
layout goes to, say, 15 MiB.  Should the result look like this?

Layout: 1 1 1 1 1 1 1 ... 20 MiB 2 2 2 2 2 2  35 MiB

With data from 0-10 MiB and 20 - 30 MiB.

That's something you'd have to write a tool for, so it could write the data at 
your specified offset for putting in the second file (and third, etc...).  You 
could also do something like:

"lfs setstripe [your layout] combined_file; cat file1 > combined_file; truncate 
combined_file to 20 MiB (the end of the file1 layout); cat file2 >> 
combined_file", etc.  (Note the append with ">>" for the second and later 
files, so their data lands after the hole instead of overwriting the start of 
the file.)

So, you definitely can't avoid data copying here.  But that's how you could do 
it with simple Linux tools (which you could probably have drawn up yourself :)).
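
Spelled out with placeholder names (file1's layout is assumed to end at 20 MiB; 
this is a sketch, not a tested recipe):

  # create the combined file with whatever composite layout you want
  lfs setstripe -E 20M -c 1 -E -1 -c 4 combined_file

  # copy file1's data (0 - 10 MiB), then pad the file out to the end of
  # file1's layout region so that the next write starts at 20 MiB
  cat file1 > combined_file
  truncate -s 20M combined_file

  # append file2's data, which now lands at offset 20 MiB
  cat file2 >> combined_file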

-Patrick


From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Sven Willner <sven.will...@mpimet.mpg.de>
Sent: Wednesday, March 29, 2023 7:58 AM
To: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Joining files


Re: [lustre-discuss] About Lustre small files performance (8k) improve

2023-03-27 Thread Andreas Dilger via lustre-discuss
Are your performance tests on NFS or on native Lustre clients?  Native Lustre 
clients will likely be faster, and with many clients they can create files in 
parallel, even in the same directory.  With a single NFS server they will be 
limited by the VFS locking for a single directory.

Are you using IB or TCP networking?  IB will be faster for low-latency requests.

Are you using the Data-on-MDT feature?  This can reduce overhead for very small 
files.
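
If not, a sketch of a DoM-enabled default layout for a small-file directory 
(assuming Lustre 2.11+, enough free MDT space, and an example path):

  # keep the first 64 KiB of each new file on the MDT, with anything larger
  # spilling over to a single OST stripe
  lfs setstripe -E 64K -L mdt -E -1 -c 1 /mnt/lfs/smallfiles

  # confirm the default layout that new files will inherit
  lfs getstripe -d /mnt/lfs/smallfiles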

Are you using NVMe storage or e.g. SATA SSDs?  Based on the OST size it looks 
like flash of some kind, unless you are using single-HDD OSTs?

Cheers, Andreas

On Mar 18, 2023, at 01:44, 王烁斌 via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Hi all,

This is my Lustre FS.
UUID   1K-blocksUsed   Available Use% Mounted on
ltfs-MDT0000_UUID  307826072   36904   281574768   1% /mnt/lfs[MDT:0]
ltfs-MDT0001_UUID  307826072   36452   281575220   1% /mnt/lfs[MDT:1]
ltfs-MDT0002_UUID  307826072   36600   281575072   1% /mnt/lfs[MDT:2]
ltfs-MDT0003_UUID  307826072   36300   281575372   1% /mnt/lfs[MDT:3]
ltfs-OST0000_UUID  15962575136 1027740 15156068868   1% /mnt/lfs[OST:0]
ltfs-OST0001_UUID  15962575136 1027780 15156067516   1% /mnt/lfs[OST:1]
ltfs-OST0002_UUID  15962575136 1027772 15156074212   1% /mnt/lfs[OST:2]
ltfs-OST0003_UUID  15962575136 1027756 15156067860   1% /mnt/lfs[OST:3]
ltfs-OST0004_UUID  15962575136 1027728 15156058224   1% /mnt/lfs[OST:4]
ltfs-OST0005_UUID  15962575136 1027772 15156057668   1% /mnt/lfs[OST:5]
ltfs-OST0006_UUID  15962575136 1027768 15156058568   1% /mnt/lfs[OST:6]
ltfs-OST0007_UUID  15962575136 1027792 15156056752   1% /mnt/lfs[OST:7]

filesystem_summary:  127700601088 8222108 121248509668   1% /mnt/lfs

Structure ias flow:


After testing, under the current structure, the write performance of 500,000 
"8k" small files is:
NFSclient1: IOPS 28,000; bandwidth 230 MB/s
NFSclient2: IOPS 27,500; bandwidth 220 MB/s

Now I want to improve the performance of small files to a better level. May I 
ask if there is a better way?

I have noticed a feature called "MIP-IO" that can improve small file 
performance, but I don't know how to deploy this feature. Is there any way to 
improve small file performance?



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] DNE v3 and directory inode changing

2023-03-24 Thread Andreas Dilger via lustre-discuss
On Mar 24, 2023, at 13:20, Bertschinger, Thomas Andrew Hjorth 
<bertschin...@lanl.gov> wrote:

Thanks, this is helpful. We certainly don't need the auto-split feature and 
were just experimenting with it, so this should be fine for us. And we have 
been satisfied with the round robin directory creation so far. Just out of 
curiosity, is the auto-split feature still being actively worked on and 
expected to be complete/production-ready within some defined period of time?

Nobody is currently working on directory split.  There are a number of other 
DNE optimizations that are underway (mostly internal code, locking, and 
recovery improvements without much "visible" to the outside world).
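
In the meantime, directories that are known to grow very large can still be 
striped across MDTs by hand, e.g. (a sketch with made-up paths):

  # create a new directory striped across 2 MDTs
  lfs mkdir -c 2 /mnt/lustre/projects/bigdir

  # show which MDT(s) a directory is spread over
  lfs getdirstripe /mnt/lustre/projects/bigdir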

The main feature in progress on the metadata side is the metadata writeback 
cache (WBC) that will greatly improve single-client workloads in a single 
directory tree (e.g. untar and then build/process files in that tree).   That 
should help significantly with genomics and machine learning workloads that 
have this kind of usage pattern.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

