Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-05 Thread Cameron Harr via lustre-discuss
This doesn't answer your question about ldiskfs on zvols, but we've been 
running MDTs on ZFS on NVMe in production for a couple of years (and on SAS 
SSDs for many years prior). Our current production MDTs using NVMe consist 
of one zpool per node made up of 3x 2-drive mirrors, but we've lately been 
experimenting with raidz3, and possibly even raidz2, for MDTs, since SSDs 
have been pretty reliable for us.
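
For anyone trying to picture that layout, a minimal sketch follows; the device
names, pool/dataset names, fsname, index and MGS NID are placeholders rather
than our actual production settings:

# one zpool per MDS node, built from three 2-way NVMe mirrors
zpool create -o ashift=12 mdt0pool \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1 \
    mirror /dev/nvme4n1 /dev/nvme5n1

# raidz3 alternative of the kind we have been experimenting with
# zpool create -o ashift=12 mdt0pool raidz3 /dev/nvme[0-6]n1

# format the MDT on top of the pool
mkfs.lustre --mdt --backfstype=zfs --fsname=testfs --index=0 \
    --mgsnode=<mgs-nid> mdt0pool/mdt0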


Cameron


Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-05 Thread Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss
We are in the process of retiring two long-standing LFSs (about 8 years old), 
which we built and managed ourselves.  Both use ZFS and have the MDTs on SSDs 
in a JBOD that requires the kind of software-based management you describe, in 
our case ZFS pools built on multipath devices.  The MDT in one is ZFS; the 
MDT in the other LFS is ldiskfs but uses ZFS and a zvol as you describe - we 
build the ldiskfs MDT on top of the zvol.

Generally, this has worked well for us, with one big caveat.  If you look for 
my posts to this list and the ZFS list you'll find more details.  The short 
version is that we use ZFS snapshots and clones to do backups of the metadata.  
We've run into situations where the backup process stalls, leaving a clone 
hanging around.  A couple of times the clone and the primary zvol got swapped, 
effectively rolling back our metadata to the point when the clone was created.  
I have tried, unsuccessfully, to recreate that in a test environment.  So if 
you do that kind of setup, make sure you have good monitoring in place to 
detect if your backups/clones stall.

We've kept up with Lustre and ZFS updates over the years and are currently on 
Lustre 2.14 and ZFS 2.1.  We've seen the gap between our ZFS MDT and ldiskfs 
performance shrink to the point where they are pretty much on par with each 
other now.  I think our ZFS MDT performance could be better with more hardware 
and software tuning, but our small team hasn't had the bandwidth to tackle that.
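
For anyone considering the same scheme, the backup flow is roughly the
following (pool, zvol and path names are hypothetical, and this is only a
sketch of the idea, not our actual scripts); the last command is the kind of
stale-clone check worth alerting on:

# snapshot the zvol backing the ldiskfs MDT, then clone the snapshot
zfs snapshot mdtpool/mdtvol@backup-20240105
zfs clone mdtpool/mdtvol@backup-20240105 mdtpool/mdtvol-backup

# copy the clone off at the device level (or mount it as ldiskfs and do a
# file-level backup that preserves extended attributes)
dd if=/dev/zvol/mdtpool/mdtvol-backup of=/backups/mdt-20240105.img bs=1M

# clean up; if the backup stalls, these never run and the clone lingers
zfs destroy mdtpool/mdtvol-backup
zfs destroy mdtpool/mdtvol@backup-20240105

# monitoring: any dataset with a non-empty origin is a clone
zfs list -H -o name,origin | awk '$2 != "-" {print $1}'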

Our newest LFS is vendor provided and uses NVMe MDTs.  I'm not at liberty to 
talk about the proprietary way those devices are managed.  However, the 
metadata performance is SO much better than on our older LFSs, for a lot of 
reasons, and I'd highly recommend NVMe for your MDTs.




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] MDS hardware - NVME?

2024-01-05 Thread Thomas Roth via lustre-discuss

Dear all,

we are considering NVMe storage for the next MDS.

As I understand it, NVMe disks are bundled in software rather than by a hardware 
RAID controller.
This would be done using Linux software RAID (mdadm), correct?
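
A minimal sketch of the mdadm route, with hypothetical device names and
placeholders for the Lustre-specific options:

mdadm --create /dev/md0 --level=10 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.lustre --mdt --fsname=<fsname> --index=0 --mgsnode=<mgs-nid> /dev/md0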

We have some experience with ZFS, which we use on our OSTs.
But I would like to stick with ldiskfs for the MDTs, and a zpool with a zvol on 
top, which is then formatted with ldiskfs, seems like too much voodoo...
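
For comparison, the zpool/zvol variant would look roughly like this, again
with hypothetical names and sizes:

zpool create mdtpool mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1
zfs create -V 12T mdtpool/mdtvol
mkfs.lustre --mdt --backfstype=ldiskfs --fsname=<fsname> --index=0 \
    --mgsnode=<mgs-nid> /dev/zvol/mdtpool/mdtvol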

How is this handled elsewhere? Any experiences?


The available devices are quite large. If I create a RAID-10 out of 4 disks of, 
e.g., 7 TB each, my MDT will be 14 TB - already close to the 16 TB limit. 
So no need for a box with lots of U.3 slots.


But for MDS operations we will still need a powerful dual-CPU system with lots 
of RAM.
Should the NVMe devices then be distributed between the two CPUs?
Is there a way to specify this in a call for tenders?
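
A quick way to check how the NVMe slots map to CPU sockets on an evaluation
system (the sysfs layout can vary slightly between kernels):

# NUMA node of each NVMe controller (-1 means no locality information)
for c in /sys/class/nvme/nvme*; do
    echo "$(basename $c): $(cat $c/device/numa_node)"
done
numactl --hardware    # overall socket/memory topology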


Best regards,
Thomas


Thomas Roth

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Volkmar Dietz


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] adding a new OST to live system

2024-01-05 Thread Thomas Roth via lustre-discuss

Not a problem at all.

Perhaps if you manage to mount your new OST for the first time just when your MGS/MDT and your network are completely overloaded and almost 
unresponsive, then, perhaps, there might be issues ;-)


Afterwards the new OST, being empty, will attract most of the newly created files. That can result in an imbalance - old, cold data vs. 
new, hot data. In our case, we migrate some of the old data around, so that the fill levels of the OSTs become roughly equal.
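
A rough sketch of that kind of rebalancing, with a placeholder mount point and
OST name (lfs_migrate rewrites the files it touches, so it is best run during
quiet periods):

lfs df -h /lustre     # check OST fill levels
lfs find /lustre --ost <old-OST-uuid-or-index> --size +1G -mtime +180 | lfs_migrate -y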


Regards,
Thomas

On 12/1/23 19:18, Lana Deere via lustre-discuss wrote:

I'm looking at the manual, section 14.8, Adding a New OST to a Lustre File
System, and it looks straightforward.  It isn't clear to me, however,
whether it is OK to do this while the rest of the Lustre system is
live.  Is it OK to add a new OST while the system is in use?  Or do I
need to arrange downtime for the system to do this?

Thanks.

.. Lana (lana.de...@gmail.com)

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Building lustre on rocky 8.8 fails?

2024-01-05 Thread Jan Andersen

Hi Xinliang and Andreas,

Thanks for helping with this!

I tried out your suggestions and it compiled fine; however, things had become 
quite messy on the server, so I decided to reinstall Rocky 8.8 and start over. 
Again, Lustre built successfully, but for some reason the kernel source package 
you download from the repository is a slightly different version from the 
running kernel, with the result that Lustre places its modules in a place the 
running kernel can't find.

So I built the kernel from source with the correct version, rebooted, cloned 
Lustre again, ran ./configure etc., and now:

[root@mds lustre-release]# make
make  all-recursive
make[1]: Entering directory '/root/lustre-release'
Making all in ldiskfs
make[2]: Entering directory '/root/lustre-release/ldiskfs'
make[2]: *** No rule to make target 
'../ldiskfs/kernel_patches/series/ldiskfs-', needed by 'sources'.  Stop.
make[2]: Leaving directory '/root/lustre-release/ldiskfs'
make[1]: *** [autoMakefile:680: all-recursive] Error 1
make[1]: Leaving directory '/root/lustre-release'
make: *** [autoMakefile:546: all] Error 2

Which I don't quite understand, because I still have all the necessary packages 
and tools from before.

Before I barge ahead and try yet another permutation, do you have any advice on 
how I might avoid problems? I can reinstall the OS, which would be a bit of a 
pain but not that bad - but then which version: 8.8, 8.9, or even 9? Or is there 
something simpler I can do to avoid all that?
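
One thing worth trying before reinstalling, on the assumption that the empty
series name ('ldiskfs-') means configure did not recognize the kernel it was
pointed at, is a clean reconfigure against the rebuilt kernel tree with the
source path given explicitly; the path below is only a placeholder:

# start from a pristine tree so stale configure results cannot linger
git clean -dfx
sh autogen.sh
# point configure at the exact kernel source tree that was built
./configure --with-linux=/usr/src/kernels/<running-kernel-version>
make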

/jan

On 03/01/2024 02:17, Xinliang Liu wrote:

On Wed, 3 Jan 2024 at 10:08, Xinliang Liu <xinliang@linaro.org> wrote:

Hi Jan,

On Tue, 2 Jan 2024 at 22:29, Jan Andersen <j...@comind.io> wrote:

I have installed Rocky 8.8 on a new server (Dell PowerEdge R640):

[root@mds 4.18.0-513.9.1.el8_9.x86_64]# cat /etc/*release*
Rocky Linux release 8.8 (Green Obsidian)
NAME="Rocky Linux"
VERSION="8.8 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.8 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/ "
BUG_REPORT_URL="https://bugs.rockylinux.org/ 
"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.8"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
Rocky Linux release 8.8 (Green Obsidian)
Rocky Linux release 8.8 (Green Obsidian)
Derived from Red Hat Enterprise Linux 8.8
Rocky Linux release 8.8 (Green Obsidian)
cpe:/o:rocky:rocky:8:GA

I downloaded the kernel source (I don't remember the exact command):

[root@mds 4.18.0-513.9.1.el8_9.x86_64]# ll /usr/src/kernels
total 8
drwxr-xr-x. 24 root root 4096 Jan  2 13:49 4.18.0-513.9.1.el8_9.x86_64/
drwxr-xr-x. 23 root root 4096 Jan  2 11:41 
4.18.0-513.9.1.el8_9.x86_64+debug/

Copied the config from /boot and ran:

yes "" | make oldconfig

After that I cloned the Lustre source and configured (according to my 
notes):

git clone git://git.whamcloud.com/fs/lustre-release.git 

cd lustre-release
git checkout 2.15.3

dnf install libtool
dnf install flex
dnf install bison
dnf install openmpi-devel
dnf install python3-devel
dnf install python3
dnf install kernel-devel kernel-headers
dnf install elfutils-libelf-devel
dnf install keyutils keyutils-libs-devel
dnf install libmount
dnf --enablerepo=powertools install libmount-devel
dnf install libnl3 libnl3-devel
dnf config-manager --set-enabled powertools
dnf install libyaml-devel
dnf install patch
dnf install e2fsprogs-devel
dnf install kernel-core
dnf install kernel-modules
dnf install rpm-build
dnf config-manager --enable devel
dnf config-manager --enable powertools
dnf config-manager --set-enabled ha
dnf install kernel-debuginfo

sh autogen.sh
./configure

This appeared to finish without errors:

...
config.status: executing libtool commands

CC:            gcc
LD:            /usr/bin/ld -m elf_x86_64
CPPFLAGS:      -include /root/lustre-release/undef.h -include 
/root/lustre-release/config.h -I/root/lustre-release/lnet/include/uapi 
-I/root/lustre-release/lustre/include/uapi 
-I/root/lustre-release/libcfs/include -I/root/lustre-release/lnet/utils/ 
-I/root/lustre-release/lustre/include
CFLAGS:        -g -O2 -Wall -Werror
EXTRA_KCFLAGS: -include /root/lustre-rele