Re: [lustre-discuss] Poor(?) Lustre performance

2022-04-20 Thread Finn Rawles Malliagh via lustre-discuss
Andreas,

Thank you again for your detailed reply and time, I will have a look
further at the lustre IO kit and hopefully, get to the bottom of things.

Cheers,
Finn

On Thu, 21 Apr 2022 at 00:52, Andreas Dilger wrote:

> Finn,
> I can't really say for sure where the performance limitation in your
> system is coming from.
>
>
> You'd have to re-run the tests against the local ldiskfs filesystem to see
> how the performance compares with that of Lustre.  The important part of
> benchmark testing is to systematically build a complete picture from the
> ground up to see what the capabilities of the various components of the
> storage stack are, and then determine where any bottlenecks are being hit.
>
> That is what the "lustre-iokit" is intended to do - benchmark starting on
> the raw storage (sgpdd-survey), on the local disk filesystem
> (obdfilter-survey for local OSDs), then the network (lnet-selftest), and
> finally on the client (obdfilter-survey for network OSDs).
>
> For example, run sgpdd-survey (or "fio") with small and large IO sizes
> against the storage devices, individually *AND IN PARALLEL*, to determine
> their performance characteristics.  Running in parallel is critical,
> since you may see e.g. 3GB/s reads, 2GB/s writes from a single NVMe device,
> but *not* see 4x that performance when running on 4x NVMe devices because
> of CPU and/or PCI and/or memory bandwidth limitations.  Similarly, you may
> see reasonable per-OSS performance from a single OSS, but network
> congestion (on the client, switch(es), or server) may prevent the
> performance from scaling as more servers are added.
>
> This is described in some detail at
> https://github.com/DDNStorage/lustre_manual_markdown/blob/master/04.02-Benchmarking%20Lustre%20File%20System%20Performance%20(Lustre%20IO%20Kit).md
>
> Cheers, Andreas
>
> On Apr 20, 2022, at 12:03, Finn Rawles Malliagh wrote:
>
> Hi Andreas,
>
> Thank you for taking the time to reply with such a detailed response.
> I have taken your advice on board and made some changes. Firstly, I have
> swapped from ZFS and am now using striped LVM groups (Including the P4800X
> instead of using it as a cache drive). I have also modified io500.sh to
> include the optimisation listed above. Rerunning the IO500 benchmark
> provides the metadata results below:
>
> With ZFS
> [RESULT]    mdtest-easy-write     0.931693 kIOPS : time 31.028 seconds [INVALID]
> [RESULT]    mdtest-hard-write     0.427000 kIOPS : time 31.070 seconds [INVALID]
> [RESULT]    find                 25.311534 kIOPS : time 1.631 seconds
> [RESULT]    mdtest-easy-stat      0.570021 kIOPS : time 50.067 seconds
> [RESULT]    mdtest-hard-stat      1.834985 kIOPS : time 7.998 seconds
> [RESULT]    mdtest-easy-delete    1.715750 kIOPS : time 17.308 seconds
> [RESULT]    mdtest-hard-read      1.006240 kIOPS : time 13.759 seconds
> [RESULT]    mdtest-hard-delete    1.624117 kIOPS : time 8.910 seconds
> [SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]
>
> With LVM:
> [RESULT]    mdtest-easy-write     3.057249 kIOPS : time 27.177 seconds [INVALID]
> [RESULT]    mdtest-hard-write     1.576865 kIOPS : time 51.740 seconds [INVALID]
> [RESULT]    find                 71.979457 kIOPS : time 2.234 seconds
> [RESULT]    mdtest-easy-stat      1.841655 kIOPS : time 44.443 seconds
> [RESULT]    mdtest-hard-stat      1.779211 kIOPS : time 45.967 seconds
> [RESULT]    mdtest-easy-delete    1.559825 kIOPS : time 52.301 seconds
> [RESULT]    mdtest-hard-read      0.631109 kIOPS : time 127.765 seconds
> [RESULT]    mdtest-hard-delete    0.856858 kIOPS : time 94.372 seconds
> [SCORE ] Bandwidth 0.948100 GiB/s : IOPS 2.359024 kiops : TOTAL 1.495524 [INVALID]
>
> I believe these scores are more in line with what I should expect;
> however, it seems that my throughput performance is still lacking(?). In
> your expert opinion, do you think this would be just a case of tuning
> IO500/LVM parameters further, or something more fundamental about the
> configuration of this Lustre cluster?
>
> With LVM
> [RESULT]    ior-easy-write    2.127026 GiB/s : time 122.305 seconds [INVALID]
> [RESULT]    ior-hard-write    1.408638 GiB/s : time 1.246 seconds [INVALID]
> [RESULT]    ior-easy-read     1.549550 GiB/s : time 167.881 seconds
> [RESULT]    ior-hard-read     0.174036 GiB/s : time 10.063 seconds
>
>
> Kind Regards,
> Finn
>
> On Wed, 20 Apr 2022 at 09:24, Andreas Dilger wrote:
>
>> On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss <
>> lustre-discuss@lists.lustre.org> wrote:
>>
>>
>> Hi all,
>>
>> I have just set up a three-node Lustre configuration, and initial testing
>> shows what I think are slow results. The current configuration is 2 OSS, 1
>> MDS-MGS; each OSS/MGS has 4x Intel P3600, 1x Intel P4800, Intel E810 100Gbe
>> eth, 2x 6252, 380GB dram
>> I am using Lustre 2.12.8, ZFS 0.7.13, ice-1

Re: [lustre-discuss] Poor(?) Lustre performance

2022-04-20 Thread Andreas Dilger via lustre-discuss
Finn,
I can't really say for sure where the performance limitation in your system is 
coming from.


You'd have to re-run the tests against the local ldiskfs filesystem to see how 
the performance compares with that of Lustre.  The important part of 
benchmark testing is to systematically build a complete picture from the ground 
up to see what the capabilities of the various components of the storage stack 
are, and then determine where any bottlenecks are being hit.

That is what the "lustre-iokit" is intended to do - benchmark starting on the 
raw storage (sgpdd-survey), on the local disk filesystem (obdfilter-survey for 
local OSDs), then the network (lnet-selftest), and finally on the client 
(obdfilter-survey for network OSDs).
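
To make that concrete, the local-disk and network steps might be driven
something like this (a rough sketch only; the obdfilter-survey variables and
lst commands follow the lustre-iokit documentation as I recall them, and the
NIDs and sizes are placeholders to adjust for your setup):

  # Local OSD backend, run on an OSS (case=disk exercises the OST's backing
  # filesystem directly, without the network):
  nobjhi=2 thrhi=32 size=1024 case=disk sh obdfilter-survey

  # Network only, between one client and one server, using lnet-selftest:
  modprobe lnet_selftest
  export LST_SESSION=$$
  lst new_session rw_test
  lst add_group servers 192.168.1.10@o2ib    # server NID (placeholder)
  lst add_group clients 192.168.1.20@o2ib    # client NID (placeholder)
  lst add_batch bulk
  lst add_test --batch bulk --from clients --to servers brw write size=1M
  lst run bulk
  lst stat clients servers                   # sample for a while, then Ctrl-C
  lst end_session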

For example, run sgpdd-survey (or "fio") with small and large IO sizes
against the storage devices, individually *AND IN PARALLEL*, to determine their
performance characteristics.  Running in parallel is critical, since you
may see e.g. 3GB/s reads, 2GB/s writes from a single NVMe device, but *not* see 
4x that performance when running on 4x NVMe devices because of CPU and/or PCI 
and/or memory bandwidth limitations.  Similarly, you may see reasonable per-OSS 
performance from a single OSS, but network congestion (on the client, 
switch(es), or server) may prevent the performance from scaling as more servers 
are added.
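
As a hedged illustration of that point with fio (device names are taken from
the zpool output quoted below, the workload values are arbitrary, and writing
to the raw devices is destructive, so only do this before formatting):

  # One device, large sequential writes:
  fio --direct=1 --ioengine=libaio --rw=write --bs=1M --iodepth=32 \
      --runtime=60 --time_based --name=nvme1 --filename=/dev/nvme1n1

  # Same workload on all data devices at once; options before the first
  # --name are global, and each --name/--filename pair is a parallel job:
  fio --direct=1 --ioengine=libaio --rw=write --bs=1M --iodepth=32 \
      --runtime=60 --time_based --group_reporting \
      --name=nvme1 --filename=/dev/nvme1n1 \
      --name=nvme2 --filename=/dev/nvme2n1 \
      --name=nvme3 --filename=/dev/nvme3n1
  # Repeat with --rw=randread --bs=4k --iodepth=64 for the small-IO case.

If the aggregate of the parallel run is well below N times the single-device
number, the bottleneck is above the drives (CPU, PCIe, memory), not the drives
themselves.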

This is described in some detail at 
https://github.com/DDNStorage/lustre_manual_markdown/blob/master/04.02-Benchmarking%20Lustre%20File%20System%20Performance%20(Lustre%20IO%20Kit).md

Cheers, Andreas

On Apr 20, 2022, at 12:03, Finn Rawles Malliagh <up883...@myport.ac.uk> wrote:

Hi Andreas,

Thank you for taking the time to reply with such a detailed response.
I have taken your advice on board and made some changes. Firstly, I have 
swapped from ZFS and am now using striped LVM groups (Including the P4800X 
instead of using it as a cache drive). I have also modified io500.sh to include 
the optimisation listed above. Rerunning the IO500 benchmark provides the 
metadata results below:

With ZFS
[RESULT]    mdtest-easy-write     0.931693 kIOPS : time 31.028 seconds [INVALID]
[RESULT]    mdtest-hard-write     0.427000 kIOPS : time 31.070 seconds [INVALID]
[RESULT]    find                 25.311534 kIOPS : time 1.631 seconds
[RESULT]    mdtest-easy-stat      0.570021 kIOPS : time 50.067 seconds
[RESULT]    mdtest-hard-stat      1.834985 kIOPS : time 7.998 seconds
[RESULT]    mdtest-easy-delete    1.715750 kIOPS : time 17.308 seconds
[RESULT]    mdtest-hard-read      1.006240 kIOPS : time 13.759 seconds
[RESULT]    mdtest-hard-delete    1.624117 kIOPS : time 8.910 seconds
[SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]

With LVM:
[RESULT]    mdtest-easy-write     3.057249 kIOPS : time 27.177 seconds [INVALID]
[RESULT]    mdtest-hard-write     1.576865 kIOPS : time 51.740 seconds [INVALID]
[RESULT]    find                 71.979457 kIOPS : time 2.234 seconds
[RESULT]    mdtest-easy-stat      1.841655 kIOPS : time 44.443 seconds
[RESULT]    mdtest-hard-stat      1.779211 kIOPS : time 45.967 seconds
[RESULT]    mdtest-easy-delete    1.559825 kIOPS : time 52.301 seconds
[RESULT]    mdtest-hard-read      0.631109 kIOPS : time 127.765 seconds
[RESULT]    mdtest-hard-delete    0.856858 kIOPS : time 94.372 seconds
[SCORE ] Bandwidth 0.948100 GiB/s : IOPS 2.359024 kiops : TOTAL 1.495524 [INVALID]

I believe these scores are more in line with what I should expect; however, it
seems that my throughput performance is still lacking(?). In your expert
opinion, do you think this would be just a case of tuning IO500/LVM parameters
further, or something more fundamental about the configuration of this Lustre
cluster?

With LVM
[RESULT]    ior-easy-write    2.127026 GiB/s : time 122.305 seconds [INVALID]
[RESULT]    ior-hard-write    1.408638 GiB/s : time 1.246 seconds [INVALID]
[RESULT]    ior-easy-read     1.549550 GiB/s : time 167.881 seconds
[RESULT]    ior-hard-read     0.174036 GiB/s : time 10.063 seconds


Kind Regards,
Finn

On Wed, 20 Apr 2022 at 09:24, Andreas Dilger <adil...@whamcloud.com> wrote:
On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hi all,

I have just set up a three-node Lustre configuration, and initial testing shows 
what I think are slow results. The current configuration is 2 OSS, 1 MDS-MGS; 
each OSS/MGS has 4x Intel P3600, 1x Intel P4800, Intel E810 100Gbe eth, 2x 
6252, 380GB dram
I am using Lustre 2.12.8, ZFS 0.7.13, ice-1.8.3, rdma-core-35.0 (RoCEv2 is 
enabled)
All zpools are set up identically for OST1, OST2, and MDT1

[root@stor3 ~]# zpool status
  pool: osstank
 state: ONLINE
  scan: none requested
config:
NAMESTATE READ WRITE CKSUM
osstank ONLINE   0 0 0
  nvme1n1   ONLINE 

[lustre-discuss] Resolving a stuck OI scrub thread

2022-04-20 Thread William D. Colburn
Back in March I wrote here about our corrupt file system
(http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-March/018007.html).
We are still trying to fix it.  Since that time we have found all (we
think) of the files that hang when they are accessed, and unlinked them
from the filesystem.  We ran an lfsck, which seemed to do a lot for half
a day but has now gone quiet again without actually stopping.

[root@aocmds ~]# lctl lfsck_query | grep -v ': 0$'
layout_mdts_scanning-phase1: 1
layout_osts_scanning-phase2: 48
layout_repaired: 609777
namespace_mdts_scanning-phase1: 1
namespace_repaired: 9
[root@aocmds ~]# 
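
For a more detailed view than lfsck_query, the per-component status files can
show whether the scanning threads are actually advancing (a sketch; the
parameter paths below are the usual ones, so verify them with "lctl list_param"
on your version):

  # On the MDS: layout/namespace LFSCK status (phase, current position,
  # repaired counts, time since the last checkpoint)
  lctl get_param -n mdd.*.lfsck_layout
  lctl get_param -n mdd.*.lfsck_namespace

  # On the affected OSS: OI scrub status for the ldiskfs backend
  lctl get_param -n osd-ldiskfs.*.oi_scrub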

We are still getting a lot of errors on the OSS that has the corrupt
filesystem.  Most of the OSTs have 20 or fewer destroys_in_flight, but
the corrupt one has 1899729.  It also produces a lot of syslog messages.
Most look like this:

Apr 20 12:20:16 aocoss04 kernel: Lustre: 
64377:0:(osd_scrub.c:186:osd_scrub_refresh_mapping()) aoclst03-OST000e: fail to 
refresh OI map for scrub op 2 [0x1:0x1233bbf:0x0] => 
1750088/1273942946: rc = -17

But those look like a symptom to me, not a cause.  I think the cause is
buried in these messages:

Apr 20 12:20:56 aocoss04 kernel: Lustre: 
64701:0:(osd_scrub.c:767:osd_scrub_post()) sdd: OI scrub post, result = 1
Apr 20 12:20:56 aocoss04 kernel: Lustre: 
64701:0:(osd_scrub.c:1551:osd_scrub_main()) sdd: OI scrub: stop, pos = 
45780993: rc = 1
Apr 20 12:20:56 aocoss04 kernel: Lustre: 
64908:0:(osd_scrub.c:669:osd_scrub_prep()) sdd: OI scrub prep, flags = 0x4e
Apr 20 12:20:56 aocoss04 kernel: Lustre: 
64908:0:(osd_scrub.c:279:osd_scrub_file_reset()) sdd: reset OI scrub file, old 
flags = 0x0, add flags = 0x0
Apr 20 12:20:56 aocoss04 kernel: Lustre: 
64908:0:(osd_scrub.c:1541:osd_scrub_main()) sdd: OI scrub start, flags = 0x4e, 
pos = 12

Digging into the source code, it looks like osd_scrub_post() is
discovering that a thread exists for doing the scrub, and so it aborts.
Each time a file on that OST is removed it looks like the
destroys_in_flight is incremented.  My best guess is that the thread is
hung.  I want to try stopping the lfsck (which doesn't seem to have
done anything in a little over twelve hours), then reboot the OSS to
clear that kernel thread, then restart the lfsck to try again.

lustre-2.10.8/lustre/osd-ldiskfs/osd_scrub.c
   if (!scrub->os_full_speed && !scrub->os_partial_scan) {
struct l_wait_info lwi = { 0 };
struct osd_otable_it *it = dev->od_otable_it;
struct osd_otable_cache *ooc = &it->ooi_cache;

l_wait_event(thread->t_ctl_waitq,
 it->ooi_user_ready || !thread_is_running(thread),
 &lwi);
if (unlikely(!thread_is_running(thread)))
GOTO(post, rc = 0);


One problem we did have that we don't want to repeat was that the layout
part of lfsck was chowning files yesterday and we had a lot of cluster
jobs fail because they started to get permission denied on their files.
The logs say that files were chowned from root to the user, which sounds
like user jobs should have been failing before the lfsck and working
after, but the errors happened during the part of the run when these logs
were being generated.

Apr 19 14:35:07 aocmds kernel: Lustre: 
126605:0:(lfsck_layout.c:3906:lfsck_layout_repair_owner()) 
aoclst03-MDT-osd: layout LFSCK assistant repaired inconsistent file owner 
for: parent [0x2b669:0xc03f:0x0], child [0x10004:0x3220d02:0x0], 
OST-index 4, stripe-index 0, old owner 0/0, new owner 5916/335: rc = 1

Does anyone have any advice about a) stopping the lfsck, rebooting the OSS,
and restarting the lfsck to try to clear the hung thread and start
processing the destroys_in_flight, and b) whether
lfsck_layout_repair_owner() is likely to run again, or have we probably
resolved those issues?
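
For concreteness, the stop/reboot/restart sequence described above would look
roughly like the following (a sketch only; the lfsck_start/lfsck_stop options
and parameter names should be checked against the 2.10 lctl man pages before
running anything):

  # On the MDS: stop the running LFSCK (-A should also stop it on the OSTs)
  lctl lfsck_stop -M aoclst03-MDT0000 -A

  # Watch the backlog on the MDS before and after the OSS reboot
  lctl get_param osp.*.destroys_in_flight

  # Reboot the affected OSS to clear the stuck OI scrub thread, remount its
  # OSTs, then restart LFSCK from the MDS:
  lctl lfsck_start -M aoclst03-MDT0000 -t layout,namespace -A -r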

--Schlake
  Sysadmin IV, NRAO
  Work: 575-835-7281 (BACK IN THE OFFICE!)
  Cell: 575-517-5668 (out of work hours)


Re: [lustre-discuss] Poor(?) Lustre performance

2022-04-20 Thread Finn Rawles Malliagh via lustre-discuss
Hi Andreas,

Thank you for taking the time to reply with such a detailed response.
I have taken your advice on board and made some changes. Firstly, I have
swapped from ZFS and am now using striped LVM groups (Including the P4800X
instead of using it as a cache drive). I have also modified io500.sh to
include the optimisation listed above. Rerunning the IO500 benchmark
provides the metadata results below:

With ZFS
[RESULT]    mdtest-easy-write     0.931693 kIOPS : time 31.028 seconds [INVALID]
[RESULT]    mdtest-hard-write     0.427000 kIOPS : time 31.070 seconds [INVALID]
[RESULT]    find                 25.311534 kIOPS : time 1.631 seconds
[RESULT]    mdtest-easy-stat      0.570021 kIOPS : time 50.067 seconds
[RESULT]    mdtest-hard-stat      1.834985 kIOPS : time 7.998 seconds
[RESULT]    mdtest-easy-delete    1.715750 kIOPS : time 17.308 seconds
[RESULT]    mdtest-hard-read      1.006240 kIOPS : time 13.759 seconds
[RESULT]    mdtest-hard-delete    1.624117 kIOPS : time 8.910 seconds
[SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]

With LVM:
[RESULT]    mdtest-easy-write     3.057249 kIOPS : time 27.177 seconds [INVALID]
[RESULT]    mdtest-hard-write     1.576865 kIOPS : time 51.740 seconds [INVALID]
[RESULT]    find                 71.979457 kIOPS : time 2.234 seconds
[RESULT]    mdtest-easy-stat      1.841655 kIOPS : time 44.443 seconds
[RESULT]    mdtest-hard-stat      1.779211 kIOPS : time 45.967 seconds
[RESULT]    mdtest-easy-delete    1.559825 kIOPS : time 52.301 seconds
[RESULT]    mdtest-hard-read      0.631109 kIOPS : time 127.765 seconds
[RESULT]    mdtest-hard-delete    0.856858 kIOPS : time 94.372 seconds
[SCORE ] Bandwidth 0.948100 GiB/s : IOPS 2.359024 kiops : TOTAL 1.495524 [INVALID]

I believe these scores are more in line with what I should expect; however,
it seems that my throughput performance is still lacking(?). In your expert
opinion, do you think this would be just a case of tuning IO500/LVM
parameters further, or something more fundamental about the configuration of
this Lustre cluster?

With LVM
[RESULT]    ior-easy-write    2.127026 GiB/s : time 122.305 seconds [INVALID]
[RESULT]    ior-hard-write    1.408638 GiB/s : time 1.246 seconds [INVALID]
[RESULT]    ior-easy-read     1.549550 GiB/s : time 167.881 seconds
[RESULT]    ior-hard-read     0.174036 GiB/s : time 10.063 seconds
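
A quick pair of checks that may help localise this (a sketch, with the usual
parameter and field names assumed): confirm that the logical volumes really
are striped across all of the NVMe devices, and look at the I/O sizes the OSTs
actually see during an ior run.

  # On each OSS: stripe count and stripe size of the backing LV
  lvs -o +stripes,stripe_size

  # On each OSS, after an ior-easy/ior-hard run: histogram of I/O sizes
  # reaching the OSTs; lots of small I/Os here would point at the client
  # or RPC settings rather than the LVM layer
  lctl get_param obdfilter.*.brw_stats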


Kind Regards,
Finn

On Wed, 20 Apr 2022 at 09:24, Andreas Dilger wrote:

> On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss <
> lustre-discuss@lists.lustre.org> wrote:
>
>
> Hi all,
>
> I have just set up a three-node Lustre configuration, and initial testing
> shows what I think are slow results. The current configuration is 2 OSS, 1
> MDS-MGS; each OSS/MGS has 4x Intel P3600, 1x Intel P4800, Intel E810 100Gbe
> eth, 2x 6252, 380GB dram
> I am using Lustre 2.12.8, ZFS 0.7.13, ice-1.8.3, rdma-core-35.0 (RoCEv2 is
> enabled)
> All zpools are set up identically for OST1, OST2, and MDT1
>
> [root@stor3 ~]# zpool status
>   pool: osstank
>  state: ONLINE
>   scan: none requested
> config:
> NAMESTATE READ WRITE CKSUM
> osstank ONLINE   0 0 0
>   nvme1n1   ONLINE   0 0 0
>   nvme2n1   ONLINE   0 0 0
>   nvme3n1   ONLINE   0 0 0
> cache
>   nvme0n1   ONLINE   0 0 0
>
>
> It's been a while since I've done anything with ZFS, but I see a few
> potential issues here:
> - firstly, it doesn't make sense IMHO to have an NVMe cache device when
> the main storage
>   pool is also NVMe.  You could better use that capacity/bandwidth for
> storing more data
>   instead of duplicating it into the cache device.  Also, Lustre cannot
> use the ZIL.
> - in general ZFS is not very good at IOPS workloads because of the high
> overhead per block.
>   Lustre can't use the ZIL, so no opportunity to accelerate heavy IOPS
> workloads.
>
> When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get
> these performance numbers:
> IO500 version io500-isc22_v1 (standard)
> [RESULT]    ior-easy-write    1.173435 GiB/s : time 31.703 seconds [INVALID]
> [RESULT]    ior-hard-write    0.821624 GiB/s : time 1.070 seconds [INVALID]
> [RESULT]    ior-easy-read     5.177930 GiB/s : time 7.187 seconds
> [RESULT]    ior-hard-read     5.331791 GiB/s : time 0.167 seconds
>
>
> When running "./io500 ./config-minimalLOCAL.ini" on a singular locally
> mounted ZFS pool I get the following performance numbers:
> IO500 version io500-isc22_v1 (standard)
> [RESULT]    ior-easy-write    1.304500 GiB/s : time 33.302 seconds [INVALID]
> [RESULT]    ior-hard-write    0.485283 GiB/s : time 1.806 seconds [INVALID]
> [RESULT]    ior-easy-read     3.078668 GiB/s : time 14.111 seconds
> [RESULT]    ior-hard-read     3.183

Re: [lustre-discuss] Poor(?) Lustre performance

2022-04-20 Thread Andreas Dilger via lustre-discuss
On Apr 16, 2022, at 22:51, Finn Rawles Malliagh via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hi all,

I have just set up a three-node Lustre configuration, and initial testing shows 
what I think are slow results. The current configuration is 2 OSS, 1 MDS-MGS; 
each OSS/MGS has 4x Intel P3600, 1x Intel P4800, Intel E810 100Gbe eth, 2x 
6252, 380GB dram
I am using Lustre 2.12.8, ZFS 0.7.13, ice-1.8.3, rdma-core-35.0 (RoCEv2 is 
enabled)
All zpools are set up identically for OST1, OST2, and MDT1

[root@stor3 ~]# zpool status
  pool: osstank
 state: ONLINE
  scan: none requested
config:
NAMESTATE READ WRITE CKSUM
osstank ONLINE   0 0 0
  nvme1n1   ONLINE   0 0 0
  nvme2n1   ONLINE   0 0 0
  nvme3n1   ONLINE   0 0 0
cache
  nvme0n1   ONLINE   0 0 0

It's been a while since I've done anything with ZFS, but I see a few potential 
issues here:
- firstly, it doesn't make sense IMHO to have an NVMe cache device when the 
main storage
  pool is also NVMe.  You could better use that capacity/bandwidth for storing 
more data
  instead of duplicating it into the cache device.  Also, Lustre cannot use the 
ZIL.
- in general ZFS is not very good at IOPS workloads because of the high 
overhead per block.
  Lustre can't use the ZIL, so no opportunity to accelerate heavy IOPS 
workloads.
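
If you did want to fold that cache device back into the data pool, it would be
roughly the following (pool and device names taken from the zpool output above;
try it on a scratch pool first):

  zpool remove osstank nvme0n1    # detach the L2ARC cache device
  zpool add osstank nvme0n1       # re-add it as a normal top-level data vdev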

When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get 
these performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]    ior-easy-write    1.173435 GiB/s : time 31.703 seconds [INVALID]
[RESULT]    ior-hard-write    0.821624 GiB/s : time 1.070 seconds [INVALID]
[RESULT]    ior-easy-read     5.177930 GiB/s : time 7.187 seconds
[RESULT]    ior-hard-read     5.331791 GiB/s : time 0.167 seconds

When running "./io500 ./config-minimalLOCAL.ini" on a singular locally mounted 
ZFS pool I get the following performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]    ior-easy-write    1.304500 GiB/s : time 33.302 seconds [INVALID]
[RESULT]    ior-hard-write    0.485283 GiB/s : time 1.806 seconds [INVALID]
[RESULT]    ior-easy-read     3.078668 GiB/s : time 14.111 seconds
[RESULT]    ior-hard-read     3.183521 GiB/s : time 0.275 seconds

There are definitely some file layout tunables that can improve IO500 
performance for these workloads.
See the default io500.sh file, where they are commented out by default:

  # Example commands to create output directories for Lustre.  Creating
  # top-level directories is allowed, but not the whole directory tree.
  #if (( $(lfs df $workdir | grep -c MDT) > 1 )); then
  #  lfs setdirstripe -D -c -1 $workdir
  #fi
  #lfs setstripe -c 1 $workdir
  #mkdir $workdir/ior-easy $workdir/ior-hard
  #mkdir $workdir/mdtest-easy $workdir/mdtest-hard
  #local osts=$(lfs df $workdir | grep -c OST)
  # Try overstriping for ior-hard to improve scaling, or use wide striping
  #lfs setstripe -C $((osts * 4)) $workdir/ior-hard ||
  #  lfs setstripe -c -1 $workdir/ior-hard
  # Try to use DoM if available, otherwise use default for small files
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-easy || true #DoM?
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-hard || true #DoM?
  #lfs setstripe -E 64k -L mdt $workdir/mdtest-rnd
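
If you do uncomment those, it is worth confirming afterwards that the layouts
actually applied, for example (with $workdir as used in io500.sh):

  lfs getstripe -d $workdir/ior-hard      # directory default layout only
  lfs getstripe -d $workdir/mdtest-easy
  lfs getdirstripe $workdir               # DNE directory striping, if any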


As you can see above, the IO performance of Lustre isn't really much different
than the local storage performance of ZFS.  You are always going to lose some
percentage to the network and the added distributed locking.  That said, for
the hardware that you have, it should be getting about 2-3GB/s per NVMe device,
and up to 10GB/s over the network, so the limitation here is really ZFS.
It would be useful to test with ldiskfs on the same hardware, maybe with LVM
aggregating the NVMes.
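
A rough sketch of what that could look like on one OSS (the volume names,
stripe geometry, fsname, index, and MGS NID below are all placeholders, not a
tested recipe):

  # Stripe the data NVMe devices into one LV for the OST:
  pvcreate /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  vgcreate vg_ost0 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  lvcreate -i 3 -I 256k -l 100%FREE -n ost0 vg_ost0

  # Format and mount it as an ldiskfs OST:
  mkfs.lustre --ost --backfstype=ldiskfs --fsname=testfs --index=0 \
      --mgsnode=192.168.1.10@o2ib /dev/vg_ost0/ost0
  mkdir -p /mnt/ost0 && mount -t lustre /dev/vg_ost0/ost0 /mnt/ost0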

When running "./io500 ./config-minimalLUST.ini" on my lustre client, I get 
these performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]    mdtest-easy-write     0.931693 kIOPS : time 31.028 seconds [INVALID]
[RESULT]    mdtest-hard-write     0.427000 kIOPS : time 31.070 seconds [INVALID]
[RESULT]    find                 25.311534 kIOPS : time 1.631 seconds
[RESULT]    mdtest-easy-stat      0.570021 kIOPS : time 50.067 seconds
[RESULT]    mdtest-hard-stat      1.834985 kIOPS : time 7.998 seconds
[RESULT]    mdtest-easy-delete    1.715750 kIOPS : time 17.308 seconds
[RESULT]    mdtest-hard-read      1.006240 kIOPS : time 13.759 seconds
[RESULT]    mdtest-hard-delete    1.624117 kIOPS : time 8.910 seconds
[SCORE ] Bandwidth 2.271383 GiB/s : IOPS 1.526825 kiops : TOTAL 1.862258 [INVALID]

When running "./io500 ./config-minimalLOCAL.ini" on a singular locally mounted 
ZFS pool I get the following performance numbers:
IO500 version io500-isc22_v1 (standard)
[RESULT]    mdtest-easy-write    47.979181 kIOPS : time 1.838 sec