My apologies – I posted some bad info.  While we started out with HDDs in 
the MDS, we switched to SSDs pretty early on.  So that's not the source of our 
MD slowness.  Can you do NVMe in an external JBOD?

From: Andreas Dilger <adil...@whamcloud.com>
Date: Tuesday, January 5, 2021 at 11:51 AM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" 
<darby.vicke...@nasa.gov>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [EXTERNAL] Re: [lustre-discuss] Tuning for metadata performance

Probably the best single thing you could do for metadata performance
would be to switch to SSD, or better NVMe, storage.  ZFS is very sync
and IOPS hungry, so using HDDs is killer for ZFS metadata performance.

If you want to minimize the downtime, you could incrementally replace the
HDDs in the zpool with larger SSD devices and resilver between each
one.  I recall LLNL doing this in the first months of their first ZFS-based
Lustre filesystem for this reason.
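
As a rough sketch of the incremental replacement described above, the Python
snippet below only builds the command strings for swapping one device at a
time; it does not run anything. The pool name "mdtpool" and the device names
are hypothetical placeholders, not taken from this thread. Each resilver has
to be confirmed complete (for example by reading the `zpool status` output)
before the next replace is started.

```python
def plan_incremental_replace(pool, old_devs, new_devs):
    """Return the ordered shell commands for a one-at-a-time device swap.

    Nothing is executed here. An operator would run each step by hand and
    wait for the resilver to finish before starting the next replace.
    """
    if len(old_devs) != len(new_devs):
        raise ValueError("need exactly one new device per old device")
    steps = []
    for old, new in zip(old_devs, new_devs):
        # Swap a single HDD for its SSD counterpart; this starts a resilver.
        steps.append(f"zpool replace {pool} {old} {new}")
        # Re-run this until the pool reports the resilver as finished.
        steps.append(f"zpool status {pool}")
    return steps


if __name__ == "__main__":
    # Hypothetical pool and device names, for illustration only.
    for cmd in plan_incremental_replace("mdtpool", ["sda", "sdb"], ["sdm", "sdn"]):
        print(cmd)
```

Generating the plan as plain strings keeps the sketch safe to run anywhere
and makes the one-device-per-resilver ordering explicit.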

Going to NVMe-based devices is even better for IOPS/bandwidth, but
can't be done completely live.  You could potentially use repeated zfs
send/recv to get an almost up-to-date copy on a new MDS, then take a small
outage to do the final resync. However, I've also seen reports that send/recv 
is painfully slow with HDD MDTs so you should probably test that before 
committing to a solution.
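
The repeated send/recv approach above could be sketched as follows. This is
only an illustration under stated assumptions: the dataset names, the target
host name and the "syncN" snapshot naming are hypothetical, and the function
just builds command strings rather than executing them. The first round is a
full send, and each later round sends only the changes since the previous
snapshot.

```python
def plan_send_recv(src, dst, host, rounds):
    """Build commands for one full send followed by incremental catch-ups.

    src and dst are dataset names, host is the new MDS. All names are
    placeholders chosen for this example.
    """
    if rounds < 1:
        raise ValueError("need at least one round")
    cmds = []
    for i in range(rounds):
        snap = f"{src}@sync{i}"
        cmds.append(f"zfs snapshot {snap}")
        if i == 0:
            # First pass: full stream of the initial snapshot.
            send = f"zfs send {snap}"
        else:
            # Later passes: only the delta since the previous snapshot.
            send = f"zfs send -i {src}@sync{i - 1} {snap}"
        cmds.append(f"{send} | ssh {host} zfs recv -F {dst}")
    return cmds


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in plan_send_recv("mdtpool/mdt0", "newpool/mdt0", "newmds", 3):
        print(cmd)
```

In this sketch the last round would be the one taken during the short
outage, with the MDT stopped so that no further changes arrive after the
final snapshot. Timing the early rounds on a test copy would show whether
send/recv is fast enough on the existing HDD MDT.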

Cheers, Andreas


On Jan 5, 2021, at 08:47, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] 
<darby.vicke...@nasa.gov> wrote:
Hello,

I'm looking for some advice on tuning our existing lustre file system to 
achieve better metadata performance.  This file system is getting fairly old – 
it's been in production for almost 4 years now.  The hardware and our existing 
tuning efforts can be found here.

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-April/014390.html

The hardware is the same but we have upgraded the software stack a few times – 
now on CentOS 7.6, ZFS 0.7.9 and lustre 2.10.8.  We do plan to upgrade to the 
latest CentOS 7.x and either lustre 2.12 or 2.13 soon.  The MDS hardware isn't 
well-described in that thread so here are more details:

Chassis: Supermicro 2U Twin Server
Processor: 4 x Quad-Core Xeon Processor E5-2637 v2 3.50GHz (2 sockets/8 cores 
per node)
Memory: 16 x 16GB PC3-14900 1866MHz DDR3 ECC Registered DIMM (128GB per node)

External JBOD:
Chassis: 24x Hot-Swap 2.5" SAS - 12Gb/s SAS Dual Expander
Drives: 12 x 600GB SAS 3.0 12.0Gb/s 15000RPM - 2.5" - Seagate Enterprise 
Performance 15K HDD (512n)
Controller Card: LSI SAS 9300-8e SAS 12Gb/s PCIe 3.0 8-Port Host Bus Adapter

The above hardware and tuning served us well for a long time but the lab has 
grown, both in number of lustre clients (now up to ~200 ethernet clients and 
~500 IB clients) and the number of users in the lab.  With the extra users have 
come different types of workloads.  Previously, the file system was mostly used 
for workloads with a fairly small number of large files.  We now see workloads 
that include hundreds of concurrent processes all doing mixed small and large file 
IO on a lot of files (e.g. each process clones a repo, compiles a code and runs 
a serial sim that writes a lot of data).

I recently ran the io500 tests and our LFS stats for MDEasy and MDHard are 
pretty bad, even when compared to the lowest MD stats on the current io500 
list.  Our standard NFS server handily beats our LFS wrt MD performance.  So 
I'm hopeful that we can squeeze more MD performance out of our LFS.  Obviously, 
software tuning on the existing hardware would be preferred but we are open to 
hardware additions/upgrades if that would help (e.g. adding more MDS's).  There 
are a lot of tuning options in both ZFS and lustre so I'm hoping someone can 
point me in the right direction.  Are DNE and/or DoM expected to help?  I 
attended the SC20 Lustre BoF and it sounds like 2.13 has some metadata 
performance improvements, so just an upgrade might help.  We have dual MDS's 
now but for HA, not performance.  I'd hate to lose the HA aspect as we utilize 
it for failover quite a bit (maintenance, etc.) but it would probably be worth 
it if MD performance was significantly improved.  If I understand correctly, 
there is some overhead with DNE and performance suffers with just two MDS's 
with a benefit with 4 or more MDS's, correct?  So that wouldn't be a good 
option for us unless we add MDS's?  Would an upgrade to SSD or NVMe in our MDTs 
help?

I would greatly appreciate thoughts on the best path forward for making 
improvements.

Thanks,
Darby
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
