Re: [lustre-discuss] How to activate an OST on a client ?

2024-08-28 Thread Cameron Harr via lustre-discuss
There's also an "lctl --device <index> activate" that I've used in the 
past, though I don't know what conditions need to be met for it to work.
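
For reference, a minimal sketch of that sequence on a client, assuming the inactive 
OST still shows up as an osc device in "lctl dl" (the device index and target name 
below are made up):

  # find the local device index of the OSC for the inactive OST
  lctl dl | grep osc
  # try to reactivate it, or trigger a reconnect
  lctl --device 12 activate
  lctl --device 12 recover
  # check the import state afterwards
  lctl get_param osc.*OST0007*.state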


On 8/27/24 07:46, Andreas Dilger via lustre-discuss wrote:

Hi Jan,
There is "lctl --device  recover" that will trigger a reconnect to 
the named OST device (per "lctl dl" output), but not sure if that will 
help.



Cheers, Andreas

On Aug 22, 2024, at 06:36, Haarst, Jan van via lustre-discuss wrote:




Hi,

The subject line probably doesn't quite cover the issue; what we see is this:


We have a client behind a router (linking tcp to Omnipath) that shows 
an inactive OST (all on 2.15.5).


Other clients that go through the router do not have this issue.

One client had the same issue, although it showed a different OST as 
inactive.


After a reboot, all was well again on that machine.

The clients can lctl ping the OSSs.

So although we have a workaround (reboot the client), it would be 
nice to:


 1. Fix the issue without a reboot
 2. Fix the underlying issue.

It might be unrelated, but we also see another routing issue every 
now and then:


The router stops routing requests toward a certain OSS, and this can 
be fixed by deleting the peer_nid of the OSS from the router.
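
For what it's worth, a sketch of that workaround on the router, assuming lnetctl is 
in use and the OSS NID below is a placeholder:

  # list the peer entries the router currently holds
  lnetctl peer show
  # drop the stale peer entry for the OSS so it gets rediscovered
  lnetctl peer del --prim_nid 10.1.1.5@o2ib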


I am probably missing informative logs, but I'm more than happy to try 
to generate them if somebody can point me to how.
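
As a starting point for generating those logs, a sketch (the debug flags and file 
path are only suggestions):

  # on the client and the router, widen the debug mask to include LNet/RPC traffic
  lctl set_param debug="+neterror +net +rpctrace"
  # reproduce the problem, then dump and clear the kernel debug buffer
  lctl dk /tmp/lustre-debug.$(hostname).log
  # LNet-level view on the router
  lnetctl peer show -v
  lnetctl stats show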


We are a bit stumped right now.

With kind regards,

--

Jan van Haarst

HPC Administrator

For Anunna/HPC questions, please use https://support.wur.nl (with HPC as service)


Available: Monday, Tuesday, Thursday & Friday

Facilities Department, part of Wageningen University & Research

Information Technology Department

Postbus 59, 6700 AB, Wageningen

Gebouw 116, Akkermaalsbos 12, 6700 WB, Wageningen

http://www.wur.nl/nl/Disclaimer.htm 



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ldlm.lock_limit_mb sizing

2024-07-25 Thread Cameron Harr via lustre-discuss

Sorry for the late reply.

The way I was looking at the information was through `llstat -i5 
ldlm.services.ldlm_canceld.stats`


I don't have a copy of the data in the state it was in while it was 
"overwhelmed", but on multiple MDT nodes (that were pinned at near 100% 
CPU for hours with I/O stopped), I could see req_waittime (shown in 
usec) exceed 6000 *seconds* and req_qdepth (what I called locks in the 
request queue) around the aforementioned 500K. If we restarted Lustre on 
one of these MDS nodes, the req_qdepth would steadily climb back up, 
along with the CPU load, and those symptoms would follow to the peer 
server if we failed over the MDT.


What is the correct parameter to modify requests in flight from the MDT 
to the OSTs? I didn't find one that looked appropriate.
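
In case it helps, the knob Oleg seems to be describing presumably lives on the MDS 
under the osp devices; a sketch (parameter names as on 2.15, the value is purely 
illustrative):

  # current MDT->OST limits
  lctl get_param osp.*.max_rpcs_in_flight
  lctl get_param osp.*.max_create_count
  # lower the number of in-flight requests from this MDT to its OSTs
  lctl set_param osp.*.max_rpcs_in_flight=8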


Thanks,
Cameron

On 7/19/24 1:32 PM, Kulyavtsev, Alex Ivanovich wrote:

Oleg, Cameron,
how do we look at the counts / list of queued (ungranted) lock requests, and the 
request wait time?

Could you please point to the parameter names to check first for troubleshooting 
and monitoring?
I’m looking at the parameters below but am not sure about their meaning or entry format.

ldlm.lock_granted_count

ldlm.services.ldlm_canceld.req_history
ldlm.services.ldlm_canceld.stats
ldlm.services.ldlm_canceld.timeouts

ldlm.services.ldlm_cbd.req_history
ldlm.services.ldlm_cbd.stats
ldlm.services.ldlm_cbd.timeouts

mdt.*.exports.*.ldlm_stats
obdfilter.*.exports.*.ldlm_stats

Anything to look at under `ldlm.namespaces`?
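
In case it's useful, a sketch of a few read-outs to start with (parameter names 
assumed from 2.15-era servers):

  # total granted locks on the server
  lctl get_param ldlm.lock_granted_count
  # per-namespace lock counts and pool state
  lctl get_param ldlm.namespaces.*.lock_count
  lctl get_param ldlm.namespaces.*.pool.granted
  # cancel/callback service load sampled every 5 seconds
  llstat -i 5 ldlm.services.ldlm_canceld.stats
  llstat -i 5 ldlm.services.ldlm_cbd.stats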

Best regards, Alex.


On Jul 17, 2024, at 20:56, Oleg Drokin via lustre-discuss wrote:

On Wed, 2024-07-17 at 12:58 -0700, Cameron Harr via lustre-discuss
wrote:

In 2017, Oleg gave a talk at ORNL's Lustre conference about LDLM,
including references to ldlm.lock_limit_mb and
ldlm.lock_reclaim_threshold_mb.


https://lustre.ornl.gov/ecosystem-2017/documents/Day-2_Tutorial-4_Drokin.pdf


The apparent defaults back then in Lustre 2.8 for those two parameters
were 30MB and 20MB, respectively.  On my 2.15 servers with 256GB and no
changes from us, I'm seeing numbers of 77244MB and 51496MB,
respectively. We recently got ourselves into a situation where a subset
of MDTs appeared to be entirely overwhelmed trying to cancel locks, with
~500K locks in the request queue but a request wait time of 6000
seconds. So, we're looking at potentially limiting the locks on the
servers.

What's the formula for appropriately sizing ldlm.lock_limit_mb and
ldlm.lock_reclaim_threshold_mb in 2.15 (I don't think node memory
amounts have increased 2X in 7 years)?

What do you mean by the "locks in the request queue"? If you mean your
server has got that many ungranted locks, there's nothing you can
really do here - that's how many outstanding client requests you've
got.

Sure, you can turn clients away, but it would probably be more productive
to make sure your cancels are quicker?

I think I've seen cases recently with servers gummed up with requests
for creations being stuck waiting on OSTs to create more objects, while
holding various DLM locks (= other threads that wanted to access those
directories got stuck too), while the OSTs were getting super slow because of
an influx of (pretty expensive) destroy requests to delete objects from
unlinked files.
In the end, dropping the requests in flight from MDTs to OSTs helped much
more, by making sure the OSTs were doing their creates faster so the MDTs
were blocking much less.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ldlm.lock_limit_mb sizing

2024-07-17 Thread Cameron Harr via lustre-discuss
In 2017, Oleg gave a talk at ORNL's Lustre conference about LDLM, 
including references to ldlm.lock_limit_mb and 
ldlm.lock_reclaim_threshold_mb. 
https://lustre.ornl.gov/ecosystem-2017/documents/Day-2_Tutorial-4_Drokin.pdf


The apparent defaults back then in Lustre 2.8 for those two parameters 
were 30MB and 20MB, respectively.  On my 2.15 servers with 256GB and no 
changes from us, I'm seeing numbers of 77244MB and 51496MB, 
respectively. We recently got ourselves into a situation where a subset 
of MDTs appeared to be entirely overwhelmed trying to cancel locks, with 
~500K locks in the request queue but a request wait time of 6000 
seconds. So, we're looking  at potentially limiting the locks on the 
servers.


What's the formula for appropriately sizing ldlm.lock_limit_mb and 
ldlm.lock_reclaim_threshold_mb in 2.15 (I don't think node memory 
amounts have increased 2X in 7 years)?
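
For anyone following along, checking and capping these on a 2.15 server looks 
roughly like this (a sketch; the values are purely illustrative, not a 
recommendation):

  # current values (derived from node RAM by default)
  lctl get_param ldlm.lock_limit_mb ldlm.lock_reclaim_threshold_mb
  # cap them explicitly if the defaults are too generous
  lctl set_param ldlm.lock_reclaim_threshold_mb=8192
  lctl set_param ldlm.lock_limit_mb=16384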


Thanks!

Cameron Harr

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] "zpool upgrade -a" command for upgrade from lustre-2.12 to lustre 2.15 needed?

2024-05-28 Thread Cameron Harr via lustre-discuss
You can import the 0.7-formatted zpools in zfs 2.1 without upgrading 
them; you just won't get all the benefits (and there are a lot) of 2.1 
until you upgrade. So, you could run for a while with the old format 
until you're comfortable upgrading them permanently. You can also take 
snapshots and back them up if you have space to do so.
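
A sketch of what that looks like in practice (the pool name is a placeholder):

  # see whether the imported pool is missing feature flags
  zpool status mdt0pool
  zpool upgrade                       # lists pools that can be upgraded
  # take a safety snapshot first if you have the space
  zfs snapshot -r mdt0pool@pre-2.1-upgrade
  # one-way switch: enables all supported features on the pool
  zpool upgrade mdt0pool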


Cameron

On 5/23/24 04:10, Bernd Melchers via lustre-discuss wrote:

Hi All,
for the process of update from CentOS-7 / lustre-2.12.9 / zfs-0.7 to
Alma-8 / lustre-2.15 / zfs-2.1 we would like to know if the zfs pools
have to be upgraded (with "zpool upgrade -a"). This will make newer zfs
features available to lustre, but with the disadvantage that there is no
way back to zfs-0.7 and lustre-2.12. Which zfs level does lustre-2.15
need?

With kind regards,
Bernd Melchers


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-10 Thread Cameron Harr via lustre-discuss

On 1/10/24 11:59, Thomas Roth via lustre-discuss wrote:
Actually we had MDTs on software raid-1 *connecting two JBODs* for 
quite some time - it worked surprisingly well and was stable.


I'm glad it's working for you!




Hmm, if you have your MDTs on a zpool of mirrors aka raid-10, wouldn't 
going towards raidz2 increase data safety - something you don't need if 
the SSDs never fail anyhow? Doesn't raidz2 protect against failure of 
*any* two disks, whereas in a pool of mirrors the second failure could 
destroy one mirror?


With raidz2 you can replace any disk in the raid group, but there are also 
a lot more drives that can fail. With mirrors, there's a 1:1 replacement 
ratio with essentially no rebuild time. Of course that assumes the 2 
drives you lost weren't the 2 drives in the same mirror, but we consider 
that low-probability. ZFS is also smart enough to (try to) suspend the 
pool if it loses too many devices. And the striped mirrors may see 
better performance than raidz2.
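
For concreteness, the two layouts being compared look roughly like this (a sketch; 
pool and device names are invented):

  # striped mirrors ("raid-10"): 1:1 replacement, essentially no rebuild time
  zpool create mdt0pool mirror nvme0n1 nvme1n1 \
                        mirror nvme2n1 nvme3n1 \
                        mirror nvme4n1 nvme5n1
  # raidz2 over the same six drives: survives any two failures, slower rebuilds
  zpool create mdt0pool raidz2 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1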




Regards
Thomas

On 1/9/24 20:57, Cameron Harr via lustre-discuss wrote:

Thomas,

We value management over performance and have knowingly left 
performance on the floor in the name of standardization, robustness, 
management, etc; while still maintaining our performance targets. We 
are a heavy ZFS-on-Linux (ZoL) shop so we never considered MD-RAID, 
which, IMO, is very far behind ZoL in enterprise storage features.


As Jeff mentioned, we have done some tuning (and if you haven't 
noticed there are *a lot* of possible ZFS parameters) to further 
improve performance and are at a good place performance-wise.


Cameron

On 1/8/24 10:33, Jeff Johnson wrote:

Today nvme/mdraid/ldiskfs will beat nvme/zfs on MDS IOPs but you can
close the gap somewhat with tuning, zfs ashift/recordsize and special
allocation class vdevs. While the IOPs performance favors
nvme/mdraid/ldiskfs there are tradeoffs. The snapshot/backup abilities
of ZFS and the security it provides to the most critical function in a
Lustre file system shouldn't be undervalued. From personal experience,
I'd much rather deal with zfs in the event of a seriously jackknifed
MDT than mdraid/ldiskfs and both zfs and mdraid/ldiskfs are preferable
to trying to unscramble a vendor blackbox hwraid volume. ;-)
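
A sketch of the kind of tuning Jeff mentions (pool names, devices, and values are 
illustrative only, not recommendations):

  # pool-level: match ashift to the drive's physical sector size at creation
  zpool create -o ashift=12 mdt0pool mirror nvme0n1 nvme1n1
  # dataset-level: xattr/dnode settings commonly used for Lustre targets on ZFS
  zfs set xattr=sa dnodesize=auto mdt0pool
  zfs get recordsize mdt0pool
  # on a hybrid HDD pool, a mirrored special vdev keeps metadata and small blocks on flash
  zpool add tank special mirror nvme2n1 nvme3n1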

When zfs directio lands and is fully integrated into Lustre the
performance differences *should* be negligible.

Just my $.02 worth

On Mon, Jan 8, 2024 at 8:23 AM Thomas Roth via lustre-discuss wrote:

Hi Cameron,

did you run a performance comparison between ZFS and mdadm-raid on 
the MDTs?
I'm currently doing some tests, and the results favor software 
raid, in particular when it comes to IOPS.


Regards
Thomas

On 1/5/24 19:55, Cameron Harr via lustre-discuss wrote:
This doesn't answer your question about ldiskfs on zvols, but 
we've been running MDTs on ZFS on NVMe in production for a couple 
years (and on SAS SSDs for many years prior). Our current 
production MDTs using NVMe consist of one zpool/node made up of 3x 
2-drive mirrors, but we've been experimenting lately with using 
raidz3 and possibly even raidz2 for MDTs since SSDs have been 
pretty reliable for us.


Cameron

On 1/5/24 9:07 AM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] via lustre-discuss wrote:
We are in the process of retiring two long standing LFS's (about 
8 years old), which we built and managed ourselves.  Both use ZFS 
and have the MDT'S on ssd's in a JBOD that require the kind of 
software-based management you describe, in our case ZFS pools 
built on multipath devices.  The MDT in one is ZFS and the MDT in 
the other LFS is ldiskfs but uses ZFS and a zvol as you describe 
- we build the ldiskfs MDT on top of the zvol.  Generally, this 
has worked well for us, with one big caveat.  If you look for my 
posts to this list and the ZFS list you'll find more details.  
The short version is that we utilize ZFS snapshots and clones to 
do backups of the metadata.  We've run into situations where the 
backup process stalls, leaving a clone hanging around.  We've 
experienced a situation a couple of times where the clone and the 
primary zvol get swapped, effectively rolling back our metadata 
to the point when the clone was created.  I have tried, 
unsuccessfully, to recreate
that in a test environment.  So if you do that kind of setup, 
make sure you have good monitoring in place to detect if your 
backups/clones stall.  We've kept up with lustre and ZFS updates 
over the years and are currently on lustre 2.14 and ZFS 2.1.  
We've seen the gap between our ZFS MDT and ldiskfs performance 
shrink to the point where they are pretty much on par with each other 
now.  I think our ZFS MDT performance could be better with more 
hardware and software tuning but our small team hasn't had the 
bandwidth to tackle that.


Our newest LFS is vendor provided and uses NVMe MDT's. I'm not at 
liberty to talk about the proprietary way those devices are managed.

Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-09 Thread Cameron Harr via lustre-discuss

Thomas,

We value management over performance and have knowingly left performance 
on the floor in the name of standardization, robustness, management, 
etc; while still maintaining our performance targets. We are a heavy 
ZFS-on-Linux (ZoL) shop so we never considered MD-RAID, which, IMO, is 
very far behind ZoL in enterprise storage features.


As Jeff mentioned, we have done some tuning (and if you haven't noticed 
there are *a lot* of possible ZFS parameters) to further improve 
performance and are at a good place performance-wise.


Cameron

On 1/8/24 10:33, Jeff Johnson wrote:

Today nvme/mdraid/ldiskfs will beat nvme/zfs on MDS IOPs but you can
close the gap somewhat with tuning, zfs ashift/recordsize and special
allocation class vdevs. While the IOPs performance favors
nvme/mdraid/ldiskfs there are tradeoffs. The snapshot/backup abilities
of ZFS and the security it provides to the most critical function in a
Lustre file system shouldn't be undervalued. From personal experience,
I'd much rather deal with zfs in the event of a seriously jackknifed
MDT than mdraid/ldiskfs and both zfs and mdraid/ldiskfs are preferable
to trying to unscramble a vendor blackbox hwraid volume. ;-)

When zfs directio lands and is fully integrated into Lustre the
performance differences *should* be negligible.

Just my $.02 worth

On Mon, Jan 8, 2024 at 8:23 AM Thomas Roth via lustre-discuss wrote:

Hi Cameron,

did you run a performance comparison between ZFS and mdadm-raid on the MDTs?
I'm currently doing some tests, and the results favor software raid, in 
particular when it comes to IOPS.

Regards
Thomas

On 1/5/24 19:55, Cameron Harr via lustre-discuss wrote:

This doesn't answer your question about ldiskfs on zvols, but we've been 
running MDTs on ZFS on NVMe in production for a couple years (and on SAS SSDs 
for many years prior). Our current production MDTs using NVMe consist of one 
zpool/node made up of 3x 2-drive mirrors, but we've been experimenting lately 
with using raidz3 and possibly even raidz2 for MDTs since SSDs have been pretty 
reliable for us.

Cameron

On 1/5/24 9:07 AM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via 
lustre-discuss wrote:

We are in the process of retiring two long standing LFS's (about 8 years old), 
which we built and managed ourselves.  Both use ZFS and have the MDT'S on ssd's 
in a JBOD that require the kind of software-based management you describe, in 
our case ZFS pools built on multipath devices.  The MDT in one is ZFS and the 
MDT in the other LFS is ldiskfs but uses ZFS and a zvol as you describe - we 
build the ldiskfs MDT on top of the zvol.  Generally, this has worked well for 
us, with one big caveat.  If you look for my posts to this list and the ZFS 
list you'll find more details.  The short version is that we utilize ZFS 
snapshots and clones to do backups of the metadata.  We've run into situations 
where the backup process stalls, leaving a clone hanging around.  We've 
experienced a situation a couple of times where the clone and the primary zvol 
get swapped, effectively rolling back our metadata to the point when the clone 
was created.  I have tried, unsuccessfully, to recreate
that in a test environment.  So if you do that kind of setup, make sure you 
have good monitoring in place to detect if your backups/clones stall.  We've 
kept up with lustre and ZFS updates over the years and are currently on lustre 
2.14 and ZFS 2.1.  We've seen the gap between our ZFS MDT and ldiskfs 
performance shrink to the point where they are pretty much on par with each other 
now.  I think our ZFS MDT performance could be better with more hardware and software 
tuning but our small team hasn't had the bandwidth to tackle that.

Our newest LFS is vendor provided and uses NVMe MDT's.  I'm not at liberty to 
talk about the proprietary way those devices are managed.  However, the 
metadata performance is SO much better than our older LFS's, for a lot of 
reasons, but I'd highly recommend NVMe's for your MDT's.

-----Original Message-----
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Thomas Roth via lustre-discuss <lustre-discuss@lists.lustre.org>
Reply-To: Thomas Roth <t.r...@gsi.de>
Date: Friday, January 5, 2024 at 9:03 AM
To: Lustre Diskussionsliste <lustre-discuss@lists.lustre.org>
Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?



Dear all,


considering NVME storage for the next MDS.


As I understand, NVME disks are bundled in software, not by a hardware raid 
controller.
This would be done using Linux software raid, mdadm, correct?


We have some experience with ZFS, which we use on our OSTs.

Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-05 Thread Cameron Harr via lustre-discuss
This doesn't answer your question about ldiskfs on zvols, but we've been 
running MDTs on ZFS on NVMe in production for a couple years (and on SAS 
SSDs for many years prior). Our current production MDTs using NVMe 
consist of one zpool/node made up of 3x 2-drive mirrors, but we've been 
experimenting lately with using raidz3 and possibly even raidz2 for MDTs 
since SSDs have been pretty reliable for us.


Cameron

On 1/5/24 9:07 AM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] 
via lustre-discuss wrote:

We are in the process of retiring two long standing LFS's (about 8 years old), 
which we built and managed ourselves.  Both use ZFS and have the MDT'S on ssd's 
in a JBOD that require the kind of software-based management you describe, in 
our case ZFS pools built on multipath devices.  The MDT in one is ZFS and the 
MDT in the other LFS is ldiskfs but uses ZFS and a zvol as you describe - we 
build the ldiskfs MDT on top of the zvol.  Generally, this has worked well for 
us, with one big caveat.  If you look for my posts to this list and the ZFS 
list you'll find more details.  The short version is that we utilize ZFS 
snapshots and clones to do backups of the metadata.  We've run into situations 
where the backup process stalls, leaving a clone hanging around.  We've 
experienced a situation a couple of times where the clone and the primary zvol 
get swapped, effectively rolling back our metadata to the point when the clone 
was created.  I have tried, unsuccessfully, to recreate that in a test 
environment.  So if you do that kind of setup, make sure you have good 
monitoring in place to detect if your backups/clones stall.  We've kept up with 
lustre and ZFS updates over the years and are currently on lustre 2.14 and ZFS 
2.1.  We've seen the gap between our ZFS MDT and ldiskfs performance shrink to 
the point where they are pretty much on par with each other now.  I think our ZFS MDT 
performance could be better with more hardware and software tuning but our 
small team hasn't had the bandwidth to tackle that.

Our newest LFS is vendor provided and uses NVMe MDT's.  I'm not at liberty to 
talk about the proprietary way those devices are managed.  However, the 
metadata performance is SO much better than our older LFS's, for a lot of 
reasons, but I'd highly recommend NVMe's for your MDT's.

-----Original Message-----
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Thomas Roth via lustre-discuss <lustre-discuss@lists.lustre.org>
Reply-To: Thomas Roth <t.r...@gsi.de>
Date: Friday, January 5, 2024 at 9:03 AM
To: Lustre Diskussionsliste <lustre-discuss@lists.lustre.org>
Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?



Dear all,


considering NVME storage for the next MDS.


As I understand, NVME disks are bundled in software, not by a hardware raid 
controller.
This would be done using Linux software raid, mdadm, correct?


We have some experience with ZFS, which we use on our OSTs.
But I would like to stick to ldiskfs for the MDTs, and a zpool with a zvol on 
top which is then formatted with ldiskfs - too much voodoo...


How is this handled elsewhere? Any experiences?




The available devices are quite large. If I create a raid-10 out of 4 disks, 
e.g. 7 TB each, my MDT will be 14 TB - already close to the 16 TB limit.
So no need for a box with lots of U.3 slots.


But for MDS operations, we will still need a powerful dual-CPU system with lots 
of RAM.
Then the NVME devices should be distributed between the CPUs?
Is there a way to pinpoint this in a call for tender?




Best regards,
Thomas



Thomas Roth


GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, http://www.gsi.de/



Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Volkmar Dietz




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org 
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Re: [lustre-discuss] About Lustre small files performace(8k) improve

2023-03-27 Thread Cameron Harr via lustre-discuss
I'll assume here you're referring to MPI-IO, which is not really a 
"feature" but a way to perform parallel I/O using the message passing 
interface (MPI) stack. There can also be different interpretations of 
what MPI-IO exactly means: many HPC applications (and benchmarks such as 
IOR) use MPI and can write out data in parallel using various I/O APIs. 
One of those APIs for IOR is "mpiio", from the web page:


-a S api – API for I/O [POSIX|MPIIO|HDF5|HDFS|S3|S3_EMC|NCMPI|RADOS]

In short, running parallel threads to write files will usually boost your 
IOPS to a point, though which API you choose should probably reflect 
what your real-life applications use.

Cameron


On 3/18/23 2:44 AM, 王烁斌 via lustre-discuss wrote:

Hi all,

This is my Lustre FS.
UUID                  1K-blocks        Used     Available  Use%  Mounted on
ltfs-MDT0000_UUID      307826072       36904    281574768    1%  /mnt/lfs[MDT:0]
ltfs-MDT0001_UUID      307826072       36452    281575220    1%  /mnt/lfs[MDT:1]
ltfs-MDT0002_UUID      307826072       36600    281575072    1%  /mnt/lfs[MDT:2]
ltfs-MDT0003_UUID      307826072       36300    281575372    1%  /mnt/lfs[MDT:3]
ltfs-OST0000_UUID    15962575136     1027740  15156068868    1%  /mnt/lfs[OST:0]
ltfs-OST0001_UUID    15962575136     1027780  15156067516    1%  /mnt/lfs[OST:1]
ltfs-OST0002_UUID    15962575136     1027772  15156074212    1%  /mnt/lfs[OST:2]
ltfs-OST0003_UUID    15962575136     1027756  15156067860    1%  /mnt/lfs[OST:3]
ltfs-OST0004_UUID    15962575136     1027728  15156058224    1%  /mnt/lfs[OST:4]
ltfs-OST0005_UUID    15962575136     1027772  15156057668    1%  /mnt/lfs[OST:5]
ltfs-OST0006_UUID    15962575136     1027768  15156058568    1%  /mnt/lfs[OST:6]
ltfs-OST0007_UUID    15962575136     1027792  15156056752    1%  /mnt/lfs[OST:7]

filesystem_summary:  127700601088     8222108 121248509668    1%  /mnt/lfs

The structure is as follows:

After testing, under the current structure, the write performance for 
500,000 8k small files is:

NFSclient1: IOPS 28,000; bandwidth 230 MB/s
NFSclient1: IOPS 27,500; bandwidth 220 MB/s

Now I want to improve the performance of small files to a better 
level. May I ask if there is a better way?


I have noticed a feature called "MIP-IO" that can improve small file 
performance, but I don't know how to deploy this feature. Is there any 
way to improve small file performance?





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] BCP for High Availability?

2023-01-19 Thread Cameron Harr via lustre-discuss
We (LLNL) were probably that lab using pacemaker-remote, and we still 
are, as it generally works and is what we're used to. That said, on an 
upcoming system, we may end up trying 2-node HA clusters due to the 
vendor's preference. I'm not sure what specifics you're interested in, 
but as you mention, the PM-remote option lets one cluster bring up or down 
the entire file system and can handle fencing and resource management 
for everyone. The biggest caveat with this method (learned harshly by 
numerous folks) is not to do 'systemctl stop pacemaker' on that central 
node unless you really want to take down the entire file system.


On 1/15/23 18:37, Andrew Elwell via lustre-discuss wrote:

Hi Folks,

I'm just rebuilding my testbed and have got to the "sort out all the
pacemaker stuff" part. What's the best current practice for the
current LTS (2.15.x) release tree?

I've always done this as multiple individual HA clusters covering each
pair of servers with common dual connected drive array(s), but I
remember seeing a talk some years ago where one of the US labs was
using ?pacemaker-remote? and bringing them all up from a central node

I note there's a few (old) crib notes on the wiki - referenced from
the lustre manual, but nothing updated in the last couple of years.

What are people out there doing?


Many thanks

Andrew
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.12.6 on RHEL 7.9 not able to mount disks after reboot

2022-08-09 Thread Cameron Harr via lustre-discuss

JC,

The message where it asks if the MGS is running is a pretty common error 
that you'll see when something isn't right. There's not a lot of detail 
in your message but first step is to make sure your OST device is 
present on the OSS server. You mentioned remounting the RAID 
directories; is this software/MD RAID? Are you using ldiskfs or ZFS for 
the backend storage (I'll guess ldiskfs if using MD RAID).


If you've already verified the OST volume is present, see if you can 
'lctl ping' between the MDS and OSS nodes. I'm not sure what your 
knowledge base is so forgive me if this is too elementary, but on each 
node, type 'lctl list_nids' to get the Lustre node identifier, then run 
'lctl ping <nid>' to make sure you can talk Lustre/LNet between them:


[root@tin1:~]# lctl list_nids
192.168.101.1@o2ib1

[root@tin6:~]# lctl list_nids
192.168.101.6@o2ib1

[root@tin6:~]# lctl ping 192.168.101.1@o2ib1
12345-0@lo
12345-192.168.101.1@o2ib1

[root@tin1:~]# lctl ping 192.168.101.6@o2ib1
12345-0@lo
12345-192.168.101.6@o2ib1

If you get a failure (like I/O Error), then you have a communications 
problem and you'll want to make sure all the correct interfaces are up. 
If the pings do work, then you'll want to look for messages in 
/var/log/lustre and dmesg.


Cameron

On 8/9/22 06:45, Crowder, Jonathan via lustre-discuss wrote:


Hello, this is my first post here so I may need some guidance on the 
function of this system.


I am in a small team supporting some 36TB Lustre servers for a 
business unit. Our configuration per mount point is one Lustre master 
node and 3 Lustre object stores. We lost one of the object stores to 
an unidentified reboot, and upon getting it booted back into the 
Lustre kernel by the Azure cloud teams, we saw behavior where we could not 
get it to remount the RAID directories for storage to the local file 
paths we have set up for them. I can obtain the output soon; it 
knows the MGS node but asks if it's running. I am having difficulty 
investigating more deeply into why this is happening, as the other 
object stores are working without issue.


Thanks,

JC




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] question regarding du vs df on lustre

2022-04-19 Thread Cameron Harr via lustre-discuss
One thing you can look at is running 'zpool iostat 1' (there are many 
options) to monitor that ZFS is still doing I/O during that time gap. 
With NVMe though, as Andreas said, I would expect that time gap to last 
seconds to minutes, not hours.
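
A sketch of what that monitoring could look like on an OSS and on the MDS (the osp 
parameter below is an assumption on my part; adjust to your setup):

  # on the OSS: confirm the OST pools are still destroying objects
  zpool iostat -v 1
  # on the MDS: object destroys queued from the MDT toward each OST
  lctl get_param osp.*.destroys_in_flight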


On 4/19/22 02:16, Einar Næss Jensen wrote:

Thank you for answering Andreas.

Lustre version is 2.12.8

It was indeed when we deleted io500 files that we discovered this, but we also see it when 
deleting other files: the "df" lags 1-2 hours behind.
We see it both on NVMe and SSD drives. Haven't checked HDD drives/OSTs yet.

This is a new lustre setup, and benchmarks are good (in our opinion). For now 
it is just this annoyance bugging us.
We didn't notice on previous lustre setup but will check if we see it there 
also.


Einar





From: Andreas Dilger 
Sent: Monday, April 11, 2022 18:01
To: Einar Næss Jensen
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] question regarding du vs df on lustre

Lustre is returning the file unlink from the MDS immediately, but deleting
the objects from the OSTs asynchronously in the background.

How many files are being deleted in this case?  If you are running tests
like IO500, where there are many millions of small files plus some huge
files, then it may be that huge object deletion is behind small objects?

That said, it probably shouldn't take hours to finish if the OST storage is
NVMe based.

Cheers, Andreas


On Apr 4, 2022, at 05:05, Einar Næss Jensen  wrote:

Hello lustre people.

We are experimenting with lustre on nvme, and observe the following issue:
After running benchmarks and deleting benchmark files, we see that df and du 
reports different sizes:

[root@idun-02-27 ~]#  du -hs /nvme/
38M /nvme/
[root@idun-02-27 ~]# df -h|grep nvme
10.3.1.2@o2ib:/nvme   5.5T  3.9T  1.3T  76% /nvme


It takes several hours before du and df agrees.

What is causing this?
How can we get updated records for df immediately when deleting files?


Best Regards
Einar
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre lts roadmap

2021-11-23 Thread Cameron Harr via lustre-discuss

Einar,

It's easier to upgrade the Lustre version than your OS, IMO. So, if you 
want to run a RHEL/CentOS 8 derivative, then you may need to go with 
2.14 on your servers for now and then upgrade to 2.15 (future LTS), once 
that becomes stable. That's our current plan.


Cameron

On 11/22/21 5:37 AM, Peter Jones via lustre-discuss wrote:

Hi Einar

Based on the results we see in the OpenSFS community survey, the majority of sites 
stick to the LTS releases. Lustre 2.12.8 LTS is in release testing ATM.  Again from 
the survey results, most current CentOS users intend to use either Rocky or Alma 
when they move to 8.x versions.

Peter

On 2021-11-22, 5:29 AM, "lustre-discuss on behalf of Einar Næss Jensen" wrote:

 Hello everyone.
 
 Thanks for all help in the past.

 We are in the process of installing a new Lustre system, and we are unsure 
what version to use.
 Our current system is 2.10, and we are debating whether we should go for 
2.12.x LTS, 2.14.x, or wait for the new LTS version.
 
 Is there an expected release date for the next LTS?
 
 Do you (the community) stick to LTS, or do you run the more recent versions?
 
 
 BTW: What are Lustre's plans with regard to CentOS? Will Rocky be the way forward?
 
 
 Best Regards

 Einar
 
 
 
 Sent from my ICE X Supercomputer

 Einar Næss jensen
 einar.nass.jen...@ntnu.no
 +47 90990249
 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Missing OST's from 1 node only

2021-10-12 Thread Cameron Harr via lustre-discuss

  
I don't know the problem here, but you might want to look for
  connectivity issues from the client to the OSS(s) that house those
  two missing OSTs. I would image the lustre.log would show such
  errors in bulk. I've seen where an IB subnet manager gets in a
  weird state such that some nodes can no longer find a path to
  certain other nodes.
Cameron

On 10/7/21 4:54 PM, Sid Young via lustre-discuss wrote:

G'Day all,

I have an odd situation where 1 compute node mounts /home
and /lustre but only half the OSTs are present, while all the
other nodes are fine. Not sure where to start on this one?


Good node:
[root@n02 ~]# lfs df
UUID                   1K-blocks         Used     Available Use% Mounted on
home-MDT0000_UUID     4473970688     30695424    4443273216   1% /home[MDT:0]
home-OST0000_UUID    51097721856  39839794176   11257662464  78% /home[OST:0]
home-OST0001_UUID    51097897984  40967138304   10130627584  81% /home[OST:1]
home-OST0002_UUID    51097705472  37731089408   13366449152  74% /home[OST:2]
home-OST0003_UUID    51097773056  41447411712    9650104320  82% /home[OST:3]

filesystem_summary:  204391098368 159985433600  44404843520  79% /home

UUID                   1K-blocks         Used     Available Use% Mounted on
lustre-MDT0000_UUID   5368816128     28246656    5340567424   1% /lustre[MDT:0]
lustre-OST0000_UUID  51098352640  10144093184   40954257408  20% /lustre[OST:0]
lustre-OST0001_UUID  51098497024   9584398336   41514096640  19% /lustre[OST:1]
lustre-OST0002_UUID  51098414080  11683002368   39415409664  23% /lustre[OST:2]
lustre-OST0003_UUID  51098514432  10475310080   40623202304  21% /lustre[OST:3]
lustre-OST0004_UUID  51098506240  11505326080   39593178112  23% /lustre[OST:4]
lustre-OST0005_UUID  51098429440   9272059904   41826367488  19% /lustre[OST:5]

filesystem_summary:  306590713856  62664189952  243926511616  21% /lustre

[root@n02 ~]#






The bad node:
[root@n04 ~]# lfs df
UUID                   1K-blocks         Used     Available Use% Mounted on
home-MDT0000_UUID     4473970688     30726400    4443242240   1% /home[MDT:0]
home-OST0002_UUID    51097703424  37732352000   13363446784  74% /home[OST:2]
home-OST0003_UUID    51097778176  41449634816    9646617600  82% /home[OST:3]

filesystem_summary:  102195481600  79181986816   23010064384  78% /home

UUID                   1K-blocks         Used     Available Use% Mounted on
lustre-MDT0000_UUID   5368816128     28246656    5340567424   1% /lustre[MDT:0]
lustre-OST0003_UUID  51098514432  10475310080   40623202304  21% /lustre[OST:3]
lustre-OST0004_UUID  51098511360  11505326080   39593183232  23% /lustre[OST:4]
lustre-OST0005_UUID  51098429440   9272059904   41826367488  19% /lustre[OST:5]

filesystem_summary:  153295455232  31252696064  122042753024  21% /lustre

[root@n04 ~]#

Sid Young
Translational Research Institute
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Fwd: ldiskfs vs zfs

2021-02-17 Thread Cameron Harr via lustre-discuss

Sudheendra,

You will get varied answers depending on who you ask. For us, we are 
strong believers in ZFS. Of course, as a major contributor to 
ZFS-on-Linux, we're heavily biased, but we believe the management 
features (snapshots, etc) outweigh any performance deficiencies compared 
to ldiskfs, especially when using SSDs. Plus, ZFS is continually landing 
major improvements that boost performance and resiliency.


Cameron

On 2/16/21 1:27 PM, Sudheendra Sampath wrote:



-- Forwarded message -
From: >

Date: Tue, Feb 16, 2021 at 1:06 PM
Subject: Fwd: ldiskfs vs zfs
To: <sudheendra.samp...@gmail.com>


Your message has been rejected, probably because you are not
subscribed to the mailing list and the list's policy is to prohibit
non-members from posting to it.  If you think that your messages are
being rejected in error, contact the mailing list owner at
lustre-discuss-ow...@lists.lustre.org.





-- Forwarded message --
From: Sudheendra Sampath
To: lustre-discuss@lists.lustre.org 


Cc:
Bcc:
Date: Tue, 16 Feb 2021 13:05:47 -0800
Subject: Fwd: ldiskfs vs zfs
Including lustre-discuss mailing list for some direction.

Thanks for your help.

-- Forwarded message -
From: *Sudheendra Sampath*

Date: Tue, Feb 16, 2021 at 11:37 AM
Subject: ldiskfs vs zfs
To: <lustre-de...@lists.lustre.org>


Hi,

I am trying to evaluate SSD based Lustre setup in our lab.  I am 
looking for some answers for the following questions:


 1. Which of the 2 underlying file systems is popular among the global
HPC community and reasons behind it ?
 2. Since I am using SSD based setup, if any data/links mention the
overhead of using one over the other.

Appreciate any online links to understand the validation done.

Thanks for your help.
--
Regards

Sudheendra Sampath


--
Regards

Sudheendra Sampath


--
Regards

Sudheendra Sampath

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] tgt grant error ..

2020-08-03 Thread Cameron Harr

Cory,

Thanks for crediting Vladimir. The patch associated with LU-13766 was 
done by Mike Pershin @ Whamcloud -- thus my comment, but upon looking 
closer I see that it was originally done for LU-12687 by Vladimir.


On 8/1/20 6:04 AM, Spitz, Cory James wrote:


+ Vladimir

I _think_ that the proposed fix for LU-13766 came from Vladimir 
under LU-12687.  Is there a different proposed patch from Whamcloud?


It seems that we all do have other non-direct IO grant problems to 
sort out.  I guess I’m kinda fishing to see if there is another grant 
patch out there.


-Cory

On 7/31/20, 12:09 PM, "lustre-discuss on behalf of Cameron Harr" 
<lustre-discuss-boun...@lists.lustre.org on behalf of ha...@llnl.gov> wrote:


We have been hitting similar warnings. See 
https://jira.whamcloud.com/browse/LU-13766, which has a possible 
patch available from Whamcloud.


On 7/28/20 10:24 PM, Zeeshan Ali Shah wrote:

Dear All, we are getting the following errors. What does this mean? Any
advice?

[Wed Jul 29 08:17:33 2020] LustreError:
24267:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST000e: cli
61f1ceb7-b0ab-3607-1d60-69e1f3abc7a5 claims 1703936 GRANT, real
grant 0
[Wed Jul 29 08:17:33 2020] LustreError:
24267:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 421 previous
similar messages
[Wed Jul 29 08:18:38 2020] LustreError:
24267:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST000c: cli
61f1ceb7-b0ab-3607-1d60-69e1f3abc7a5 claims 1703936 GRANT, real
grant 0
[Wed Jul 29 08:18:38 2020] LustreError:
24267:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 810 previous
similar messages
[Wed Jul 29 08:20:49 2020] LustreError:
24012:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0010: cli
61f1ceb7-b0ab-3607-1d60-69e1f3abc7a5 claims 1703936 GRANT, real
grant 0
[Wed Jul 29 08:20:49 2020] LustreError:
24012:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 1783 previous
similar messages

Thanks a lot for your help

Br

Zeeshan



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] tgt grant error ..

2020-07-31 Thread Cameron Harr

  
We have been hitting similar warnings. See
https://jira.whamcloud.com/browse/LU-13766, which has a possible
patch available from Whamcloud.

On 7/28/20 10:24 PM, Zeeshan Ali Shah wrote:


  
Dear All, we are getting the following errors. What does this mean? Any advice?

[Wed Jul 29 08:17:33 2020] LustreError: 24267:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST000e: cli 61f1ceb7-b0ab-3607-1d60-69e1f3abc7a5 claims 1703936 GRANT, real grant 0
[Wed Jul 29 08:17:33 2020] LustreError: 24267:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 421 previous similar messages
[Wed Jul 29 08:18:38 2020] LustreError: 24267:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST000c: cli 61f1ceb7-b0ab-3607-1d60-69e1f3abc7a5 claims 1703936 GRANT, real grant 0
[Wed Jul 29 08:18:38 2020] LustreError: 24267:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 810 previous similar messages
[Wed Jul 29 08:20:49 2020] LustreError: 24012:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0010: cli 61f1ceb7-b0ab-3607-1d60-69e1f3abc7a5 claims 1703936 GRANT, real grant 0
[Wed Jul 29 08:20:49 2020] LustreError: 24012:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 1783 previous similar messages

Thanks a lot for your help

Br

Zeeshan
  

  
  
  
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Can we re-index the lustre-discuss archive DB?

2020-07-16 Thread Cameron Harr

Thank you, Andreas. Is that link posted somewhere that I missed?

On 7/15/20 6:47 PM, Andreas Dilger wrote:

On Jul 15, 2020, at 6:07 PM, Cameron Harr  wrote:

To the person with the power,

I've been trying to search the lustre-discuss 
(http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/) archives but it seems only 
old (<= 2013 perhaps) messages are searchable with the "Search" box. Is it 
possible to re-index the searchable DB to include recent/current messages?

Cameron, it looks like there is a full archive at:

https://marc.info/?l=lustre-discuss

Cheers, Andreas






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Can we re-index the lustre-discuss archive DB?

2020-07-15 Thread Cameron Harr

To the person with the power,

I've been trying to search the lustre-discuss 
(http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/) archives 
but it seems only old (<= 2013 perhaps) messages are searchable with the 
"Search" box. Is it possible to re-index the searchable DB to include 
recent/current messages?


Thanks,

Cameron

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] High MDS load

2020-05-28 Thread Cameron Harr

  
Are you using any Lustre monitoring tools? We use ltop from the
LMT package (https://github.com/LLNL/lmt), and during that time of
high load you could see if there are bursts of IOPS coming in.
Running iotop or iostat might also provide some insight into the
load if it is I/O based.

Cameron
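
If ltop isn't an option, a rough sketch of what to watch during one of those spikes 
(parameter names assumed for 2.x servers):

  # metadata operation counts on the MDT (sample twice and diff, or feed to llstat)
  lctl get_param mdt.*.md_stats
  # per-client export stats, to spot a single noisy node
  lctl get_param mdt.*.exports.*.stats
  # raw disk pressure on the MDT devices
  iostat -xm 5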

On 5/28/20 8:37 AM, Peeples, Heath wrote:

I have 2 MDSs and periodically the load on one of them (either at one 
time or another) peaks above 300, causing the file system to basically 
stop.  This lasts for a few minutes and then goes away.  We can’t 
identify any one user running jobs at the times we see this, so it’s 
hard to pinpoint this on a user doing something to cause it.  Could 
anyone point me in the direction of how to begin debugging this?  Any 
help is greatly appreciated.

Heath
  
  
  
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Incorrect user's file count

2020-05-26 Thread Cameron Harr

  
One other suggestion for the discrepancy is if the user has files
outside of their user directory on the file system -- for instance
in a group or project directory. User quotas have no notion of
data path and just look at files owned by that user across the
entire file system.

Cameron

On 5/26/20 9:47 AM, Co, Michele (mc2zk) wrote:


  
  
  
  
Hi!

We are running:
(client) lustre-client-2.10.7-1.el7.x86_64
(server) lustre-2.10.7-1.el7.x86_64

and are using lfs quota -u <user> to monitor users’ Lustre file counts.

We have discovered that for at least one user, there is a large 
discrepancy between the number of files reported by various utilities:

lfs quota -u <user> .              // reports 334345 files
lfs find <dir> -u <user> | wc -l   // reports 127812
find . -type f | wc -l             // reports 104580 files

I’ve read through the archives and found that the discrepancy could be 
due to files being deleted which are still open.  However, it’s unclear 
to me in Lustre how to determine which files might be in this state or 
how to clean this up.  Could anyone here give me some tips on how to 
troubleshoot/resolve this discrepancy?

Thanks in advance,
Michele Co
  
  
  
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] changelog not purging

2020-04-16 Thread Cameron Harr
What Lustre version are you on? Does it look like this issue: 
https://jira.whamcloud.com/browse/LU-12098
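
For reference, a sketch of the commands involved (filesystem, MDT, and user IDs are 
placeholders):

  # on the MDS: which changelog consumers are registered and how far behind they are
  lctl get_param mdd.*.changelog_users
  # from a client: acknowledge (clear) records up to the current last one for cl1
  lfs changelog_clear fsname-MDT0000 cl1 0
  # on the MDS: remove the consumer entirely
  lctl --device fsname-MDT0000 changelog_deregister cl1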


On 4/16/20 1:35 PM, John White wrote:

We’re having some pretty intense MDT capacity growth issues on one of our file 
systems that coincide exactly with changelogs.  Problem is we can’t seem to 
purge them.  We can deregister them but the capacity remains.  For a 750mil 
files we have close to 3TB of capacity usage on the MDT.

I know there are 2 files in the root of the ldiskfs for the changelog but those 
are only a couple MB max.  Are there other files in the ldiskfs directory 
structure that can be manually purged?
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] different quota counts between different Lustre versions?

2018-09-13 Thread Cameron Harr

Thomas,

ZFS 0.7+ (I think) has built-in object accounting that is more accurate 
than Lustre's guess. If you do a 'zpool get all | grep 
userobj_accounting' you can see if it's enabled and active (two separate 
states, I found out just yesterday). In my testing, I upgraded from a ZFS 
0.6 zpool (where Lustre quotas were definitely showing wrong values) to ZFS 0.7 
and noticed the accounting didn't immediately change. After stopping 
Lustre and exporting/importing the pools again, things showed up 
correctly and userobj_accounting went from "enabled" to "active".
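
A sketch of checking that feature flag directly (the pool name is a placeholder):

  # "enabled" = on-disk support present; "active" = per-user object counts actually built
  zpool get feature@userobj_accounting mdt0pool
  # after an upgrade from 0.6, an export/import with Lustre stopped lets it go active
  zpool export mdt0pool && zpool import mdt0pool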


Cameron


On 09/13/2018 05:33 AM, Thomas Roth wrote:

Hi all,

we have one Lustre 2.5.3 with ZFS 0.6.3, and another one with Lustre 2.10.3 
and ZFS 0.7.6.
I have copied a number of directories from old to new, and noticed that the 
numbers of inodes given by 'lfs quota' differ.

Extreme example is a user with just one file:

# lfs quota -h -u USER /lustre/old
Disk quotas for usr USER:
    Filesystem    used   quota   limit   grace   files   quota   limit   grace
   /lustre/old    133k      0k      0k       -       1       0       0       -


# lfs quota -h -u USER /lustre/new
Disk quotas for usr USER:
    Filesystem    used   quota   limit   grace   files   quota   limit   grace
   /lustre/new     60k      0k      0k       -       2       0       0       -



(Size difference is easy: it is a text file, and only the OSS in the 'new' FS 
have compression enabled.)


I suppose some counting scheme has changed between those Lustre versions?

However, some copied directories have much larger count-differences, but some 
have identical numbers.


Cheers,
Thomas



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] How to find the performance degrade reason

2018-09-12 Thread Cameron Harr

I realize I didn't answer all the questions, so:

1) For a file-system-level view, we use ltop, a handy utility that's in 
the LMT package (https://github.com/LLNL/lmt). If you want to pinpoint 
OST or MDT tests, you could lock a file/dir to a particular OST/MDT and 
then use 'zpool iostat ...' to monitor I/O for that device (see the 
sketch after point 2).


2) Again, I'd recommend getting ltop. 'zpool iostat' can give you 
numerous performance statistics as well.
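
As mentioned in (1), a sketch of pinning a test directory to one OST and watching 
its pool (the index, pool, and path names are placeholders):

  # all new files under this directory go to OST index 3
  lfs setstripe -c 1 -i 3 /lustre/fs/ostpin
  # on the OSS hosting OST0003, watch that pool while the test runs
  zpool iostat -v ost3pool 1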



On 09/12/2018 04:11 PM, Cameron Harr wrote:


Use 'zpool iostat -v ...'. You can review the man page for lots of 
options, including latency statistics.


Cameron


On 09/10/2018 11:21 AM, Zeeshan Ali Shah wrote:

Dear All,

Suddenly our Lustre installation has become dead slow; copying data from a 
local source to Lustre results in around 20 MB/s, where before it was above 600 MB/s.


i checked zfs status -xv and (all pool healths are ok)

1) How to check which OSTs are involved during data write operation ?
2) How to check Meta data (read and write) stats ?
3) any other advice to drill down the reason ..


/Zeeshan


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] How to find the performance degrade reason

2018-09-12 Thread Cameron Harr
Use 'zpool iostat -v ...'. You can review the man page for lots of 
options, including latency statistics.


Cameron


On 09/10/2018 11:21 AM, Zeeshan Ali Shah wrote:

Dear All,

Suddenly our Lustre installation has become dead slow; copying data from a 
local source to Lustre results in around 20 MB/s, where before it was above 600 MB/s.


i checked zfs status -xv and (all pool healths are ok)

1) How to check which OSTs are involved during data write operation ?
2) How to check Meta data (read and write) stats ?
3) any other advice to drill down the reason ..


/Zeeshan


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] help ldev.conf usage clarification

2018-09-12 Thread Cameron Harr
We only use ZFS, but we have one file with all MGS/MDTs/OSTs in it, 
replicated across all Lustre server nodes. Note the 2nd column is for a 
failover peer, which you hopefully have.
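
For illustration, a sketch of that layout with failover hostnames in the second 
column (hostnames, labels, and pool names are placeholders):

  # local  foreign  label  device-path
  mds01  mds02  mgs    zfs:lustre01-mgs/mgs
  mds01  mds02  mdt0   zfs:lustre01-mdt0/mdt0
  oss01  oss02  ost01  zfs:lustre01-ost01/ost01
  oss02  oss01  ost02  zfs:lustre01-ost02/ost02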



On 09/12/2018 03:11 PM, Riccardo Veraldi wrote:

Hello,

I wanted to ask for some clarification on ldev.conf usage and features.

I am using ldev.conf only on my ZFS lustre OSSes and MDS.

Anyway, I have a doubt about what should go in that file.

I have seen people having only the metadata configuration in it like 
for example:


mds01 - mgs zfs:lustre01-mgs/mgs
mds01 - mdt0 zfs:lustre01-mdt0/mdt0

and people filling the file with both the MGS settings and also listing 
all the OSSes/OSTs, then spreading the same ldev.conf file over all the 
OSSes, like in this example with 3 OSSes where each one has one OST:


mds01 - mgs zfs:lustre01-mgs/mgs
mds01 - mdt0 zfs:lustre01-mdt0/mdt0
#
drp-tst-ffb01 - OST01 zfs:lustre01-ost01/ost01
drp-tst-ffb02 - OST02 zfs:lustre01-ost02/ost02
drp-tst-ffb03 - OST03 zfs:lustre01-ost03/ost03

is this correct, or should only the metadata information stay in 
ldev.conf?


Also, can ldev.conf be used with an ldiskfs-based cluster? On ldiskfs-based 
clusters I usually mount the metadata partition and OSS 
partitions in fstab and my ldev.conf is empty.


thanks

Rick



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Making a file system read-only

2018-02-06 Thread Cameron Harr

On 02/06/2018 11:23 AM, E.S. Rosenberg wrote:

Hi Cameron,

On Mon, Feb 5, 2018 at 10:54 PM, Cameron Harr <ha...@llnl.gov> wrote:


Greetings,

I'll be retiring a 2.5 ZFS-backed file system in the coming months
and plan on putting it in a "read-only" state for 6 weeks or so to
let users archive or migrate data they want. The optimal, if
slightly contradictory, definition of "read-only" for me would be
to allow unlinks, but disallow all other writes; however, if
that's not possible, then disallowing all writes would be
sufficient. I will be doing some testing, but our 2.5 T&D system
won't be available for several days and am therefore soliciting
advice up front.

Option 1 is to remount the file system on the clients in read-only
mode, but I can imagine a couple problems with this method.

Just out of curiosity, what type of problems do you foresee? Missing a 
client still being rw?


There are potentially thousands of clients for some of the systems and I 
can foresee missing clients as well as having many clients hang when 
remounting due to open files. It can also be tedious given a large 
number of different systems. Regardless, I may very  well end up doing 
this client remount anyway.



Option 2 is to deactivate the OSTs with lctl, but that also leaves
me with some questions. In section 14.8.3.1a of the Lustre manual,
it recommends setting max_create_count=0 with Lustre 2.9 and
above. I'm using 2.5, not 2.9, but noticed that
/proc/fs/lustre/osp/fs-*-osc-MDT/max_create_count does indeed
exist. Does setting that option in 2.5 still have an effect in
prohibiting new file and directory creates? Additionally, will
clients be OK having all OSTs inactive?

There was some discussion of making a/multiple OSTs read-only for data 
migration last year that you may find useful:

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-March/014294.html


Thanks Eli. I didn't know about the "degraded" option. I'll try setting 
that on the OSTs and deactivating the OSTs on the MDS (though it's not 
clear if unlinks are allowed back in 2.5 or just 2.9+).

-Cameron

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] What does your (next) MDT look like?

2018-02-06 Thread Cameron Harr

  
  
On 2/6/18 9:32 AM, E.S. Rosenberg wrote:

    Hello fellow Lustre users :)

    Since I didn't want to take the "size of MDT, inode count, inode size"
    thread too far off-topic I'm starting a new thread.

    I'm curious how many people are using SSD MDTs?

We've used SSDs exclusively as MDTs for several years, set up as
groups of mirrors (2-way and 3-way) in ZFS. The mirrors span two SSD
disk enclosures. Our tests showed improved MD performance, though I
don't have the numbers.

    Also how practical is such a thing in a 2.11.x Data On MDT scenario?

    Is using some type of mix between HDD and SSD storage for
    MDTs practical?

You could use SSDs for separate cache or ZIL SLOG devices. I
wouldn't think you'd want to mix the two otherwise, except during the
transition of HDD -> SSD.
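
A sketch of that hybrid arrangement, assuming an HDD-backed MDT pool with spare
SSDs (names invented):

  # mirrored SLOG absorbs synchronous metadata writes
  zpool add mdt0pool log mirror ssd0 ssd1
  # L2ARC read cache; nothing is lost if the device dies
  zpool add mdt0pool cache ssd2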


  

    Does SSD vs HDD have an effect as far as ldiskfs vs zfs?

Nothing I'm aware of. ldiskfs vs. ZFS is a separate decision.

    Thanks!
    Eli
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Making a file system read-only

2018-02-05 Thread Cameron Harr

Greetings,

I'll be retiring a 2.5 ZFS-backed file system in the coming months and 
plan on putting it in a "read-only" state for 6 weeks or so to let users 
archive or migrate data they want. The optimal, if slightly 
contradictory, definition of "read-only" for me would be to allow 
unlinks, but disallow all other writes; however, if that's not possible, 
then disallowing all writes would be sufficient. I will be doing some 
testing, but our 2.5 T&D system won't be available for several days and 
am therefore soliciting advice up front.


Option 1 is to remount the file system on the clients in read-only mode, 
but I can imagine a couple problems with this method.


Option 2 is to deactivate the OSTs with lctl, but that also leaves me 
with some questions. In section 14.8.3.1a of the Lustre manual, it 
recommends setting max_create_count=0 with Lustre 2.9 and above. I'm 
using 2.5, not 2.9, but noticed that 
/proc/fs/lustre/osp/fs-*-osc-MDT/max_create_count does indeed exist. 
Does setting that option in 2.5 still have an effect in prohibiting new 
file and directory creates? Additionally, will clients be OK having all 
OSTs inactive?
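
For the record, a sketch of option 2 as I understand it, run on the MDS (device 
names are placeholders, and whether max_create_count behaves this way on 2.5 is 
exactly the open question):

  # stop new object creation on every OST from this MDT
  lctl set_param osp.fs-OST*-osc-MDT0000.max_create_count=0
  # and/or mark the OST connections inactive on the MDT
  lctl --device fs-OST0000-osc-MDT0000 deactivate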


Is there another viable option?

Many thanks,
Cameron


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org