[lustre-discuss] flock vs localflock

2018-07-05 Thread Vicker, Darby (JSC-EG311)
Hi everyone, We recently saw some extremely high stat loads on our lustre FS. Output from “llstat -i 1 mdt” looked like: [root@hpfs-fsl-mds0 lustre]# /proc/fs/lustre/mds/MDS/mdt/stats @ 1530642446.366015124 Name Cur.Count Cur.Rate #Events Unit last
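For context on the subject line: the two behaviors are chosen at client mount time. A minimal sketch, using the same placeholder address and filesystem name that appear later in this archive:

    # cluster-wide coherent flock/fcntl locking, granted through the Lustre lock manager
    mount -t lustre -o flock 192.x.x.x@tcp:/hpfs-fsl /mnt/lustre

    # locking that is only coherent within a single client node (cheaper, no server round trips)
    mount -t lustre -o localflock 192.x.x.x@tcp:/hpfs-fsl /mnt/lustre

Without either option, flock() calls on the mount are simply refused.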

Re: [lustre-discuss] Luster access from windows

2018-05-24 Thread Vicker, Darby (JSC-EG311)
We use samba to export lustre from a linux lustre client to both Windows and MacOS regularly. We are pretty much using samba defaults, but this is not intended for serious use of lustre - mainly just for pulling a few files over for presentations or to attach to an email, etc.
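A minimal sketch of that kind of re-export, assuming a hypothetical share name and mount point (the settings shown are essentially Samba defaults):

    # hypothetical share stanza appended to /etc/samba/smb.conf on the exporting lustre client
    printf '%s\n' '[lustre]' '   path = /lustre' '   read only = no' '   browseable = yes' \
        >> /etc/samba/smb.conf
    testparm -s              # sanity-check the resulting configuration
    systemctl restart smb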

Re: [lustre-discuss] Synchronous writes on a loaded ZFS OST

2018-05-08 Thread Vicker, Darby (JSC-EG311)
Yes, we experienced some similar slowness on our ZFS-based lustre FS too. More details here: http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-April/014390.html This also affects git repos quite a bit. The fix suggested by Andreas in that thread worked fairly well and we
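The actual fix is in the linked thread; purely as an illustration of the knob most often discussed for this symptom (an assumption here, not necessarily what Andreas suggested), the ZFS sync property on the OST datasets trades crash safety for sync-write latency:

    # dataset names are placeholders
    zfs get sync ostpool0/ost0

    # sync=disabled acknowledges synchronous writes before they reach stable storage;
    # data written in the seconds before an OSS crash can be lost
    zfs set sync=disabled ostpool0/ost0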

Re: [lustre-discuss] Adding a servicenode (failnode) to existing OSTs

2018-04-03 Thread Vicker, Darby (JSC-EG311)
We have a similar setup and recently had to do something similar - in our case, to add a 2nd IB NID. The admin manual says that servicenode is preferred over failnode, so that's what we use. It works great - we love the capability to fail over for maintenance or troubleshooting. Our
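A rough sketch of what adding service nodes to an existing ZFS-backed OST can look like (NIDs and dataset names are placeholders; a writeconf may also be needed so the MGS regenerates its configuration logs):

    # run on the OSS with the target unmounted; each --servicenode lists all NIDs of one server
    tunefs.lustre \
      --servicenode=192.0.2.10@tcp,10.0.0.10@o2ib \
      --servicenode=192.0.2.11@tcp,10.0.0.11@o2ib \
      ostpool0/ost0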

Re: [lustre-discuss] Adding a new NID

2018-01-08 Thread Vicker, Darby (JSC-EG311)
ess or implied, and I assume no liability, etc. ☺ Nevertheless, I hope this helps, at least as a cross-reference. Malcolm. From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of "Vicker, Darby (JSC-EG311)" <darby.vicke...@nasa.gov> Date: Saturday, 6 Janu

Re: [lustre-discuss] Adding a new NID

2018-01-05 Thread Vicker, Darby (JSC-EG311)
Sorry – one other question. We are configured for failover too. Will the "lctl replace_nids" do the right thing or should I do the tunefs to make sure all the failover pairs get updated properly? This is what our tunefs command would look like for an OST: tunefs.lustre \
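For comparison, the replace_nids route looks roughly like the sketch below (target names and NIDs are placeholders); it is run on the MGS node with the MGS mounted and the other targets stopped, once per target:

    lctl replace_nids hpfs-fsl-OST0000 192.0.2.10@tcp,10.0.0.10@o2ib
    lctl replace_nids hpfs-fsl-MDT0000 192.0.2.12@tcp,10.0.0.12@o2ib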

[lustre-discuss] Adding a new NID

2018-01-05 Thread Vicker, Darby (JSC-EG311)
Hello everyone, We have an existing LFS that is dual-homed on ethernet (mainly for our workstations) and IB (for the computational cluster), ZFS backend for the MDT and OST's. We just got a new computational cluster and need to add another IB NID. The procedure for doing this is straight

Re: [lustre-discuss] MGS is not working in HA

2017-10-26 Thread Vicker, Darby (JSC-EG311)
8. Any suggestions? Regards Ravi Konila From: Vicker, Darby (JSC-EG311) Sent: Wednesday, October 25, 2017 11:51 PM To: Mannthey, Keith ; Ravi Konila ; Lustre Discuss Subject: Re: [lustre-discuss] MGS is not working in HA Sorry – I also meant to say that the resolution went off the mailing list and

Re: [lustre-discuss] MGS is not working in HA

2017-10-25 Thread Vicker, Darby (JSC-EG311)
Sorry – I also meant to say that the resolution went off the mailing list and was continued in LU-8397. You can find the patch there. From: lustre-discuss on behalf of Darby Vicker Date: Wednesday, October 25, 2017 at 1:17 PM

Re: [lustre-discuss] MGS is not working in HA

2017-10-25 Thread Vicker, Darby (JSC-EG311)
Which version of lustre are you using? We initially had problems with this too when using failover with lustre 2.8 and 2.9. We got a patch that fixed it and recent versions work fine for us. We have a combined MGS/MDS so our scenario is a little different but this sounds very similar to our

Re: [lustre-discuss] Recovering data from failed Lustre file system

2017-10-13 Thread Vicker, Darby (JSC-EG311)
I'd recommend using this command to find all the files that were affected: lfs find --ost > lost_files.txt Then run an rsync to copy off the data that is left. Something like: rsync --exclude lost_files.txt From: lustre-discuss on behalf of
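Filled out with placeholder paths and an arbitrary OST index, since the commands above are truncated in this preview; note that lfs find prints absolute paths, which generally need to be made relative to the rsync source before --exclude-from will match them:

    # list every file with at least one object on the failed OST
    lfs find /lustre --ost 4 > lost_files.txt

    # strip the mount-point prefix, then copy off everything that is still intact
    sed 's|^/lustre/||' lost_files.txt > lost_files.rel
    rsync -av --exclude-from=lost_files.rel /lustre/ /backup/lustre/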

Re: [lustre-discuss] Multiple Lustre filesystem

2017-09-15 Thread Vicker, Darby (JSC-EG311)
e MDT's for different file systems against the same MDS would be as I have never attempted such a configuration. -cf On Fri, Sep 15, 2017 at 10:29 AM, Vicker, Darby (JSC-EG311) <darby.vicke...@nasa.gov> wrote: From: lustre-discuss <lustre-discuss-bo

Re: [lustre-discuss] Multiple Lustre filesystem

2017-09-15 Thread Vicker, Darby (JSC-EG311)
From: lustre-discuss on behalf of Colin Faber Date: Friday, September 15, 2017 at 9:48 AM To: Ravi Konila Cc: Lustre Discuss Subject: Re: [lustre-discuss] Multiple Lustre

Re: [lustre-discuss] sudden read performance drop on sequential forward read.

2017-08-31 Thread Vicker, Darby (JSC-EG311)
This sounds exactly like what we ran into when we upgraded to 2.9 (and is still present in 2.10). See these: http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-May/014524.html https://jira.hpdd.intel.com/browse/LU-9574 The mailing list thread describes our problem a little more

Re: [lustre-discuss] nodes crash during ior test

2017-08-22 Thread Vicker, Darby (JSC-EG311)
Any more info on this? I’m running into the same thing. I tried to find an LU on this but didn’t see anything directly related.

[lustre-discuss] PFL error

2017-08-07 Thread Vicker, Darby (JSC-EG311)
Hello, We've upgraded to 2.10 and I've been playing with progressive file layouts. To begin, I'm just setting a test directory to use the following PFL.
  lfs setstripe \
    -E 4M   -c 1 -S 1M -i -1 \
    -E 256M -c 4 -S 1M -i -1 \
    -E -1   -c 8 -S 4M -i -1 .
I then created some files in the
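To see what a file created under that layout actually looks like, lfs getstripe on one of the test files shows the individual components (the file name here is hypothetical):

    # per-component extents, stripe count/size, and which components are instantiated
    lfs getstripe /lustre/pfl-test/file1

    # just the number of components
    lfs getstripe --component-count /lustre/pfl-test/file1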

Re: [lustre-discuss] set OSTs read only ?

2017-07-12 Thread Vicker, Darby (JSC-EG311)
I’m not sure if there is something you can do on the server side for this but I think you can remount read only on the clients fairly easily without disruption: # mount -t lustre 192.x.x.x@tcp:/hpfs-fsl/ /mnt # touch /mnt/test # mount -t lustre -o remount,ro 192.x.x.x@tcp:/hpfs-fsl/ /mnt # touch
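Spelled out end to end (same placeholder address as above), including the check that the remount took effect and the step to undo it:

    mount -t lustre 192.x.x.x@tcp:/hpfs-fsl /mnt
    mount -t lustre -o remount,ro 192.x.x.x@tcp:/hpfs-fsl /mnt
    touch /mnt/test        # should now fail with "Read-only file system"
    mount -t lustre -o remount,rw 192.x.x.x@tcp:/hpfs-fsl /mnt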

[lustre-discuss] IML help/docs

2017-06-22 Thread Vicker, Darby (JSC-EG311)
Hello, I've been looking through the LUG '17 docs and the references to the Intel Manager for Lustre caught my eye. I'd like to try it out, but I'm having trouble getting it compiled and haven't found much in the way of docs or further information. If anyone knows where to find more info,

Re: [lustre-discuss] Large file read performance degradation from multiple OST's

2017-05-30 Thread Vicker, Darby (JSC-EG311)
Using the git bisect we were able to isolate the problem to this commit: commit d8467ab8a2ca15fbbd5be3429c9cf9ceb0fa78b8 LU-7990 clio: revise readahead to support 16MB IO In our testing, we can read from a large file (stripe count=4) at near line rate (10 GbE – so 1200 MB/s) using a client
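The bisect itself is ordinary git; a rough outline (tag names approximate), rebuilding and reinstalling the client modules at each step:

    # in a checkout of the lustre-release source tree
    git bisect start
    git bisect bad  v2_9_0        # client version that shows the slowdown
    git bisect good v2_8_0        # client version that performs as expected
    # at each step: build/install the client, remount, rerun the large-file read test, then:
    git bisect good               # or: git bisect bad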

Re: [lustre-discuss] Large file read performance degradation from multiple OST's

2017-05-26 Thread Vicker, Darby (JSC-EG311)
>> I tried a 2.8 client mounting the 2.9 servers and that showed the expected behavior – increasing performance with increasing OST's. Two things:
>>
>> 1. Any pointers to compiling a 2.8 client on recent RHEL 7 kernels would be helpful. I had to boot into an older kernel

Re: [lustre-discuss] Large file read performance degradation from multiple OST's

2017-05-26 Thread Vicker, Darby (JSC-EG311)
sa.gov> Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org> Subject: Re: [lustre-discuss] Large file read performance degradation from multiple OST's On May 24, 2017, at 10:04, Vicker, Darby (JSC-EG311) <darby.vicke...@nasa.gov> wrote: >

Re: [lustre-discuss] Large file read performance degradation from multiple OST's

2017-05-24 Thread Vicker, Darby (JSC-EG311)
I tried a 2.8 client mounting the 2.9 servers and that showed the expected behavior – increasing performance with increasing OST's. Two things: 1. Any pointers to compiling a 2.8 client on recent RHEL 7 kernels would be helpful. I had to boot into an older kernel to get the above test done.

[lustre-discuss] Large file read performance degradation from multiple OST's

2017-05-22 Thread Vicker, Darby (JSC-EG311)
Hello, We recently noticed that the large file read performance on our 2.9 LFS is dramatically worse than it used to be. The attached plot is the result of a test script that uses dd to write a large file (50GB) to disk, read that file and then copy it to a 2nd file to test write, read and
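The test script isn't included in this preview; a minimal stand-in for the same write/read/copy pattern (file size, paths, and the cache-dropping step are assumptions) would be:

    cd /lustre/benchdir
    dd if=/dev/zero of=bigfile bs=1M count=51200      # ~50GB write
    echo 3 > /proc/sys/vm/drop_caches                 # drop the client page cache before reading
    dd if=bigfile of=/dev/null bs=1M                  # read
    dd if=bigfile of=bigfile2 bs=1M                   # copy (read + write)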

Re: [lustre-discuss] Lustre 2.9 performance issues

2017-04-30 Thread Vicker, Darby (JSC-EG311)
This worked great. We implemented it on Friday and the timings of the dd test on our 2.9/ZFS LFS have dropped to under a second. Thanks a lot. The risk of both the client and OSS crashing within a few seconds is low enough for us compared to the performance gain. The commit you pointed to

Re: [lustre-discuss] Lustre 2.9 performance issues

2017-04-26 Thread Vicker, Darby (JSC-EG311)
Thanks for the kstat info. Our 2.4 LFS has quite a different architecture – ldiskfs on a hardware RAID – so there's no opportunity to compare the zfs kstat info between the two. Our 2.9 LFS is barely in production at this point and only a handful of people have moved over to it. So its

[lustre-discuss] Lustre 2.9 performance issues

2017-04-25 Thread Vicker, Darby (JSC-EG311)
Hello, We are having a few performance issues with our newest lustre file system. Here is the overview of our setup:
-) Supermicro servers connected to external 12Gb/s SAS JBODs for MDT/OSS storage
-) CentOS version = 7.3.1611 (kernel 3.10.0-514.2.2.el7.x86_64) on the servers and clients

Re: [lustre-discuss] Odd file permission help please. Has lustre corrupted?

2017-03-13 Thread Vicker, Darby (JSC-EG311)
eck to see if you're experiencing any group upcall errors on the MDS? Have you tried disabling group upcall on the MDS? Also, when looking at the problematic files with lfs getstripe, have you verified that those objects exist? Any errors about missing objects on the OSTs? Any other errors? On Mon,

Re: [lustre-discuss] Odd file permission help please. Has lustre corrupted?

2017-03-13 Thread Vicker, Darby (JSC-EG311)
, Murshid Azman <murshid.az...@gmail.com>, "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org> Subject: Re: [lustre-discuss] Odd file permission help please. Has lustre corrupted? Is your authentication system across all of your lustre nodes in sync? On Mon,

Re: [lustre-discuss] Odd file permission help please. Has lustre corrupted?

2017-03-13 Thread Vicker, Darby (JSC-EG311)
We have the same thing happening on one of our lustre file systems (2.4.3 servers, 2.9 client). All OST’s are connected so that’s not the problem. We do use ACL’s so I was hopeful this was the cause for us too. But this doesn’t seem to be the case – any operation on the problem directory just

Re: [lustre-discuss] design to enable kernel updates

2017-02-10 Thread Vicker, Darby (JSC-EG311)
f.john...@aeoncomputing.com>> wrote: You're also leaving out the corosync/pacemaker/stonith configuration. That is unless you are doing manual export/import of pools. On Fri, Feb 10, 2017 at 9:03 PM, Vicker, Darby (JSC-EG311) <darby.vicke...@nasa.gov> wrote: Sur

Re: [lustre-discuss] design to enable kernel updates

2017-02-10 Thread Vicker, Darby (JSC-EG311)
stand with OSTs and MDTs where all I really need is to have the failnode set when I do the mkfs.lustre. However, as I understand it, you have to use something like pacemaker and drbd to deal with the MGS/MGT. Is this how you approached it? Brian Andrus On 2/6/2017 12:58 PM, Vicker, Darb

Re: [lustre-discuss] design to enable kernel updates

2017-02-06 Thread Vicker, Darby (JSC-EG311)
Agreed. We are just about to go into production on our next LFS with the setup described. We had to get past a bug in the MGS failover for dual-homed servers but as of last week that is done and everything is working great (see "MGS failover problem" thread on this mailing list from this
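For reference, the manual failover exercised in this setup amounts to moving the ZFS pool between the server pair by hand; a sketch with placeholder pool, dataset, and mount-point names (pacemaker/corosync would automate the same steps):

    # on the node giving up the target (if it is still reachable)
    umount /mnt/lustre/ost0
    zpool export ostpool0

    # on the failover partner
    zpool import -f ostpool0
    mount -t lustre ostpool0/ost0 /mnt/lustre/ost0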

Re: [lustre-discuss] MGS failover problem

2017-01-19 Thread Vicker, Darby (JSC-EG311)
I've gone back to having the tcp and o2ib NIDS again. Even though the ZFS properties show all the NIDS, the info in /proc shows only a single NID. It's no wonder the failover isn't working – it seems that the OST doesn't have any failover NID's for the MGC like it did when I had only the tcp
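One way to see what a client or OSS actually believes about the MGS connection, as opposed to what the ZFS properties record, is the import state; a sketch assuming a reasonably recent lctl:

    # on an OSS or client: the MGC import, including the peer NIDs it knows about
    lctl get_param mgc.*.import

    # same idea for the OST and MDT connections (on a client)
    lctl get_param osc.*.import mdc.*.import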

Re: [lustre-discuss] MGS failover problem

2017-01-13 Thread Vicker, Darby (JSC-EG311)
Progress. I did another round of "tunefs.lustre --writeconf" to take out the IB so we are on Ethernet only. I think the MDS/MGS failover worked properly – note the "Connection restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)" message in the OSS logs below – and that the

Re: [lustre-discuss] MGS failover problem

2017-01-11 Thread Vicker, Darby (JSC-EG311)
>> Getting failover right over multiple separate networks can be a real hair-pulling experience.
>
> Darby: Do you have the option of (at least temporarily) running the file system with only Infiniband configured? If you could set up the file system to only use Infiniband, then that would

Re: [lustre-discuss] MGS failover problem

2017-01-11 Thread Vicker, Darby (JSC-EG311)
> The question I have in this is how long are you waiting, and how are you determining that lnet has hung?
For the example I just sent today, I waited about 10 minutes. But the other day it looks like I waited about 20 minutes before rebooting, as I couldn't kill lnet. I'm calling it hung because

Re: [lustre-discuss] MGS failover problem

2017-01-11 Thread Vicker, Darby (JSC-EG311)
I tried a failover making sure lustre, including lnet, was completely shut down on the primary MDS. This didn't work either. Lnet hung like I remembered. So I powered down the primary MDS to force it offline and then mounted lustre on the secondary MDS. The services and a client recover but

Re: [lustre-discuss] MGS failover problem

2017-01-11 Thread Vicker, Darby (JSC-EG311)
My understanding is that the MMP only works with ldiskfs – it's not enabled with a ZFS backend yet. But I could be wrong about that too. This is our first attempt at setting up failover so we would like to get comfortable doing this manually before we set up something automated. I'll try the

Re: [lustre-discuss] MGS failover problem

2017-01-10 Thread Vicker, Darby (JSC-EG311)
node options if you wanted. -- Rick Mohr Senior HPC System Administrator National Institute for Computational Sciences http://www.nics.tennessee.edu > On Jan 8, 2017, at 11:58 PM, Vicker, Darby (JSC-EG311) <darby.vicke...@nasa.gov> wrote: >

[lustre-discuss] MGS failover problem

2017-01-08 Thread Vicker, Darby (JSC-EG311)
We have a new set of hardware we are configuring as a lustre file system. We are having a problem with MGS failover and could use some help. It was formatted originally using 2.8 but we have since upgraded to 2.9. We are using a JBOD with server pairs for failover and are using ZFS as the

Re: [lustre-discuss] Inaccessible directory

2016-09-13 Thread Vicker, Darby (JSC-EG311)
ystem this may take several hours.
>
> For disaster recovery purposes, a device-level backup (dd) is more "plug and play" in that the whole image is restored from the backup and the LFSCK phase only needs to handle files that have been modified since the time the backup
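As a rough illustration of the two recovery paths being compared (device names, filesystem name, and the unmounted state of the target are assumptions):

    # raw device-level backup and restore of an unmounted target
    dd if=/dev/sdX of=/backup/mdt.img bs=4M
    dd if=/backup/mdt.img of=/dev/sdX bs=4M

    # after restoring, run LFSCK from the MDS to reconcile anything changed since the backup
    lctl lfsck_start -M hpfs2eg3-MDT0000 -t all
    lctl get_param -n mdd.hpfs2eg3-MDT0000.lfsck_layout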

Re: [lustre-discuss] Inaccessible directory

2016-09-12 Thread Vicker, Darby (JSC-EG311)
kup was created. Cheers, Andreas -- Andreas Dilger Lustre Principal Architect Intel High Performance Data Division On 2016/09/01, 14:49, "Vicker, Darby (JSC-EG311)" <darby.vicke...@nasa.gov> wrote: Thanks. This is happening on all the clients so it's not a DLM lock problem.

Re: [lustre-discuss] Inaccessible directory

2016-09-01 Thread Vicker, Darby (JSC-EG311)
first can also give you an idea of what kind of corruption is present before making the fix. Cheers, Andreas On Aug 31, 2016, at 10:54, Vicker, Darby (JSC-EG311) <darby.vicke...@nasa.gov> wrote: Hello, We’ve run into a problem where an entire directo

[lustre-discuss] Inaccessible directory

2016-08-31 Thread Vicker, Darby (JSC-EG311)
Hello, We’ve run into a problem where an entire directory on our lustre file system has become inaccessible. # mount | grep lustre2 192.52.98.142@tcp:/hpfs2eg3 on /lustre2 type lustre (rw,flock) # ls -l /lustre2/mirrors/cpas ls: cannot access /lustre2/mirrors/cpas: Stale file handle # ls -l