Re: [lustre-discuss] billions of 50k files

2017-11-29 Thread Brian Andrus

Andreas,

Thanks for responding.

Right now, I am looking at using ZFS and an SSD/NVMe for the journal 
disk. I suggested mirroring, but they aren't too keen on losing 50% of 
their purchased storage.

This particular system will likely not be scaled up at a future date.

It seems like 2.11 may be a good direction if they can wait. I do 
like the idea of running an MDT-only system in the future. With multiple 
MDTs, that would be a great match for this scenario and also have the 
ability to grow in the future. And being able to "import" an existing 
filesystem is awesome. A vote for that!


Brian Andrus


On 11/29/2017 6:08 PM, Dilger, Andreas wrote:

On Nov 29, 2017, at 15:31, Brian Andrus  wrote:

All,

I have always seen lustre as a good solution for large files and not the best 
for many small files.
Recently, I have seen a request for a small lustre system (2 OSSes, 1 MDS) that 
would be for billions of files that average 50k-100k.

This is about 75TB of usable capacity per billion files.  Are you looking at 
HDD or SSD storage?  RAID or mirror?  What kind of client load, and how much 
does this system need to scale in the future?


It seems to me, that for this to be 'of worth', the block sizes on disks need 
to be small, but even then, with tcp overhead and inode limitations, it may 
still not perform all that well (compared to larger files).

Even though Lustre does 1MB or 4MB RPCs, it only allocates as much space on the 
OSTs as needed for the file data.  This means 4KB blocks with ldiskfs, and 
variable (power-of-two) blocksize on ZFS (64KB or 128KB blocks by default). You 
could constrain ZFS to smaller blocks if needed (e.g. recordsize=32k), or 
enable ZFS compression to try and fit the data into smaller blocks (depends 
whether your data is compressible or not).
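
As a concrete sketch of that ZFS tuning (the pool/dataset names below are only 
placeholders, and lz4 is just one possible compression choice):

zfs set recordsize=32k ostpool/ost0     # cap the block size used for small-file data
zfs set compression=lz4 ostpool/ost0    # only pays off if the data actually compresses
zfs get recordsize,compression ostpool/ost0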

The drawback is that every Lustre file currently needs an MDT inode (1KB+) and 
an OST inode, so Lustre isn't the most efficient for small files.


Am I off here? Have there been some developments in lustre that help this 
scenario (beyond small files being stored on the MDT directly)?

The Data-on-MDT feature (DoM) has landed for 2.11, which seems like it would 
suit your workload well, since it only needs a single MDT inode for small 
files, and reduces the overhead when accessing the file.  DoM will still be a 
couple of months before that is released, though you could start testing now if 
you were interested.  Currently DoM is intended to be used together with OSTs, 
but if there is a demand we could look into what is needed to run an MDT-only 
filesystem configuration (some checks in the code that prevent the filesystem 
becoming available before at least one OST is mounted would need to be removed).
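
For reference, once 2.11 is out, a DoM layout would be requested with something 
like the following composite-layout syntax (the path and sizes are only examples, 
not a recommendation):

lfs setstripe -E 64K -L mdt -E -1 -S 1M /mnt/lustre/smallfiles   # first 64KB on the MDT, remainder (if any) striped to OSTs
lfs getstripe /mnt/lustre/smallfiles                             # check the resulting components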

That said, you could also just set up a single NFS server with ZFS to handle 
the 75TB * N of storage, unless you need highly concurrent access to the files. 
 This would probably be acceptable if you don't need to scale too much (in 
capacity or performance), and don't have a large number of clients connecting.
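
A minimal sketch of that single-server alternative, assuming ZFS on Linux with its 
built-in NFS sharing (disk names, pool layout and subnet are placeholders):

zpool create tank raidz2 sdb sdc sdd sde sdf sdg          # illustrative pool layout only
zfs create -o recordsize=32k -o compression=lz4 tank/data
zfs set sharenfs='rw=@10.0.0.0/24' tank/data              # exports tank/data via the kernel NFS server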

One of the other features we're currently investigating (not sure how much interest there 
is yet) is to be able to "import" an existing ext4 or ZFS filesystem into 
Lustre as MDT (with DoM), and be able to grow horizontally by adding more MDTs or 
OSTs.  Some work is already being done that will facilitate this in 2.11 (DoM, and OI 
Scrub for ZFS), but more would be needed for this to work.  That would potentially allow 
you to start with a ZFS or ext4 NFS server, and then migrate to Lustre if you need to 
scale it up.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation









___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] billions of 50k files

2017-11-29 Thread Dilger, Andreas
On Nov 29, 2017, at 15:31, Brian Andrus  wrote:
> 
> All,
> 
> I have always seen lustre as a good solution for large files and not the best 
> for many small files.
> Recently, I have seen a request for a small lustre system (2 OSSes, 1 MDS) 
> that would be for billions of files that average 50k-100k.

This is about 75TB of usable capacity per billion files.  Are you looking at 
HDD or SSD storage?  RAID or mirror?  What kind of client load, and how much 
does this system need to scale in the future?

> It seems to me, that for this to be 'of worth', the block sizes on disks need 
> to be small, but even then, with tcp overhead and inode limitations, it may 
> still not perform all that well (compared to larger files).

Even though Lustre does 1MB or 4MB RPCs, it only allocates as much space on the 
OSTs as needed for the file data.  This means 4KB blocks with ldiskfs, and 
variable (power-of-two) blocksize on ZFS (64KB or 128KB blocks by default). You 
could constrain ZFS to smaller blocks if needed (e.g. recordsize=32k), or 
enable ZFS compression to try and fit the data into smaller blocks (depends 
whether your data is compressible or not).

The drawback is that every Lustre file currently needs an MDT inode (1KB+) and 
an OST inode, so Lustre isn't the most efficient for small files.

> Am I off here? Have there been some developments in lustre that help this 
> scenario (beyond small files being stored on the MDT directly)?

The Data-on-MDT feature (DoM) has landed for 2.11, which seems like it would 
suit your workload well, since it only needs a single MDT inode for small 
files, and reduces the overhead when accessing the file.  DoM will still be a 
couple of months before that is released, though you could start testing now if 
you were interested.  Currently DoM is intended to be used together with OSTs, 
but if there is a demand we could look into what is needed to run an MDT-only 
filesystem configuration (some checks in the code that prevent the filesystem 
becoming available before at least one OST is mounted would need to be removed).

That said, you could also just set up a single NFS server with ZFS to handle 
the 75TB * N of storage, unless you need highly concurrent access to the files. 
 This would probably be acceptable if you don't need to scale too much (in 
capacity or performance), and don't have a large number of clients connecting.

One of the other features we're currently investigating (not sure how much 
interest there is yet) is to be able to "import" an existing ext4 or ZFS 
filesystem into Lustre as MDT (with DoM), and be able to grow horizontally 
by adding more MDTs or OSTs.  Some work is already being done that will 
facilitate this in 2.11 (DoM, and OI Scrub for ZFS), but more would be needed 
for this to work.  That would potentially allow you to start with a ZFS or ext4 
NFS server, and then migrate to Lustre if you need to scale it up.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre compilation error

2017-11-29 Thread Dilger, Andreas
Rick,
Would you be able to open a ticket for this, and possibly submit a patch to fix 
the build?

Cheers, Andreas

On Nov 29, 2017, at 14:18, Mohr Jr, Richard Frank (Rick Mohr) 
<rm...@utk.edu> wrote:


On Oct 18, 2017, at 9:44 AM, parag_k <para...@citilindia.com> wrote:


I got the source from github.

My configure line is-

./configure --disable-client 
--with-kernel-source-header=/usr/src/kernels/3.10.0-514.el7.x86_64/ 
--with-o2ib=/usr/src/ofa_kernel/default/


Are you still running into this issue?  If so, try adding "--enable-server" and 
removing "--disable-client".  I was building lustre 2.10.1 today, and I 
initially had both "--disable-client" and "--enable-server" in my configuration 
line.  When I did that, I got error messages like these:

make[3]: *** No rule to make target `fld.ko', needed by `all-am'.  Stop.

When I removed the "--disable-client" option, the error went away.
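
Put differently, the configure line from the quoted message would become something 
like this (keeping the same kernel and OFED paths that were quoted):

./configure --enable-server \
    --with-kernel-source-header=/usr/src/kernels/3.10.0-514.el7.x86_64/ \
    --with-o2ib=/usr/src/ofa_kernel/default/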

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.1 + RHEL7 Page Allocation Failures

2017-11-29 Thread Dilger, Andreas
In particular, see the patch https://review.whamcloud.com/30164

LU-10133 o2iblnd: fall back to vmalloc for mlx4/mlx5

If a large QP is allocated with kmalloc(), but fails due to memory
fragmentation, fall back to vmalloc() to handle the allocation.
This is done in the upstream kernel, but was only fixed in mlx4
in the RHEL7.3 kernel, and neither mlx4 or mlx5 in the RHEL6 kernel.
Also fix mlx5 for SLES12 kernels.

Test-Parameters: trivial
Signed-off-by: Andreas Dilger
Change-Id: Ie74800edd27bf4c3210724079cbebbae532d1318

On Nov 29, 2017, at 06:09, Jones, Peter A  wrote:
> 
> Charles
> 
> That ticket is completely open so you do have access to everything. As I 
> understand it the options are to either use the latest MOFED update rather 
> than relying on the in-kernel OFED (which I believe is the advise usually 
> provided by Mellanox anyway) or else apply the kernel patch Andreas has 
> created that is referenced in the ticket.
> 
> Peter
> 
> On 2017-11-29, 2:50 AM, "lustre-discuss on behalf of Charles A Taylor" 
>  wrote:
> 
>> 
>> Hi All,
>> 
>> We recently upgraded from Lustre 2.5.3.90 on EL6 to 2.10.1 on EL7 (details 
>> below) but have hit what looks like LU-10133 (order 8 page allocation 
>> failures).
>> 
>> We don’t have access to look at the JIRA ticket in more detail but from what 
>> we can tell the fix is to change from vmalloc() to vmalloc_array() in 
>> the mlx4 drivers.  However, the vmalloc_array() infrastructure is in an 
>> upstream (far upstream) kernel so I’m not sure when we’ll see that fix.
>> 
>> While this may not be a Lustre issue directly, I know we can’t be the only 
>> Lustre site running 2.10.1 over IB on Mellanox ConnectX-3 HCAs.  So far we 
>> have tried increasing vm.min_free_kbytes to 8GB but that does not help.  
>> Zone_reclaim_mode is disabled (for other reasons that may not be valid under 
>> EL7) but order 8 chunks get depleted on both NUMA nodes so I’m not sure that 
>> is the answer either (though we have not tried it yet).
>> 
>> [root@ufrcmds1 ~]# cat /proc/buddyinfo 
>> Node 0, zone      DMA      1      0      0      0      2      1      1      0      1      1      3 
>> Node 0, zone    DMA32   1554  13496  11481   5108    150      0      0      0      0      0      0 
>> Node 0, zone   Normal 114119 208080  78468  35679   6215    690      0      0      0      0      0 
>> Node 1, zone   Normal  81295 184795 106942  38818   4485    293   1653      0      0      0      0 
>> 
>> I’m wondering if other sites are hitting this and, if so, what are you doing 
>> to work around the issue on your OSSs.  
>> 
>> Regards,
>> 
>> Charles Taylor
>> UF Research Computing
>> 
>> 
>> Some Details:
>> ---
>> OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
>> Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
>> Clients: ~1400 (still running 2.5.3.90 but we are in the process of 
>> upgrading)
>> Servers: 10 HA OSS pairs (20 OSSs)
>>128 GB RAM
>>6 OSTs (8+2 RAID-6) per OSS 
>>Mellanox ConnectX-3 IB/VPI HCAs 
>>RedHat Native IB Stack (i.e. not MOFED)
>>mlx4_core driver:
>>   filename:   
>> /lib/modules/3.10.0-693.2.2.el7_lustre.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
>>   version:2.2-1
>>   license:Dual BSD/GPL
>>   description:Mellanox ConnectX HCA low-level driver
>>   author: Roland Dreier
>>   rhelversion:7.4
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] weird issue w. lnet routers

2017-11-29 Thread John Casu

thanks guys for all your help.

looks like the issue is fundamentally poor performance across 100GbE, where I'm only
getting ~50Gb/s using iperf. I believe the MTU is set correctly across all my systems,
using ConnectX-4 in 100GbE mode.

thanks again,
-john
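
For anyone hitting the same thing, a quick way to sanity-check the pieces discussed 
in this thread (the interface name, server address and stream count are placeholders; 
the right socklnd credit values depend on the fabric):

ip link show dev eth100 | grep mtu      # confirm MTU 9000 on every hop, switches included
iperf3 -c <server> -P 8                 # parallel streams; a single stream rarely fills 100GbE
cat /sys/module/ksocklnd/parameters/peer_credits /sys/module/ksocklnd/parameters/credits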



On 11/28/17 9:03 PM, Colin Faber wrote:

Are peer credits set appropriately across your fabric?

On Nov 28, 2017 8:40 PM, "john casu" <j...@chiraldynamics.com> wrote:

Thanks,
just about to try that MTU setting.

It's a small lustre system... 2 routers, MDS/MGS pair, OSS pair, JBOD pair 
(112 drives for OST)
and yes, routing between EDR & 100GbE

-john

On 11/28/17 7:28 PM, Raj wrote:

John, increasing the MTU size on the Ethernet side should increase the b/w. I 
also have a feeling that some LNet routers and/or
intermediate switches/routers do not have jumbo frames turned on (some 
switches need to be set to 9212 bytes).
How many LNet routers are you using? I believe you are routing between 
EDR IB and 100GbE.


On Tue, Nov 28, 2017 at 7:21 PM John Casu <j...@chiraldynamics.com> wrote:

     just built a system w. lnet routers that bridge Infiniband & 
100GbE, using Centos built in Infiniband support
     servers are Infiniband, clients are 100GbE (connectx-4 cards)

     my direct write performance from clients over Infiniband is around 
15GB/s

     When I introduced the lnet routers, performance dropped to 10GB/s

     Thought the problem was an MTU of 1500, but when I changed the 
MTUs to 9000
     performance dropped to 3GB/s.

     When I tuned according to John Fragella's LUG slides, things went 
even slower (1.5GB/s write)

     does anyone have any ideas on what I'm doing wrong??

     thanks,
     -john c.

     ___
     lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org 


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org 
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org 



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Failure migrating OSTs in KVM lustre 2.7.0 testbed

2017-11-29 Thread Scott Wood
Once I've had one fail a migration between hosts, it stays failed.  I've waited 
a bit and tried again, and it fails to mount with the same errors (or 
messages).  I am then only able to remount it on the host that originally had 
it mounted.  Once that has been done, it's happy and, the next time I try to 
migrate it, it may or may not work.  Of course in an attempt to recreate and 
test, just now, all 8 OSTs are happily unmounting from their primary hosts and 
remounting on their secondary hosts, within  a second or two.  I'll keep 
testing and hope to be able to recreate it and get a handle on the timing.  
Again, thanks for the troubleshooting

Scott
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Failure migrating OSTs in KVM lustre 2.7.0 testbed

2017-11-29 Thread Brian Andrus
Ok. So when you say 'occasionally' does that mean if you try the command 
again, it works?


If so, I'm wondering if you are doing it before the timeout period has 
expired, so lustre is still expecting the OST to be on the original OSS. 
That is, it is still in a window where "maybe it will come back".



Brian Andrus


On 11/29/2017 3:09 PM, Scott Wood wrote:


Hi folks,


In an effort to replicate a production environment to do a test 
upgrade, I've created a six server KVM testbed on a Centos 7.4 host 
with CentOS 6 guests.   I have four OSS and two MDSs.  I have qcow2 
virtual disks visible to the servers in pairs.  Each OSS has two OSTs 
and can also mount its paired server's two OSTs.  I have separate MGT 
and MDT volumes, again, both visible and mountable by either MDS.  
When I unmount an OST from one of the OSSs and try to mount it on what 
will be its HA pair (failing over manually now until I get it working, 
then I'll install corosync and pacemaker), the second guest to mount 
the OST *occasionally* fails as follows:



[root@fakeoss4 ~]# mount /mnt/OST7
mount.lustre: increased /sys/block/vde/queue/max_sectors_kb from 1024 
to 2147483647
mount.lustre: mount /dev/vde at /mnt/OST7 failed: No such file or 
directory

Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

And, from /var/log/messages:

Nov 29 10:55:33 fakeoss4 kernel: LDISKFS-fs (vdd): mounted filesystem 
with ordered data mode. quota=on. Opts:
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(llog_osd.c:236:llog_osd_read_header()) fake-OST0006-osd: bad 
log fake-OST0006 [0xa:0x10:0x0] header magic: 0x0 (expected 0x10645539)
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(mgc_request.c:1739:mgc_llog_local_copy()) 
MGC192.168.122.5@tcp: failed to copy remote log fake-OST0006: rc = -5
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 13a-8: Failed to get MGS 
log fake-OST0006 and no local copy.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 15c-8: 
MGC192.168.122.5@tcp: The configuration from log 'fake-OST0006' failed 
(-2). This may be the result of communication errors between this node 
and the MGS, a bad configuration, or other errors. See the syslog for 
more information.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount_server.c:1299:server_start_targets()) failed to 
start server fake-OST0006: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount_server.c:1783:server_fill_super()) Unable to start 
targets: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount_server.c:1498:server_put_super()) no obd fake-OST0006
Nov 29 10:55:34 fakeoss4 kernel: Lustre: server umount fake-OST0006 
complete
Nov 29 10:55:34 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount (-2)


The OSS that fails to mount can see the MGS in question:

[root@fakeoss4 ~]# lctl ping 192.168.122.5
12345-0@lo
12345-192.168.122.5@tcp

The environment was built as follows:  A guest VM was installed from 
CentOS-6.5 install media. The kernel was then updated to 
2.6.32-504.8.1.el6_lustre.x86_64 from the Intel repos.  The Intel 
binary RPMs for lustre were then installed.  "exclude=kernel*" was 
added to /etc/yum.repos.d and a "yum update" was run, so it's an 
up-to-date system with the exception of the locked-down kernel. 
 e2fsprogs-1.42.12.wc1-7.el6.x86_64 is the version installed.  The VM 
was then cloned to make the six lustre servers and the filesystems 
were created with the following options:



[root@fakemds1 ~]# mkfs.lustre --fsname=fake --mgs 
--servicenode=192.168.122.5@tcp0 --servicenode=192.168.122.67@tcp0 
/dev/vdb


[root@fakemds1 ~]# mkfs.lustre --reformat --fsname=fake --mdt 
--index=0 --servicenode=192.168.122.5@tcp0 
--servicenode=192.168.122.67@tcp0 
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdc



[root@fakeoss1 ~]# mkfs.lustre --reformat --fsname=fake --ost 
--index=0 --servicenode=192.168.122.197@tcp0 
--servicenode=192.168.122.238@tcp0 
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdb #repeated 
for 3 more OSTs with changed index and devices appropriately



[root@fakeoss3 ~]# mkfs.lustre --reformat --fsname=fake --ost 
--index=4 --servicenode=192.168.122.97@tcp0 
--servicenode=192.168.122.221@tcp0 
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdb #repeated 
for 3 more OSTs with changed index and devices appropriately



Virtual disks were set as shareable and made visible to their correct 
VMs and often do mount, but occasionally (more than half the time) 
fail as above.  Have I missed any important information that could 
point to the cause?



Once I get this VM environment stable, I intend to update it to lustre 
2.10.1.  Thanks in advance for any troubleshooting tips you can provide.



Cheers

Scott



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi

Re: [lustre-discuss] Failure migrating OSTs in KVM lustre 2.7.0 testbed

2017-11-29 Thread Scott Wood
Heh.  Fair question, and yes.  You had to rule it out though.  fakemds1 and 
fakemds2 have /mnt/MGT and /mnt/MDT.  fakeoss1 and fakeoss2 have 
/mnt/OST{0..3}.  fakeoss3 and fakeoss4 have /mnt/OST{4..7}.  Also to clarify, 
every command in my previous email that has " at " was actually the at symbol.


Cheers

Scott


From: lustre-discuss  on behalf of 
Brian Andrus 
Sent: Thursday, 30 November 2017 9:32:24 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Failure migrating OSTs in KVM lustre 2.7.0 testbed

I know it may be obvious, but did you 'mkdir /mnt/OST7'?

Brian Andrus


On 11/29/2017 3:09 PM, Scott Wood wrote:
> [root@fakeoss4 ~]# mount /mnt/OST7
> mount.lustre: increased /sys/block/vde/queue/max_sectors_kb from 1024
> to 2147483647
> mount.lustre: mount /dev/vde at /mnt/OST7 failed: No such file or
> directory

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Failure migrating OSTs in KVM lustre 2.7.0 testbed

2017-11-29 Thread Brian Andrus

I know it may be obvious, but did you 'mkdir /mnt/OST7'?

Brian Andrus


On 11/29/2017 3:09 PM, Scott Wood wrote:

[root@fakeoss4 ~]# mount /mnt/OST7
mount.lustre: increased /sys/block/vde/queue/max_sectors_kb from 1024 
to 2147483647
mount.lustre: mount /dev/vde at /mnt/OST7 failed: No such file or 
directory


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] billions of 50k files

2017-11-29 Thread E.S. Rosenberg
Maybe this would be where multiple MDS + small files on MDS would shine?
My 1 millionth of a bitcoin,
Eli

On Thu, Nov 30, 2017 at 12:31 AM, Brian Andrus  wrote:

> All,
>
> I have always seen lustre as a good solution for large files and not the
> best for many small files.
> Recently, I have seen a request for a small lustre system (2 OSSes, 1 MDS)
> that would be for billions of files that average 50k-100k.
>
> It seems to me, that for this to be 'of worth', the block sizes on disks
> need to be small, but even then, with tcp overhead and inode limitations,
> it may still not perform all that well (compared to larger files).
>
> Am I off here? Have there been some developments in lustre that help this
> scenario (beyond small files being stored on the MDT directly)?
>
> Thanks for any insight,
>
> Brian Andrus
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Failure migrating OSTs in KVM lustre 2.7.0 testbed

2017-11-29 Thread Scott Wood
Hi folks,


In an effort to replicate a production environment to do a test upgrade, I've 
created a six server KVM testbed on a Centos 7.4 host with CentOS 6 guests.
I have four OSS and two MDSs.  I have qcow2 virtual disks visible to the 
servers in pairs.  Each OSS has two OSTs and can also mount its paired server's 
two OSTs.  I have separate MGT and MDT volumes, again, both visible and 
mountable by either MDS.  When I unmount an OST from one of the OSSs and try to 
mount it on what will be its HA pair (failing over manually now until I get it 
working, then I'll install corosync and pacemaker), the second guest to mount 
the OST *occasionally* fails as follows:


[root@fakeoss4 ~]# mount /mnt/OST7
mount.lustre: increased /sys/block/vde/queue/max_sectors_kb from 1024 to 
2147483647
mount.lustre: mount /dev/vde at /mnt/OST7 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

And, from /var/log/messages:

Nov 29 10:55:33 fakeoss4 kernel: LDISKFS-fs (vdd): mounted filesystem with 
ordered data mode. quota=on. Opts:
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(llog_osd.c:236:llog_osd_read_header()) fake-OST0006-osd: bad log 
fake-OST0006 [0xa:0x10:0x0] header magic: 0x0 (expected 0x10645539)
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(mgc_request.c:1739:mgc_llog_local_copy()) MGC192.168.122.5@tcp: failed 
to copy remote log fake-OST0006: rc = -5
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 13a-8: Failed to get MGS log 
fake-OST0006 and no local copy.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 15c-8: MGC192.168.122.5@tcp: The 
configuration from log 'fake-OST0006' failed (-2). This may be the result of 
communication errors between this node and the MGS, a bad configuration, or 
other errors. See the syslog for more information.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount_server.c:1299:server_start_targets()) failed to start server 
fake-OST0006: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount_server.c:1783:server_fill_super()) Unable to start targets: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount_server.c:1498:server_put_super()) no obd fake-OST0006
Nov 29 10:55:34 fakeoss4 kernel: Lustre: server umount fake-OST0006 complete
Nov 29 10:55:34 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount  (-2)

The OSS that fails to mount can see the MGS in question:

[root@fakeoss4 ~]# lctl ping 192.168.122.5
12345-0@lo
12345-192.168.122.5@tcp


The environment was built as follows:  A guest VM was installed from CentOS-6.5 
install media. The kernel was then updated to 2.6.32-504.8.1.el6_lustre.x86_64 
from the Intel repos.  The Intel binary RPMs for lustre were then installed.  
"exclude=kernel*" was added to /etc/yum.repos.d and a "yum update" was run, so 
it's an up-to-date system with the exception of the locked-down kernel.  
e2fsprogs-1.42.12.wc1-7.el6.x86_64 is the version installed.  The VM was then 
cloned to make the six lustre servers and the filesystems were created with the 
following options:


[root@fakemds1 ~]# mkfs.lustre --fsname=fake --mgs 
--servicenode=192.168.122.5@tcp0 --servicenode=192.168.122.67@tcp0 /dev/vdb

[root@fakemds1 ~]# mkfs.lustre --reformat --fsname=fake --mdt --index=0 
--servicenode=192.168.122.5@tcp0 --servicenode=192.168.122.67@tcp0 
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdc


[root@fakeoss1 ~]# mkfs.lustre --reformat --fsname=fake --ost --index=0 
--servicenode=192.168.122.197@tcp0 --servicenode=192.168.122.238@tcp0 
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdb #repeated for 3 more 
OSTs with changed index and devices appropriately


[root@fakeoss3 ~]# mkfs.lustre --reformat --fsname=fake --ost --index=4 
--servicenode=192.168.122.97@tcp0 --servicenode=192.168.122.221@tcp0 
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdb #repeated for 3 more 
OSTs with changed index and devices appropriately


Virtual disks were set as shareable and made visible to their correct VMs and 
often do mount, but occasionally (more than half the time) fail as above.  Have 
I missed any important information that could point to the cause?


Once I get this VM environment stable, I intend to update it to lustre 2.10.1.  
Thanks in advance for any troubleshooting tips you can provide.


Cheers

Scott
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] billions of 50k files

2017-11-29 Thread Brian Andrus

All,

I have always seen lustre as a good solution for large files and not the 
best for many small files.
Recently, I have seen a request for a small lustre system (2 OSSes, 1 
MDS) that would be for billions of files that average 50k-100k.


It seems to me, that for this to be 'of worth', the block sizes on disks 
need to be small, but even then, with tcp overhead and inode 
limitations, it may still not perform all that well (compared to larger 
files).


Am I off here? Have there been some developments in lustre that help 
this scenario (beyond small files being stored on the MDT directly)?


Thanks for any insight,

Brian Andrus

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre compilation error

2017-11-29 Thread Mohr Jr, Richard Frank (Rick Mohr)

> On Oct 18, 2017, at 9:44 AM, parag_k  wrote:
> 
> 
> I got the source from github.
> 
> My configure line is-
> 
> ./configure --disable-client 
> --with-kernel-source-header=/usr/src/kernels/3.10.0-514.el7.x86_64/ 
> --with-o2ib=/usr/src/ofa_kernel/default/
> 

Are you still running into this issue?  If so, try adding "--enable-server" and 
removing "--disable-client".  I was building lustre 2.10.1 today, and I 
initially had both "--disable-client" and "--enable-server" in my configuration 
line.  When I did that, I got error messages like these:

make[3]: *** No rule to make target `fld.ko', needed by `all-am'.  Stop.

When I removed the "--disable-client" option, the error went away.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Announce: Lustre Systems Administration Guide

2017-11-29 Thread Arman Khalatyan
Hello,
I am looking for some simple routing examples on ib0 to tcp.
All examples in the documentation are based on OPA or Mellanox.
Found some inconsistency in the manual routing part:
http://wiki.lustre.org/LNet_Router_Config_Guide
section: ARP flux issue for MR node
The Ethernet part is missing from the examples: it states "Below is an
example setup on a node with 2 Ethernet, 2 MLX and 2 OPA interfaces",
but the examples only describe 4 interfaces.


thank you beforehand,
Arman.
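
In the meantime, here is a bare-bones sketch of ib0-to-tcp routing using the classic 
modprobe-style configuration (all NIDs, interface names and addresses below are made 
up; the same thing can be expressed with lnetctl on 2.10+):

# on the LNet router (has both ib0 and eth0), /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0),tcp0(eth0)" forwarding="enabled"

# on IB-only nodes: reach tcp0 through the router's o2ib0 NID
options lnet networks="o2ib0(ib0)" routes="tcp0 192.168.10.1@o2ib0"

# on TCP-only nodes: reach o2ib0 through the router's tcp0 NID
options lnet networks="tcp0(eth0)" routes="o2ib0 10.0.0.1@tcp0"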

On Wed, Nov 29, 2017 at 1:50 PM, Shawn Hall  wrote:
> Andreas,
>
> We’ll bring the idea up on today’s OpenSFS board call. If the community has
> recommendations on what this might look like (preferred capabilities or
> suggestions for Q&A/forum software, or a pointer to existing hosted Q&A
> platforms like Stack Overflow), please let me know.
>
> Shawn
>
>
> On 11/28/17, 11:00 PM, "lustre-discuss on behalf of Dilger, Andreas"
> <andreas.dil...@intel.com> wrote:
>
> On Nov 17, 2017, at 20:20, Stu Midgley  wrote:
>>
>> Thank you both for the documentation. I know how hard it is to maintain.
>>
>> I've asked all my admin staff to read it - even if some of it doesn't
>> directly apply to our environment.
>>
>> What we would like is well organised, comprehensive, accurate and up to
>> date documentation. Most of the time when I dive into the manual, or other
>> online material, I find it isn't quite right (paths slightly wrong or
>> outdated etc).
>
> The manual is open to contributions if you find problems therein. Please
> see:
>
> https://wiki.hpdd.intel.com/display/PUB/Making+changes+to+the+Lustre+Manual+source
>
>> I also have difficulty finding all the information I want in a single
>> location and in a logical fashion. These aren't new issues and blight all
>> documentation, but having the definitive source in a wiki might open it up
>> to more transparency, greater use and thus, ultimately, being kept up to
>> date, even if its by others outside Intel.
>
> I'd be thrilled if there were contributors to the manual outside of Intel.
> IMHO, users who are not intimately familiar with Lustre are the best people
> to know when the manual isn't clear or is missing information. I personally
> don't read the manual very often, though I do reference it on occasion. When
> I find something wrong or outdated, I submit a patch, and it generally is
> landed quickly.
>
>> I'd also like a section where people can post their experiences and
>> solutions. For example, in recent times, we have battled bad interactions
>> with ZFS+lustre which led to poor performance and ZFS corruption. While we
>> have now tuned both lustre and zfs and the bugs have mostly been fixed, the
>> learnings, trouble shooting methods etc. should be preserved and might
>> assist others in the future diagnose tricky problems.
>
> Stack overflow for Lustre? I've been wondering about some kind of Q&A forum
> for Lustre for a while. This would be a great project to propose to OpenSFS
> to be hosted on the lustre.org site (Intel does not manage that site). I
> suspect there are numerous engines available for this already, and it just
> needs someone interested and/or knowledgeable enough to pick an engine and
> get it installed there.
>
> Cheers, Andreas
>
>> On Sat, Nov 18, 2017 at 6:03 AM, Dilger, Andreas
>>  wrote:
>> On Nov 16, 2017, at 22:41, Cowe, Malcolm J 
>> wrote:
>> >
>> > I am pleased to announce the availability of a new systems
>> > administration guide for the Lustre file system, which has been published 
>> > to
>> > wiki.lustre.org. The content can be accessed directly from the front page 
>> > of
>> > the wiki, or from the following URL:
>> >
>> > http://wiki.lustre.org/Category:Lustre_Systems_Administration
>> >
>> > The guide is intended to provide comprehensive instructions for the
>> > installation and configuration of production-ready Lustre storage clusters.
>> > Topics covered:
>> >
>> > • Introduction to Lustre
>> > • Lustre File System Components
>> > • Lustre Software Installation
>> > • Lustre Networking (LNet)
>> > • LNet Router Configuration
>> > • Lustre Object Storage Devices (OSDs)
>> > • Creating Lustre File System Services
>> > • Mounting a Lustre File System on Client Nodes
>> > • Starting and Stopping Lustre Services
>> > • Lustre High Availability
>> >
>> > Refer to the front page of the guide for the complete table of contents.
>>
>> Malcolm,
>> thanks so much for your work on this. It is definitely improving the
>> state of the documentation available today.
>>
>> I was wondering if people have an opinion on whether we should remove
>> some/all of the administration content from the Lustre Operations Manual,
>> and make that more of a reference manual that contains details of
>> commands, architecture, features, etc. as a second-level reference from
>> the wiki admin guide?
>>
>> For that matter, should we export the XML Manual into the wiki and
>> leave it there? We'd have to make sure that the wiki is being indexed
> by Google for easier searching before we could do that.

Re: [lustre-discuss] Lustre 2.10.1 + RHEL7 Page Allocation Failures

2017-11-29 Thread Jones, Peter A
Ah yes. One more thing – I believe that this has been addressed in the upcoming 
RHEL 7.5, so that might be another option for you to consider.

On 2017-11-29, 5:47 AM, "lustre-discuss on behalf of Charles A Taylor" 
<lustre-discuss-boun...@lists.lustre.org on behalf of chas...@ufl.edu> wrote:

Thank you, Peter.  I figured that would be the response but wanted to ask.  We 
were hoping to get away from maintaining a MOFED build but it looks like that 
may not be the way to go.

And you are correct about the JIRA ticket.  I misspoke.  It was the associated 
RH kernel bug that was “private”, IIRC.

Thank you again,

Charlie

On Nov 29, 2017, at 8:09 AM, Jones, Peter A 
<peter.a.jo...@intel.com> wrote:

Charles

That ticket is completely open so you do have access to everything. As I 
understand it the options are to either use the latest MOFED update rather than 
relying on the in-kernel OFED (which I believe is the advice usually provided 
by Mellanox anyway) or else apply the kernel patch Andreas has created that is 
referenced in the ticket.

Peter

On 2017-11-29, 2:50 AM, "lustre-discuss on behalf of Charles A Taylor" 
<lustre-discuss-boun...@lists.lustre.org on behalf of chas...@ufl.edu> wrote:


Hi All,

We recently upgraded from Lustre 2.5.3.90 on EL6 to 2.10.1 on EL7 (details 
below) but have hit what looks like LU-10133 (order 8 page allocation failures).

We don’t have access to look at the JIRA ticket in more detail but from what we 
can tell the fix is to change from vmalloc() to vmalloc_array() in the mlx4 
drivers.  However, the vmalloc_array() infrastructure is in an upstream (far 
upstream) kernel so I’m not sure when we’ll see that fix.

While this may not be a Lustre issue directly, I know we can’t be the only 
Lustre site running 2.10.1 over IB on Mellanox ConnectX-3 HCAs.  So far we have 
tried increasing vm.min_free_kbytes to 8GB but that does not help.  
Zone_reclaim_mode is disabled (for other reasons that may not be valid under 
EL7) but order 8 chunks get depleted on both NUMA nodes so I’m not sure that is 
the answer either (though we have not tried it yet).

[root@ufrcmds1 ~]# cat /proc/buddyinfo
Node 0, zone      DMA      1      0      0      0      2      1      1      0      1      1      3
Node 0, zone    DMA32   1554  13496  11481   5108    150      0      0      0      0      0      0
Node 0, zone   Normal 114119 208080  78468  35679   6215    690      0      0      0      0      0
Node 1, zone   Normal  81295 184795 106942  38818   4485    293   1653      0      0      0      0

I’m wondering if other sites are hitting this and, if so, what are you doing to 
work around the issue on your OSSs.

Regards,

Charles Taylor
UF Research Computing


Some Details:
---
OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
Clients: ~1400 (still running 2.5.3.90 but we are in the process of upgrading)
Servers: 10 HA OSS pairs (20 OSSs)
   128 GB RAM
   6 OSTs (8+2 RAID-6) per OSS
   Mellanox ConnectX-3 IB/VPI HCAs
   RedHat Native IB Stack (i.e. not MOFED)
   mlx4_core driver:
  filename:   
/lib/modules/3.10.0-693.2.2.el7_lustre.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
  version:2.2-1
  license:Dual BSD/GPL
  description:Mellanox ConnectX HCA low-level driver
  author: Roland Dreier
  rhelversion:7.4

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.1 + RHEL7 Page Allocation Failures

2017-11-29 Thread Charles A Taylor
Thank you, Peter.  I figured that would be the response but wanted to ask.  We 
were hoping to get away from maintaining a MOFED build but it looks like that 
may not be the way to go.

And you are correct about the JIRA ticket.  I misspoke.  It was the associated 
RH kernel bug that was “private”, IIRC.  

Thank you again,

Charlie

> On Nov 29, 2017, at 8:09 AM, Jones, Peter A  wrote:
> 
> Charles
> 
> That ticket is completely open so you do have access to everything. As I 
> understand it the options are to either use the latest MOFED update rather 
> than relying on the in-kernel OFED (which I believe is the advice usually 
> provided by Mellanox anyway) or else apply the kernel patch Andreas has 
> created that is referenced in the ticket.
> 
> Peter
> 
> On 2017-11-29, 2:50 AM, "lustre-discuss on behalf of Charles A Taylor" 
> <lustre-discuss-boun...@lists.lustre.org on behalf of chas...@ufl.edu> wrote:
> 
>> 
>> Hi All,
>> 
>> We recently upgraded from Lustre 2.5.3.90 on EL6 to 2.10.1 on EL7 (details 
>> below) but have hit what looks like LU-10133 (order 8 page allocation 
>> failures).
>> 
>> We don’t have access to look at the JIRA ticket in more detail but from what 
>> we can tell the fix is to change from vmalloc() to vmalloc_array() in 
>> the mlx4 drivers.  However, the vmalloc_array() infrastructure is in an 
>> upstream (far upstream) kernel so I’m not sure when we’ll see that fix.
>> 
>> While this may not be a Lustre issue directly, I know we can’t be the only 
>> Lustre site running 2.10.1 over IB on Mellanox ConnectX-3 HCAs.  So far we 
>> have tried increasing vm.min_free_kbytes to 8GB but that does not help.  
>> Zone_reclaim_mode is disabled (for other reasons that may not be valid under 
>> EL7) but order 8 chunks get depleted on both NUMA nodes so I’m not sure that 
>> is the answer either (though we have not tried it yet).
>> 
>> [root@ufrcmds1 ~]# cat /proc/buddyinfo 
>> Node 0, zone      DMA      1      0      0      0      2      1      1      0      1      1      3 
>> Node 0, zone    DMA32   1554  13496  11481   5108    150      0      0      0      0      0      0 
>> Node 0, zone   Normal 114119 208080  78468  35679   6215    690      0      0      0      0      0 
>> Node 1, zone   Normal  81295 184795 106942  38818   4485    293   1653      0      0      0      0 
>> 
>> I’m wondering if other sites are hitting this and, if so, what are you doing 
>> to work around the issue on your OSSs.  
>> 
>> Regards,
>> 
>> Charles Taylor
>> UF Research Computing
>> 
>> 
>> Some Details:
>> ---
>> OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
>> Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
>> Clients: ~1400 (still running 2.5.3.90 but we are in the process of 
>> upgrading)
>> Servers: 10 HA OSS pairs (20 OSSs)
>>128 GB RAM
>>6 OSTs (8+2 RAID-6) per OSS 
>>Mellanox ConnectX-3 IB/VPI HCAs 
>>RedHat Native IB Stack (i.e. not MOFED)
>>mlx4_core driver:
>>   filename:   
>> /lib/modules/3.10.0-693.2.2.el7_lustre.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
>>   version:2.2-1
>>   license:Dual BSD/GPL
>>   description:Mellanox ConnectX HCA low-level driver
>>   author: Roland Dreier
>>   rhelversion:7.4

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.1 + RHEL7 Page Allocation Failures

2017-11-29 Thread Jones, Peter A
Charles

That ticket is completely open so you do have access to everything. As I 
understand it the options are to either use the latest MOFED update rather than 
relying on the in-kernel OFED (which I believe is the advice usually provided 
by Mellanox anyway) or else apply the kernel patch Andreas has created that is 
referenced in the ticket.

Peter

On 2017-11-29, 2:50 AM, "lustre-discuss on behalf of Charles A Taylor" 
<lustre-discuss-boun...@lists.lustre.org on behalf of chas...@ufl.edu> wrote:


Hi All,

We recently upgraded from Lustre 2.5.3.90 on EL6 to 2.10.1 on EL7 (details 
below) but have hit what looks like LU-10133 (order 8 page allocation failures).

We don’t have access to look at the JIRA ticket in more detail but from what we 
can tell the fix is to change from vmalloc() to vmalloc_array() in the mlx4 
drivers.  However, the vmalloc_array() infrastructure is in an upstream (far 
upstream) kernel so I’m not sure when we’ll see that fix.

While this may not be a Lustre issue directly, I know we can’t be the only 
Lustre site running 2.10.1 over IB on Mellanox ConnectX-3 HCAs.  So far we have 
tried increasing vm.min_free_kbytes to 8GB but that does not help.  
Zone_reclaim_mode is disabled (for other reasons that may not be valid under 
EL7) but order 8 chunks get depleted on both NUMA nodes so I’m not sure that is 
the answer either (though we have not tried it yet).

[root@ufrcmds1 ~]# cat /proc/buddyinfo
Node 0, zone      DMA      1      0      0      0      2      1      1      0      1      1      3
Node 0, zone    DMA32   1554  13496  11481   5108    150      0      0      0      0      0      0
Node 0, zone   Normal 114119 208080  78468  35679   6215    690      0      0      0      0      0
Node 1, zone   Normal  81295 184795 106942  38818   4485    293   1653      0      0      0      0

I’m wondering if other sites are hitting this and, if so, what are you doing to 
work around the issue on your OSSs.

Regards,

Charles Taylor
UF Research Computing


Some Details:
---
OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
Clients: ~1400 (still running 2.5.3.90 but we are in the process of upgrading)
Servers: 10 HA OSS pairs (20 OSSs)
   128 GB RAM
   6 OSTs (8+2 RAID-6) per OSS
   Mellanox ConnectX-3 IB/VPI HCAs
   RedHat Native IB Stack (i.e. not MOFED)
   mlx4_core driver:
  filename:   
/lib/modules/3.10.0-693.2.2.el7_lustre.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
  version:2.2-1
  license:Dual BSD/GPL
  description:Mellanox ConnectX HCA low-level driver
  author: Roland Dreier
  rhelversion:7.4
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Announce: Lustre Systems Administration Guide

2017-11-29 Thread Shawn Hall
Andreas,

We’ll bring the idea up on today’s OpenSFS board call.  If the community has 
recommendations on what this might look like (preferred capabilities or 
suggestions for Q&A/forum software, or a pointer to existing hosted Q&A 
platforms like Stack Overflow), please let me know.

Shawn

On 11/28/17, 11:00 PM, "lustre-discuss on behalf of Dilger, Andreas" 
<andreas.dil...@intel.com> 
wrote:

On Nov 17, 2017, at 20:20, Stu Midgley  wrote:
> 
> Thank you both for the documentation.  I know how hard it is to maintain. 
> 
> I've asked all my admin staff to read it - even if some of it 
doesn't directly apply to our environment.
> 
> What we would like is well organised, comprehensive, accurate and up to 
date documentation.  Most of the time when I dive into the manual, or other 
online material, I find it isn't quite right (paths slightly wrong or outdated 
etc). 

The manual is open to contributions if you find problems therein.  Please 
see:


https://wiki.hpdd.intel.com/display/PUB/Making+changes+to+the+Lustre+Manual+source

> I also have difficulty finding all the information I want in a single 
location and in a logical fashion.  These aren't new issues and blight all 
documentation, but having the definitive source in a wiki might open it up to 
more transparency, greater use and thus, ultimately, being kept up to date, 
even if its by others outside Intel.

I'd be thrilled if there were contributors to the manual outside of Intel.  
IMHO, users who are not intimately familiar with Lustre are the best people to 
know when the manual isn't clear or is missing information.  I personally don't 
read the manual very often, though I do reference it on occasion.  When I find 
something wrong or outdated, I submit a patch, and it generally is landed 
quickly.

> I'd also like a section where people can post their experiences and 
solutions.  For example, in recent times, we have battled bad interactions with 
ZFS+lustre which led to poor performance and ZFS corruption.  While we have 
now tuned both lustre and zfs and the bugs have mostly been fixed, the 
learnings, trouble shooting methods etc. should be preserved and might assist 
others in the future diagnose tricky problems.

Stack overflow for Lustre?  I've been wondering about some kind of Q&A 
forum for Lustre for a while.  This would be a great project to propose to 
OpenSFS to be hosted on the lustre.org site (Intel does not manage that site).  
I suspect there are numerous engines available for this already, and it just 
needs someone interested and/or knowledgeable enough to pick an engine and get 
it installed there.

Cheers, Andreas

> On Sat, Nov 18, 2017 at 6:03 AM, Dilger, Andreas 
 wrote:
> On Nov 16, 2017, at 22:41, Cowe, Malcolm J  
wrote:
> >
> > I am pleased to announce the availability of a new systems 
administration guide for the Lustre file system, which has been published to 
wiki.lustre.org. The content can be accessed directly from the front page of 
the wiki, or from the following URL:
> >
> > http://wiki.lustre.org/Category:Lustre_Systems_Administration
> >
> > The guide is intended to provide comprehensive instructions for the 
installation and configuration of production-ready Lustre storage clusters. 
Topics covered:
> >
> >   • Introduction to Lustre
> >   • Lustre File System Components
> >   • Lustre Software Installation
> >   • Lustre Networking (LNet)
> >   • LNet Router Configuration
> >   • Lustre Object Storage Devices (OSDs)
> >   • Creating Lustre File System Services
> >   • Mounting a Lustre File System on Client Nodes
> >   • Starting and Stopping Lustre Services
> >   • Lustre High Availability
> >
> > Refer to the front page of the guide for the complete table of contents.
> 
> Malcolm,
> thanks so much for your work on this.  It is definitely improving the
> state of the documentation available today.
> 
> I was wondering if people have an opinion on whether we should remove
> some/all of the administration content from the Lustre Operations Manual,
> and make that more of a reference manual that contains details of
> commands, architecture, features, etc. as a second-level reference from
> the wiki admin guide?
> 
> For that matter, should we export the XML Manual into the wiki and
> leave it there?  We'd have to make sure that the wiki is being indexed
> by Google for easier searching before we could do that.
> 
> Cheers, Andreas
> 
> > In addition, for people who are new to Lustre, there is a high-level 
introduction to Lustre concepts, available as a PDF download:
> >
> > http://wiki.lustre.org/images/6/64/LustreArchitecture-v4.pdf
> >
> >
> > Malcolm Cowe

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

Re: [lustre-discuss] Recompiling client from the source doesnot contain lnetctl

2017-11-29 Thread Arman Khalatyan
even in the extracted source code the lnetctl does not compile.
running make in the utils folder it is producing wirecheck,lst and
routerstat, but not lnetctl.
After running "make lnetctl" in the utils folder
/tmp/lustre-2.10.2_RC1/lnet/utils

it produces the executable.
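
Spelled out as commands, the workaround that worked here (the path matches the
extracted 2.10.2_RC1 tree mentioned above):

cd /tmp/lustre-2.10.2_RC1/lnet/utils
make lnetctl        # plain "make" in this directory skips lnetctl; naming the target builds it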


On Wed, Nov 29, 2017 at 11:52 AM, Arman Khalatyan  wrote:
> Hi Andreas,
> I just checked the yaml-devel it is installed:
> yum list installed | grep yaml
> libyaml.x86_64 0.1.4-11.el7_0  @base
> libyaml-devel.x86_64   0.1.4-11.el7_0  @base
>
> and still no success:
>  rpm -qpl rpmbuild/RPMS/x86_64/*.rpm| grep lnetctl
> /usr/share/man/man8/lnetctl.8.gz
> /usr/src/debug/lustre-2.10.2_RC1/lnet/include/lnet/lnetctl.h
>
> are there any other dependencies ?
>
> Thanks,
> Arman.
>
> On Wed, Nov 29, 2017 at 6:46 AM, Dilger, Andreas
>  wrote:
>> On Nov 28, 2017, at 07:58, Arman Khalatyan  wrote:
>>>
>>> Hello,
>>> I would like to recompile the client from the rpm-source but looks
>>> like the packaging on the jenkins is wrong:
>>>
>>> 1) wget 
>>> https://build.hpdd.intel.com/job/lustre-b2_10/arch=x86_64,build_type=client,distro=el7,ib_stack=inkernel/lastSuccessfulBuild/artifact/artifacts/SRPMS/lustre-2.10.2_RC1-1.src.rpm
>>> 2) rpmbuild --rebuild --without servers lustre-2.10.2_RC1-1.src.rpm
>>> after the successful build the rpms doesn't contain the lnetctl but
>>> the help only
>>> 3) cd /root/rpmbuild/RPMS/x86_64
>>> 4) rpm -qpl ./*.rpm| grep lnetctl
>>> /usr/share/man/man8/lnetctl.8.gz
>>> /usr/src/debug/lustre-2.10.2_RC1/lnet/include/lnet/lnetctl.h
>>>
>>> The   lustre-client-2.10.2_RC1-1.el7.x86_64.rpm on the jenkins
>>> contains the lnetctl
>>> Maybe I should add more options to rebuild the client + lnetctl?
>>
>> You need to have libyaml-devel installed on your build node.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Principal Architect
>> Intel Corporation
>>
>>
>>
>>
>>
>>
>>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre 2.10.1 + RHEL7 Lock Callback Timer Expired

2017-11-29 Thread Charles A Taylor

We have a genomics pipeline app (supernova) that fails consistently due to the 
client being evicted on the OSSs with a  “lock callback timer expired”.  I 
doubled “ldlm_enqueue_min” across the cluster but then the timer simply expired 
after 200s rather than 100s so I don’t think that is the answer.   The 
syslog/dmesg on the client shows no signs of distress and it is a “bigmem” 
machine with 1TB of RAM.  
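
For reference, a sketch of how that minimum lock-callback timeout is usually raised, 
assuming it is the ldlm_enqueue_min parameter exposed by the ptlrpc module (the value 
is only an example and has to be consistent across the servers):

# /etc/modprobe.d/lustre.conf on the servers
options ptlrpc ldlm_enqueue_min=200

# check what is currently in effect
cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min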

The eviction appears to come while the application is processing a large number 
(~300) of data “chunks” (i.e. files) which occur in pairs.

-rw-r--r-- 1 chasman ufhpc 24 Nov 28 23:31 
./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs
-rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31 
./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm

I assume the 24-byte file is metadata (an index or some such) and the 34M file 
is the actual data but I’m just guessing since I’m completely unfamiliar with 
the application.  

The write error is,

#define ENOTCONN        107     /* Transport endpoint is not connected */

which occurs after the OSS eviction.  This was reproducible under 2.5.3.90 as 
well.  We hoped that upgrading to 2.10.1 would resolve the issue but it has 
not.  

This is the first application (in 10 years) we have encountered that 
consistently and reliably fails when run over Lustre.  I’m not sure at this 
point whether this is a bug or tuning issue.
If others have encountered and overcome something like this, we’d be grateful 
to hear from you.

Regards,

Charles Taylor
UF Research Computing

OSS:
--
Nov 28 23:41:41 ufrcoss28 kernel: LustreError: 
0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired 
after 201s: evicting client at 10.13.136.74@o2ib  ns: filter-testfs-OST002e_UUID 
lock: 880041717400/0x9bd23c8dc69323a1 lrc: 3/0,0 mode: PW/PW res: 
[0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 
4096->1802239) flags: 0x6400010020 nid: 10.13.136.74@o2ib remote: 
0xe54f26957f2ac591 expref: 45 pid: 6836 timeout: 6488120506 lvb_type: 0

Client:
———
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: 
testfs-OST002e-osc-88c053fe3800: operation ost_write to node 
10.13.136.30@o2ib failed: rc = -107
Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002e-osc-88c053fe3800: 
Connection to testfs-OST002e (at 10.13.136.30@o2ib) was lost; in progress 
operations using this service will wait for recovery to complete
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: 
testfs-OST002e-osc-88c053fe3800: This client was evicted by testfs-OST002e; 
in progress operations using this service will fail.
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: 
testfs-OST002c-osc-88c053fe3800: operation ost_punch to node 
10.13.136.30@o2ib failed: rc = -107
Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002c-osc-88c053fe3800: 
Connection to testfs-OST002c (at 10.13.136.30@o2ib) was lost; in progress 
operations using this service will wait for recovery to complete
Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: 
testfs-OST002c-osc-88c053fe3800: This client was evicted by testfs-OST002c; 
in progress operations using this service will fail.
Nov 28 23:41:47 s5a-s23 kernel: LustreError: 11-0: 
testfs-OST-osc-88c053fe3800: operation ost_statfs to node 
10.13.136.23@o2ib failed: rc = -107
Nov 28 23:41:47 s5a-s23 kernel: Lustre: testfs-OST-osc-88c053fe3800: 
Connection to testfs-OST (at 10.13.136.23@o2ib) was lost; in progress 
operations using this service will wait for recovery to complete
Nov 28 23:41:47 s5a-s23 kernel: LustreError: 167-0: 
testfs-OST0004-osc-88c053fe3800: This client was evicted by testfs-OST0004; 
in progress operations using this service will fail.
Nov 28 23:43:11 s5a-s23 kernel: Lustre: testfs-OST0006-osc-88c053fe3800: 
Connection restored to 10.13.136.24@o2ib (at 10.13.136.24@o2ib)
Nov 28 23:43:38 s5a-s23 kernel: Lustre: testfs-OST002c-osc-88c053fe3800: 
Connection restored to 10.13.136.30@o2ib (at 10.13.136.30@o2ib)
Nov 28 23:43:45 s5a-s23 kernel: Lustre: testfs-OST-osc-88c053fe3800: 
Connection restored to 10.13.136.23@o2ib (at 10.13.136.23@o2ib)
Nov 28 23:43:48 s5a-s23 kernel: Lustre: testfs-OST0004-osc-88c053fe3800: 
Connection restored to 10.13.136.23@o2ib (at 10.13.136.23@o2ib)
Nov 28 23:43:48 s5a-s23 kernel: Lustre: Skipped 3 previous similar messages
Nov 28 23:43:55 s5a-s23 kernel: Lustre: testfs-OST0007-osc-88c053fe3800: 
Connection restored to 10.13.136.24@o2ib (at 10.13.136.24@o2ib)
Nov 28 23:43:55 s5a-s23 kernel: Lustre: Skipped 4 previous similar messages

Some Details:
---
OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
Client: 2.10.1 
 1 TB RAM
 Mellanox ConnectX-3 IB/VPI HCAs
 Linux s5a-s23.ufhpc 2.6.32-696.13.2.el6.x86_64 
 MOFED 3.2.2 IB stack
 Lustre 2.10.1
Servers: 

Re: [lustre-discuss] Recompiling client from the source doesnot contain lnetctl

2017-11-29 Thread Arman Khalatyan
Hi Andreas,
I just checked the yaml-devel it is installed:
yum list installed | grep yaml
libyaml.x86_64 0.1.4-11.el7_0  @base
libyaml-devel.x86_64   0.1.4-11.el7_0  @base

and still no success:
 rpm -qpl rpmbuild/RPMS/x86_64/*.rpm| grep lnetctl
/usr/share/man/man8/lnetctl.8.gz
/usr/src/debug/lustre-2.10.2_RC1/lnet/include/lnet/lnetctl.h

are there any other dependencies ?

Thanks,
Arman.

On Wed, Nov 29, 2017 at 6:46 AM, Dilger, Andreas
 wrote:
> On Nov 28, 2017, at 07:58, Arman Khalatyan  wrote:
>>
>> Hello,
>> I would like to recompile the client from the rpm-source but looks
>> like the packaging on the jenkins is wrong:
>>
>> 1) wget 
>> https://build.hpdd.intel.com/job/lustre-b2_10/arch=x86_64,build_type=client,distro=el7,ib_stack=inkernel/lastSuccessfulBuild/artifact/artifacts/SRPMS/lustre-2.10.2_RC1-1.src.rpm
>> 2) rpmbuild --rebuild --without servers lustre-2.10.2_RC1-1.src.rpm
>> after the successful build the rpms doesn't contain the lnetctl but
>> the help only
>> 3) cd /root/rpmbuild/RPMS/x86_64
>> 4) rpm -qpl ./*.rpm| grep lnetctl
>> /usr/share/man/man8/lnetctl.8.gz
>> /usr/src/debug/lustre-2.10.2_RC1/lnet/include/lnet/lnetctl.h
>>
>> The   lustre-client-2.10.2_RC1-1.el7.x86_64.rpm on the jenkins
>> contains the lnetctl
>> Maybe I should add more options to rebuild the client + lnetctl?
>
> You need to have libyaml-devel installed on your build node.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
>
>
>
>
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre 2.10.1 + RHEL7 Page Allocation Failures

2017-11-29 Thread Charles A Taylor

Hi All,

We recently upgraded from Lustre 2.5.3.90 on EL6 to 2.10.1 on EL7 (details 
below) but have hit what looks like LU-10133 (order 8 page allocation failures).

We don’t have access to look at the JIRA ticket in more detail but from what we 
can tell the fix is to change from vmalloc() to vmalloc_array() in the mlx4 
drivers.  However, the vmalloc_array() infrastructure is in an upstream (far 
upstream) kernel so I’m not sure when we’ll see that fix.

While this may not be a Lustre issue directly, I know we can’t be the only 
Lustre site running 2.10.1 over IB on Mellanox ConnectX-3 HCAs.  So far we have 
tried increasing vm.min_free_kbytes to 8GB but that does not help.  
Zone_reclaim_mode is disabled (for other reasons that may not be valid under 
EL7) but order 8 chunks get depleted on both NUMA nodes so I’m not sure that is 
the answer either (though we have not tried it yet).
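
For completeness, these are the knobs referenced above (the 8GB value expressed in kB; 
whether raising it helps is exactly the open question):

sysctl -w vm.min_free_kbytes=8388608    # the 8GB setting tried above
cat /proc/sys/vm/zone_reclaim_mode      # 0 = disabled on these OSSs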

[root@ufrcmds1 ~]# cat /proc/buddyinfo 
Node 0, zone      DMA      1      0      0      0      2      1      1      0      1      1      3 
Node 0, zone    DMA32   1554  13496  11481   5108    150      0      0      0      0      0      0 
Node 0, zone   Normal 114119 208080  78468  35679   6215    690      0      0      0      0      0 
Node 1, zone   Normal  81295 184795 106942  38818   4485    293   1653      0      0      0      0 

I’m wondering if other sites are hitting this and, if so, what are you doing to 
work around the issue on your OSSs.  

Regards,

Charles Taylor
UF Research Computing


Some Details:
---
OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
Clients: ~1400 (still running 2.5.3.90 but we are in the process of upgrading)
Servers: 10 HA OSS pairs (20 OSSs)
   128 GB RAM
   6 OSTs (8+2 RAID-6) per OSS 
   Mellanox ConnectX-3 IB/VPI HCAs 
   RedHat Native IB Stack (i.e. not MOFED)
   mlx4_core driver:
  filename:   
/lib/modules/3.10.0-693.2.2.el7_lustre.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
  version:2.2-1
  license:Dual BSD/GPL
  description:Mellanox ConnectX HCA low-level driver
  author: Roland Dreier
  rhelversion:7.4
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org