Re: [lustre-discuss] checking and turning on quota enforcement.

2018-06-06 Thread Dilger, Andreas
On May 30, 2018, at 08:18, Phill Harvey-Smith  
wrote:
> 
> Hi all,
> 
> About a year ago I migrated the departmental lustre servers to new hardware, 
> which involved setting up new hardware, installing lustre and copying the 
> files over. All of which went OK and we have been using the volumes without 
> problems for the last year or so.
> 
> However it has just become apparent that the new servers are not enforcing 
> quotas, though they are set for all users on two of our volumes. This has 
> led to the situation where one of our volumes is almost full, and is 
> negatively affecting performance.
> 
> Looking at the documentation for lfs quotaon / quotaoff it suggests that they 
> are deprecated since Lustre version 2.4.x. We are currently running 2.9.0 on 
> Centos 7.
> 
> How can I check to see if quota enforcement is turned on, and if not turn it 
> on once we have people back under quota again (or adjusted their quota).

You didn't mention if you are using ldiskfs or ZFS for the backing filesystem.  
I'm assuming since you upgraded from 2.4 that it is likely you are using 
ldiskfs.

In that case, you need to enable quota on the MDT/OST filesystems using 
"tune2fs -O quota" and run a full e2fsck to update the quota usage on each 
filesystem.
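
For example, on each unmounted ldiskfs MDT/OST (a sketch; the device name is a
placeholder, and the "lctl conf_param" lines for re-enabling enforcement are the
usual post-2.4 method rather than something stated above, so treat them as an
assumption):

  tune2fs -O quota /dev/sdX        # add the quota feature to the ldiskfs target
  e2fsck -fp /dev/sdX              # full check rebuilds the quota accounting files
  # then, from the MGS, re-enable enforcement for users and groups:
  lctl conf_param <fsname>.quota.mdt=ug
  lctl conf_param <fsname>.quota.ost=ug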

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Using Lustre for distributed build use case -- Performance tuning suggestions

2018-05-29 Thread Dilger, Andreas
Unless you have huge directories, you may not see any improvement from DNE, and 
it may hurt performance because striped directories have more overhead when 
they are first created.

DNE is mostly useful when a single MDS is overloaded by many clients, but with 
the small IO workload here that may not be the case.
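
If you do want to compare both approaches, the directory layout can be chosen
per directory; a minimal sketch (paths and MDT indices are examples):

  lfs mkdir -i 1 /mnt/lustre/build_remote      # plain directory placed on MDT0001
  lfs setdirstripe -c 2 /mnt/lustre/build_dne  # directory striped across 2 MDTs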

Also, you would likely benefit from IB networking, which is lower latency 
compared to TCP.

Cheers, Andreas

On May 29, 2018, at 17:26, meng ding wrote:

Hi,

We are in the process of helping a client evaluate shared file system 
solutions for their distributed build use case. Our client is a company with 
hundreds of developers around the globe doing system software development 24 
hours a day. At any time, there could be many builds running, including interactive 
builds (mostly incremental builds) or batch builds (mostly regression tests).

The size of the build is very big. For example, a single full build may have 
more than 6000 build tasks (i.e., make rules) that can all be run in parallel. 
Each build task takes about 6 seconds on average to run. So a sequential build 
using 1 CPU (e.g., make -j 1)  will take 10 hours to complete.

Our client is using a distributed build software to run the build across a 
cluster of build hosts. Think of (make -j N), only the build tasks are running 
simultaneously on multiple hosts, instead of one. Obviously for this to work, 
they need to have a shared file system with good performance. Our client is 
currently using NFS on NetApp, which most of the time provides good 
performance, but at a very high cost. With this combination, our client is able 
to complete the above mentioned build in less than 5 minutes in a build cluster 
with about 30 build hosts (25 cores per host). Another advantage of using a 
cluster of build hosts is to accommodate many builds from multiple developers 
at the same time, with each developer dynamically assigned a fair share of the 
total cores in the build cluster at any given time based on the resource 
requirement of each build.

The distributed build use case has the following characteristics:

  *   Mostly very small source files (tens of thousands of them) less than 16K 
to start with.
  *   Source files are read-only. All reads are sequential.
  *   Source files are read repetitively (e.g., the header files). So it can 
benefit hugely from client-side caching.
  *   Intermediate object files, libraries, or binary files are small to medium 
in size, the biggest binary generated is about several hundred megabytes.
  *   Binary/object files are generated by small random writes.
  *   There is NO concurrent/shared access to the same file. Each build task 
generates its own output file.

With this use case in mind, we are trying to explore alternative solutions to 
NFS on NetApp with the goal to achieve comparable performance with reduced 
cost. So far, we have done some benchmark with Lustre on distributed build of 
GCC 8.1 on AWS, but the performance is lagging quite a bit behind even kernel 
NFS:

Lustre Setup

Lustre Server

  *   2 MDS, each an m5.2xlarge instance (8 vCPUs, 32GiB Mem, up to 10Gb 
network), backed by 80 GiB SSD formatted with LDISKFS.
  *   DNE phase II (striped directory) is enabled.
  *   No data striping is enabled because most files are small.
  *   4 OSS, each an m5.xlarge instance (4 vCPUs, 16GiB Mem, up to 10Gb 
network), backed by 40 GiB SSD formatted with LDISKFS.
Build cluster
30 build hosts m5.xlarge, 120 CPUs in total all mounting the same Lustre volume

  *   The following is configured on all build hosts:
mount -t lustre -o localflock …
lctl set_param osc./*.checksums=0
lctl set_param osc./*.max_rpcs_in_flight=32
lctl set_param osc./*.max_dirty_mb=128

Test and results
Running distributed build of GCC 8.1 in the Lustre mount across the build 
cluster:

Launching 1 build only:

  *   Takes on average 17 minutes 45 seconds to finish.

Launching 20 builds at the same time all sharing the same build cluster:

  *   Takes on average 46 minutes to finish for each build.

By the way, we have tried the Data-on-MDT feature since we are using Lustre 
2.11, but we did not observe performance improvement.

Kernel NFS Setup

NFS Server
1 NFS server m5.2xlarge (8 vCPUS, 32GiB Mem, up to 10Gb network), backed by 300 
GiB SSD formatted with XFS

Build cluster
30 build hosts m5.xlarge, 120 CPUs in total all mounting the same NFS volume 
using NFS v3 protocol.

Test and results
Running distributed build of GCC 8.1 in the NFS mount across the build cluster:
Launching 1 build only:

  *   Takes on average 16 minutes 36 seconds to finish. About 1 minute faster 
than Lustre.

Launching 20 builds at the same time all sharing the same build cluster:

  *   Takes on average 38 minutes to finish for each build. About 8 minutes 
faster than Lustre.


So our question to the Lustre experts: given the distributed build use case, do 
you suggest anything else that we can try to potentially 

Re: [lustre-discuss] set_param permanent on client side ?

2018-05-28 Thread Dilger, Andreas
Running"lctl get_param -P" and "lctl conf_param" need to be done on the MGS to 
be able to store the records in the config log, and to ensure that the user has 
the correct access permissions. 

The clients are notified of the config log update and apply the logs locally; 
the parameters do not need to be present on the MGS itself for this to work. 
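
For example, run on the MGS node (a sketch using parameters from the message
below; the client-side osc/llite settings are picked up by the clients when
they process the updated config log):

  lctl set_param -P osc.*.checksums=0
  lctl set_param -P llite.*.max_read_ahead_mb=1024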

Cheers, Andreas

> On May 28, 2018, at 16:19, Riccardo Veraldi  
> wrote:
> 
> the problem is that some of these parameters seem to be missing on
> the MDS side
> 
> lctl get_param osc.*.checksums
> error: get_param: param_path 'osc/*/checksums': No such file or directory
> 
> lctl get_param osc.*.max_pages_per_rpc
> error: get_param: param_path 'osc/*/max_pages_per_rpc': No such file or
> directory
> 
> lctl get_param llite.*.max_cached_mb
> 
> only some of them are on the MDS side
> 
> lctl get_param osc.*.max_rpcs_in_flight
> osc.drpffb-OST0001-osc-MDT.max_rpcs_in_flight=64
> osc.drpffb-OST0002-osc-MDT.max_rpcs_in_flight=64
> osc.drpffb-OST0003-osc-MDT.max_rpcs_in_flight=64
> 
> 
> 
> 
>> On 5/23/18 1:15 AM, Artem Blagodarenko wrote:
>> Hello Riccardo,
>> 
>> There is the “lctl set_param -P” command that sets a parameter permanently. It 
>> needs to be executed on the MGS server (and only the MGS must be mounted), but the 
>> parameter is applied to the given target (or client). From your example:
>> 
>> lctl set_param -P osc.*.checksums=0   
>> 
>> Will execute “set_param osc.*.checksums=0” on all targets.
>> 
>> Best regards,
>> Artem Blagodarenko.
>> 
>>> On 23 May 2018, at 00:11, Riccardo Veraldi  
>>> wrote:
>>> 
>>> Hello,
>>> 
>>> how do I set_param in a persistent way on the lustre client side so that
>>> it does not have to be set again after every reboot?
>>> 
>>> Not all of these parameters can be set on the MDS, for example the osc.* :
>>> 
>>> lctl set_param osc.*.checksums=0
>>> lctl set_param timeout=600
>>> lctl set_param at_min=250
>>> lctl set_param at_max=600
>>> lctl set_param ldlm.namespaces.*.lru_size=2000
>>> lctl set_param osc.*.max_rpcs_in_flight=64
>>> lctl set_param osc.*.max_dirty_mb=1024
>>> lctl set_param llite.*.max_read_ahead_mb=1024
>>> lctl set_param llite.*.max_cached_mb=81920
>>> lctl set_param llite.*.max_read_ahead_per_file_mb=1024
>>> lctl set_param subsystem_debug=0
>>> 
>>> thank you
>>> 
>>> 
>>> Rick
>>> 
>>> 
>>> 
>>> ___
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Adding user_xattr option

2018-05-24 Thread Dilger, Andreas
On May 24, 2018, at 06:27, Trickey, Ron  
wrote:
> 
> I need to add the user_xattr option to lustre. I’ve read that the option must 
> be added to the MDS first. Will a restart of the MDS be required, or do I 
> simply need to remount? Also does anyone know if the user_xattr option 
> supports permissions applied with the setfacl command?

You will need to restart the MDS for this to take effect.  That said, this
should be the default unless you have a very old filesystem.

For ACLs to be enabled, you need to add the "acl" mount option on the MDS.
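
For example (a sketch; the MDT device path is a placeholder), either at mount
time or stored persistently in the mount options:

  mount -t lustre -o acl,user_xattr /dev/<mdtdev> /mnt/mdt
  # or make it persistent; note --mountfsoptions replaces the existing options,
  # so include any options you already have set (e.g. errors=remount-ro):
  tunefs.lustre --mountfsoptions="user_xattr,acl,errors=remount-ro" /dev/<mdtdev>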

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] zfs has native dnode accounting supported... no

2018-05-16 Thread Dilger, Andreas
On May 16, 2018, at 00:22, Hans Henrik Happe  wrote:
> 
> When building 2.10.4-RC1 on CentOS 7.5 I noticed this during configure:
> 
> zfs has native dnode accounting supported... no
> 
> I'm using the kmod version of ZFS 0.7.9 from the official repos.
> Shouldn't native dnode accounting work with these versions?
> 
> Is there a way to detect if a Lustre filesystem is using native dnode
> accounting?

This looks like a bug.  The Lustre code was changed to detect ZFS project
quota (which has a different function signature in ZFS 0.8 and isn't
included in the ZFS 0.7.x releases), but in the process it lost the ability
to detect the old dnode accounting function signature.

I've pushed patch https://review.whamcloud.com/32418 that should fix this.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Synchronous writes on a loaded ZFS OST

2018-05-15 Thread Dilger, Andreas

> On May 15, 2018, at 10:05, Steve Thompson  wrote:
> 
> On Tue, 8 May 2018, Vicker, Darby (JSC-EG311) wrote:
> 
>> The fix suggested by Andreas in that thread worked fairly well and we 
>> continue to use it.
> 
> I'd like to inquire whether the fixes of LU-4009 will be available in a 
> future Lustre version, and if so, what the likely release date is. My 
> installation (2.10.3) is severely affected by the fsync() issue, and I'd like 
> to patch it as soon as possible. TIA,

There is no active development on LU-4009 ("Add ZIL support to osd-zfs")
so there is no ETA.

The only available workaround is LU-10460, which can be enabled with either
a module parameter:

options osd-zfs osd_txg_sync_delay_us=0

or at runtime via:

echo 0 > /sys/module/osd_zfs/parameters/osd_txg_sync_delay_us
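
To confirm the setting took effect, read the same parameter back:

  cat /sys/module/osd_zfs/parameters/osd_txg_sync_delay_us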

This is available out-of-the-box in Lustre 2.11 and Lustre 2.10.4 (when it is
released, which should be soon), or you can download and build b2_10 today.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Operations Manual PDF version

2018-05-11 Thread Dilger, Andreas
On May 11, 2018, at 12:29, Ms. Megan Larko  wrote:
> 
> Hi!
> 
> I am trying to get the PDF version of the Lustre Operations Manual from site
> http://doc.lustre.org   and click "PDF".   The direct link is shown as 
> http://doc.lustre.org/lustre_manual.pdf.
> 
> Today (11 May 2018) I am getting errors that the PDF cannot be opened.
> 
> Is there an issue or should I be using a different site to get the 
> most-current Manual?

The "latest build" of the manual is hosted on our build servers, which had a 
problem recently but should be available again now.

At one time we discussed having an rsync job to periodically copy the most 
recent .pdf and .html document builds to lustre.org, but I don't know if that 
happened.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] --disable-tests

2018-05-09 Thread Dilger, Andreas
On May 9, 2018, at 10:00, Michael Di Domenico  wrote:
> 
> when i try to
> 
> ./configure --disable-server --disable-tests
> 
> and then do
> 
> make && make rpms
> 
> the code compiles through until it gets to
> 
> Entering directory '/hpc/lustre/src/lustre-release/lustre/tests'
> 
> and then stops.  I know exactly why it stops, but am i mistaken in
> thinking that --disable-tests means that it shouldn't even be going in
> that directory?
> 
> fyi, it's failing because the binaries files (ie
> disk1_8_up_2_5-ldiskfs.tar.bz2 and its brothers) are missing from my
> source tree (by design)

Feel free to submit a patch to the Makefiles to handle this situation,
if it is important to you.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] /proc/sys/lnet gone ?

2018-05-08 Thread Dilger, Andreas
Please use "lctl get_param peers" or "lctl get_param nis". This will work with 
any version of Lustre, since we had to move files from /proc to /sys to make 
the upstream kernel folks happy. 
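
For example:

  lctl get_param peers    # replaces /proc/sys/lnet/peers
  lctl get_param nis      # replaces /proc/sys/lnet/nis
  lnetctl net show        # newer equivalent, if lnetctl is configured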

Cheers, Andreas

> On May 8, 2018, at 18:24, Riccardo Veraldi  
> wrote:
> 
> Hello,
> on my running Lustre 2.11.0 testbed I can no longer find /proc/sys/lnet.
> It was very handy to look at /proc/sys/lnet/peers and /proc/sys/lnet/nis.
> Has this been moved somewhere else?
> thank you
> 
> 
> Rick
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] PFL and OST Pools

2018-05-06 Thread Dilger, Andreas
This works as (hopefully) one would expect it to:

lfs setstripe -E4M -c1 --pool=pool-1 -E-1 -c-1 /mnt/lustre

There is no need to explicitly specify "--pool pool-1" for the second 
component, as it inherits the properties from the previous component. 

Setting it on the root directory would set this as the default for the whole 
filesystem, otherwise on any subdirectory it is inherited only by new 
files/directories created in that directory. 
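
To check what is in effect, a sketch (the mount point is an example):

  lfs getstripe -d /mnt/lustre     # show the default layout stored on the directory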

Cheers, Andreas

> On May 6, 2018, at 05:49, Dzmitryj Jakavuk  wrote:
> 
> Hello 
> 
> I am looking for the ability to use PFL inside a specific OST pool.  For 
> example, to stripe files whose size is below 4MB to a single OST in pool-1, 
> and stripe files whose size is bigger than 4MB to all OSTs in pool-1?
> 
> Thank  you  
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre on native ZFS encryption

2018-05-03 Thread Dilger, Andreas
If your mount is hanging, you could "modprobe libcfs" to load the debugging
infrastructure, then "lctl set_param debug=-1" to enable full debugging before
you try to mount the ZFS filesystem.  This will tell you where the code is
getting stuck.  Also, "echo t > /proc/sysrq-trigger" to dump the kernel
thread stacks would be useful...
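
A minimal sketch of that sequence (the log path is only an example):

  modprobe libcfs
  lctl set_param debug=-1
  # ... attempt the mount, then dump the kernel debug buffer and thread stacks:
  lctl dk > /tmp/lustre-mount-debug.log
  echo t > /proc/sysrq-trigger       # stacks go to the console/dmesg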

It's probably worthwhile to start putting this info into a Jira ticket so
that it is available for future reference.  Since ZFS 0.8 is not released
yet we still have some time to debug this before there are any real users.
I don't recommend that anyone use ZFS 0.8 in production yet, because the disk
formats may still change during development.  However, testing it before
release is definitely valuable so that it works reasonably well _before_ it
is released and doesn't need 5 bugs fixed immediately after the release.

Cheers, Andreas

On May 2, 2018, at 07:59, Mark Miller <mmill...@jhu.edu> wrote:
> 
> Hi Andreas,
> 
> You are correct in that there is a separate key management step in the zfs 
> encryption, but in the testing I was doing, the key was already loaded.  The 
> key is loaded either a) when the “zfs create” is done, b) when a “zfs mount 
> -l . .” is done, or c) separately via the “zfs load-key ...” command (the 
> benefit of this third method is that key management can be done by a non-root 
> key owner).  In my testing, the key had been loaded when I either created the 
> zfs filesystem via mkfs.lustre, or, after a reboot, by running the “zfs 
> load-key” just prior to the “mount -t lustre ...” command.
> 
> Since I have the Lustre source code, I can start looking through it to see if 
> I can find where the Lustre mount system call may be getting hung up.  I have 
> no idea... but it feels like the Lustre mount may be trying to read something 
> from the zfs filesystem “under the hood”, and it’s getting the result back in 
> its encrypted state.  We always have LUKS encryption to fall back to.
> 
> Mark
> --
> Mark Miller – JHPCE Cluster Technology Manager
> Johns Hopkins Bloomberg School of Public Health
> Office E2530, 615 N. Wolfe St., Baltimore, MD, 21205
> 443-287-2774 |  https://jhpce.jhu.edu/
> 
> On 5/1/18, 7:11 PM, "Dilger, Andreas" <andreas.dil...@intel.com> wrote:
> 
>On May 1, 2018, at 11:10, Mark Miller <mmill...@jhu.edu> wrote:
>> 
>> Hi All,
>> 
>> I’m curious if anyone has gotten Lustre working on top of native ZFS 
>> encryption. 
> 
>Yes, I think this would be quite interesting and useful for a number of 
> environments.  Since Lustre already has over-the-network encryption (Kerberos 
> or SSK "privacy" mode), this would provide a reasonably secure solution, at 
> the expense of doing extra crypto operations on the server.  Not quite as 
> awesome as end-to-end encryption handled entirely on the client, but 
> definitely better than no disk encryption at all.
> 
>> I realize I’m stretching the bounds of compatibility, but I’m wondering if 
>> someone has gotten it to work.  I assumed that Lustre would just sit on top 
>> of the encryption layer of ZFS, but
>> it doesn’t seem to work for me.  I’m able to run the “mkfs.lustre” with the 
>> ZFS encryption options added to the mkfs, but when I try to mount the OST, 
>> the mount command hangs.
>> 
>> What does work:
>> - I can create Lustre OSTs without encryption options, and the OSTs gets 
>> created and can be mounted as expected.
>> - I can create encrypted ZFS filesystems, and the ZFS filesystem works as 
>> expected.
>> - I can use LUKS to create encrypted devices, build a zpool on top of those 
>> LUKS encrypted devices, then run “mkfs.lustre” (without encryption options), 
>> and the Lustre filesystem mounts and works as expected.
>> 
>> What doesn’t work:
>> - Creating a Lustre filesystem with encryption options (mkfs.lustre ... 
>> --mkfsoptions="encryption=on ...”), and then mounting the Lustre filesystem.
>> - Building a zpool with encryption on the pool, running mkfs.lustre on top 
>> of the encrypted zpool.
>> - I also tried building an MDT with encryption, but that also hung while 
>> trying to mount the filesystem.
> 
>I haven't looked at the ZFS encryption code, but I suspect that the Lustre 
> osd-zfs code would need to pass an encryption key when it opens the pool, or 
> at least before it is accessible.  That is likely what is preventing the 
> filesystem access when Lustre mounts the new MDT.
> 
>I don't know the mechanism by which the key would be passed to the 
> dataset.  It might be a parameter that is passed to the dataset at open time, 
> or it might be passed out-of-band from userspac

Re: [lustre-discuss] Error destroying object

2018-05-02 Thread Dilger, Andreas
This is an OST FID, so you would need to get the parent MDT FID to be able to 
resolve the pathname.

Assuming an ldiskfs OST you can use:

'debugfs -c -R "stat O/0/d$((0x1bfc24c % 32))/$((0x1bfc24c))" 
LABEL=wurfs-OST001c'

To get the parent FID, then run "lfs fid2path /mnt/wurfs <FID>" on a client to find 
the path.

That said, the -115 error is "-EINPROGRESS", which means the OST thinks it is 
already trying to do this. Maybe a hung OST thread?

Cheers, Andreas

On May 2, 2018, at 06:53, Sidiney Crescencio wrote:

Hi All,

I need help to discover what file is about this error or how to solve it.

Apr 30 13:48:02 storage06 kernel: LustreError: 
44779:0:(ofd_dev.c:1884:ofd_destroy_hdl()) wurfs-OST001c: error destroying 
object [0x1001c:0x1bfc24c:0x0]: -115

I've been trying to map this to a file but I can't since I don't have the FID

Anyone knows how to sort it out?

Thanks in advance

--
Best Regards,



Sidiney


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre on native ZFS encryption

2018-05-01 Thread Dilger, Andreas
On May 1, 2018, at 11:10, Mark Miller  wrote:
> 
> Hi All,
>  
> I’m curious if anyone has gotten Lustre working on top of native ZFS 
> encryption. 

Yes, I think this would be quite interesting and useful for a number of 
environments.  Since Lustre already has over-the-network encryption (Kerberos 
or SSK "privacy" mode), this would provide a reasonably secure solution, at the 
expense of doing extra crypto operations on the server.  Not quite as awesome 
as end-to-end encryption handled entirely on the client, but definitely better 
than no disk encryption at all.

> I realize I’m stretching the bounds of compatibility, but I’m wondering if 
> someone has gotten it to work.  I assumed that Lustre would just sit on top 
> of the encryption layer of ZFS, but
> it doesn’t seem to work for me.  I’m able to run the “mkfs.lustre” with the 
> ZFS encryption options added to the mkfs, but when I try to mount the OST, 
> the mount command hangs.
>  
> What does work:
> - I can create Lustre OSTs without encryption options, and the OSTs get 
> created and can be mounted as expected.
> - I can create encrypted ZFS filesystems, and the ZFS filesystem works as 
> expected.
> - I can use LUKS to create encrypted devices, build a zpool on top of those 
> LUKS encrypted devices, then run “mkfs.lustre” (without encryption options), 
> and the Lustre filesystem mounts and works as expected.
>  
> What doesn’t work:
> - Creating a Lustre filesystem with encryption options (mkfs.lustre ... 
> --mkfsoptions="encryption=on ...”), and then mounting the Lustre filesystem.
> - Building a zpool with encryption on the pool, running mkfs.lustre on top of 
> the encrypted zpool.
> - I also tried building an MDT with encryption, but that also hung while 
> trying to mount the filesystem.

I haven't looked at the ZFS encryption code, but I suspect that the Lustre 
osd-zfs code would need to pass an encryption key when it opens the pool, or at 
least before it is accessible.  That is likely what is preventing the 
filesystem access when Lustre mounts the new MDT.

I don't know the mechanism by which the key would be passed to the dataset.  It 
might be a parameter that is passed to the dataset at open time, or it might be 
passed out-of-band from userspace with a "zfs" or "zpool" command?
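
For reference, the out-of-band sequence discussed elsewhere in this thread
(loading the key from userspace before mounting) would look like the sketch
below, although the reporter notes the mount still hangs even with the key
loaded:

  zfs load-key ost02-pool/ost3       # keylocation was set at mkfs time
  mount -t lustre ost02-pool/ost3 /mnt/lustre/local/oss03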

Cheers, Andreas

> I’m on Centos 7.4 (3.10.0-693.21.1.el7.x86_64), and using zfs and spl 
> compiled from https://github.com/zfsonlinux (which has the encryption code 
> from Tom Caputi built into it).  For the Lustre versions, I’ve tried using 
> the version 2.10.3 rpms from 
> https://downloads.hpdd.intel.com/public/lustre/latest-release/el7/server as 
> well as the 2.11.50_74 version compiled from 
> git://git.hpdd.intel.com/fs/lustre-release.git, and the results are the same 
> on both versions of Lustre.
>  
>  
> Here is an example of what happens when I try to use Lustre with ZFS 
> encryption:
>  
> [root@oss1 lustre-release]# zpool create ost02-pool raidz2 disk0 disk1 disk2 
> disk3
> [root@oss1 lustre-release]# mkfs.lustre --ost --backfstype=zfs 
> --fsname=lustre01 --index=3 --mgsnode=192.168.56.131@tcp1 
> --mkfsoptions="encryption=on -o keyformat=passphrase -o 
> keylocation=file:///tmp/key" --servicenode=192.168.56.121@tcp1 ost02-pool/ost3
>  
>Permanent disk data:
> Target: lustre01:OST0003
> Index:  3
> Lustre FS:  lustre01
> Mount type: zfs
> Flags:  0x1062
>   (OST first_time update no_primnode )
> Persistent mount opts: 
> Parameters: mgsnode=192.168.56.131@tcp1 failover.node=192.168.56.121@tcp1
> checking for existing Lustre data: not found
> mkfs_cmd = zfs create -o canmount=off  -o encryption=on -o 
> keyformat=passphrase -o keylocation=file:///tmp/key ost02-pool/ost3
>   xattr=sa
>   dnodesize=auto
>   recordsize=1M
> Writing ost02-pool/ost3 properties
>   lustre:mgsnode=192.168.56.131@tcp1
>   lustre:failover.node=192.168.56.121@tcp1
>   lustre:version=1
>   lustre:flags=4194
>   lustre:index=3
>   lustre:fsname=lustre01
>   lustre:svname=lustre01:OST0003
>  
>  
> But then when I run:
>  
> [root@oss1 lustre-release]# mount -t lustre ost02-pool/ost3 
> /mnt/lustre/local/oss03
>  
> the command hangs. After I reboot and do an strace of the mount command, it 
> hangs at:
>  
> . . .
>  
> ioctl(3, _IOC(0, 0x5a, 0x16, 0x00), 0x7ffe87468b70) = 0
> ioctl(3, _IOC(0, 0x5a, 0x12, 0x00), 0x7ffe87465550) = 0
> ioctl(3, _IOC(0, 0x5a, 0x16, 0x00), 0x7ffe87468bd0) = 0
> ioctl(3, _IOC(0, 0x5a, 0x12, 0x00), 0x7ffe874655b0) = 0
> ioctl(3, _IOC(0, 0x5a, 0x16, 0x00), 0x7ffe87468bd0) = 0
> ioctl(3, _IOC(0, 0x5a, 0x12, 0x00), 0x7ffe874655b0) = 0
> mount("ost02-pool/ost3", "/mnt/lustre/local/oss03", "lustre", MS_STRICTATIME, 
> "osd=osd-zfs,mgsnode=192.168.56.131@tcp1,virgin,update,noprimnode,param=mgsnode=192.168.56.131@tcp1,param=failover.node=192.168.56.121@tcp1,svname=lustre01-OST0003,device=ost02-pool/ost3"
>  
>  
>  
> The LUKS encryption does work to provide encryption at rest, but there 

Re: [lustre-discuss] Do I need Lustre?

2018-04-30 Thread Dilger, Andreas
On Apr 30, 2018, at 07:11, Thackeray, Neil L  wrote:
> 
> Sorry, I left out file size. We don't foresee growing tremendously. The plan 
> is for researchers to upload their data, get the results, and copy it down to 
> a mounted file system. This is going to be used by multiple researchers, and 
> we will be charging for compute time. We really don't want this cluster to be 
> used for storing data outside of the time needed for their computations. We 
> may just start with 100TB of SSD storage.

One of the major benefits of Lustre is that it can be used directly for 
large-scale computing.  Having users copy data to/from Lustre is fairly 
inefficient (though surprisingly copying files to/from a direct Lustre mount 
can be faster than FTP or SCP or other network copy tools).

You'd be better off to increase the size of your Lustre filesystem, enough that 
users can store "projects" there for some time while they compute, rather than 
needing to move the data on/off the filesystem a lot.

While using an all-SSD filesystem is appealing, you might find better 
performance with some kind of hybrid storage, like ZFS + L2ARC + Metadata 
Allocation Class (this feature is in development, target 2018-09, depending on 
your timeframe).  

You definitely want your MDT(s) to be SSDs, especially if you use the new 
Data-on-MDT feature to store small files there.  The OSTs can be HDDs to give 
you a lot more capacity for the same price.
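
For example, a default PFL layout that keeps the first part of every file on
the SSD MDT and the rest on HDD OSTs might look like this (sizes are only an
illustration, not a recommendation):

  lfs setstripe -E 1M -L mdt -E -1 -c 1 /mnt/lustre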

Cheers, Andreas

> -Original Message-
> From: lustre-discuss  On Behalf Of 
> Philippe Weill
> Sent: Saturday, April 28, 2018 1:14 AM
> To: lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] Do I need Lustre?
> 
> 
> 
> On 27/04/2018 at 19:07, Thackeray, Neil L wrote:
>> I’m new to the cluster realm, so I’m hoping for some good advice. We 
>> are starting up a new cluster, and I’ve noticed that lustre seems to be used 
>> widely in datacenters. The thing is I’m not sure the scale of our cluster 
>> will need it.
>> 
>> We are planning a small cluster, starting with 6 -8 nodes with 2 GPUs 
>> per node. They will be used for Deep Learning, MRI data processing, 
>> and Matlab among other things. With the size of the cluster we figure 
>> that 10Gb networking will be sufficient. We aren’t going to allow persistent 
>> storage on the cluster. Users will just upload and download data. I’m mostly 
>> concerned about I/O speeds. I don’t know if NFS would be fast enough to 
>> handle the data.
>> 
>> We are hoping that the cluster will grow over time. We are already talking 
>> about buying more nodes next fiscal year.
>> 
>> Thanks.
>> 
> 
> hello
> 
> You didn't say anything about the filesystem size needed, or whether you expect 
> to grow fast. We also run a small cluster (20 nodes), but for climate data 
> modeling results and satellite atmospheric data analysis we are growing at 
> least 300TB per year (2PB now), and it's easier for us to grow with Lustre.
> 
> 
> --
> Weill Philippe -  Administrateur Systeme et Reseaux
> CNRS/UPMC/IPSL   LATMOS (UMR 8190)
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] rhel 7.5

2018-04-30 Thread Dilger, Andreas
On Apr 30, 2018, at 11:49, Michael Di Domenico  wrote:
> 
> On Mon, Apr 30, 2018 at 10:09 AM, Jeff Johnson
>  wrote:
>> RHEL 7.5 support comes in Lustre 2.10.4. Only path I can think of off the
>> top of my head is to git clone and build a 2.10.4 prerelease and live on the
>> bleeding edge. I’m not sure if all of the 7.5 work is finished in the
>> current prerelease or not.
> 
> argh...  not sure i want to be that bleeding edge...  sadly, i can't
> find a release schedule for 2.10.4.  i wonder if 2.11 will work

You are free to do what you want, but I'd think 2.10.4 is far less "bleeding 
edge" than 2.11 compared to 2.10.2.  The 2.10.x branch only gets bug fixes, 
while significant new features are being added to 2.11.

Cheers, Andreas

>> On Mon, Apr 30, 2018 at 06:21 Michael Di Domenico 
>> wrote:
>>> 
>>> On Mon, Apr 30, 2018 at 9:19 AM, Michael Di Domenico
>>>  wrote:
 when i tried to compile 2.10.2 patchless client into rpms under rhel
 7.5 using kernel 3.10.0-862.el7.x86_64
 
 the compilation went fine as far as i can tell and the rpm creation
 seemed to work
 
 but when i went install the rpms i got
 
 Error: Package: kmod-lustre-client-2.10.2-1.el7.x86_64
 (/kmod-lustre-client-2.10.2-1.el7.x86_64
 requires: kernel < 3.10.0-694
>>> 
>>> premature send...
>>> 
>>> requires: kernel < 3.10.0-694
>>> Installed: kernel-3.10.0-862.el7.x86_64 (@updates/7.5)
>>> 
>>> did i do something wrong in the recompile of the rpms for the target
>>> kernel or is there a workaround for this?
>>> ___
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> 
>> --
>> --
>> Jeff Johnson
>> Co-Founder
>> Aeon Computing
>> 
>> jeff.john...@aeoncomputing.com
>> www.aeoncomputing.com
>> t: 858-412-3810 x1001   f: 858-412-3845
>> m: 619-204-9061
>> 
>> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
>> 
>> High-Performance Computing / Lustre Filesystems / Scale-out Storage
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre for home directories

2018-04-25 Thread Dilger, Andreas
On Apr 25, 2018, at 11:09, Riccardo Veraldi  
wrote:
> 
> Hello,
> just wondering if who is using lustre for home directories with several
> users is happy or not.

I can't comment for other people, but there are definitely some sites that are 
using Lustre for the /home directories.  Hopefully they will speak up here.

> I am considering to move home directories from NFS to Lustre/ZFS.
> it is quite easy to send the NFS server in troubles with just a few
> users copying files around.
> What special tuning is needed to optimize Lustre usage with small files?
> I guess 1M record size would not be a good choice anymore.

You should almost certainly use a default of stripe_count=1 for home 
directories, on the assumption that files should not be gigantic.

In that case, stripe size does not matter if you have 1-stripe files.  This 
does not affect the on-disk allocation size.  If you have dedicated OSTs for 
the /home directories, then I'd recommend NOT to use recordsize=1M for ZFS, and 
instead leave it at the default (recordsize=128k).
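
A sketch of both settings (the directory and dataset names are examples):

  lfs setstripe -c 1 /mnt/lustre/home       # default single-stripe layout for home dirs
  zfs set recordsize=128K homepool/ost0     # keep the ZFS default record size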

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-18 Thread Dilger, Andreas
On Apr 17, 2018, at 19:00, Faaland, Olaf P.  wrote:
> 
> So the problem was indeed that "routing" was disabled on the router node.  I 
> added "routing: 1" to the lnet.conf file for the routers and lctl ping works 
> as expected.
> 
> The question about the lnet module option "forwarding" still stands.  The 
> lnet module still accepts a parameter, "forwarding", but it doesn't do what 
> it used to.   Is that just a leftover that needs to be cleaned up?

I would say that the module parameter should continue to work, and be 
equivalent to the "routing: 1" YAML parameter.  This facilitates upgrades.

Did you try this with 2.10 (which also has LNet Multi-Rail), or are you coming 
from 2.7 or 2.8?

I'd recommend to file a ticket in Jira for this.  I suspect it might also be 
broken in 2.10, and the fix should be backported there as well.

Cheers, Andreas

> 
> From: Faaland, Olaf P.
> Sent: Tuesday, April 17, 2018 5:05 PM
> To: lustre-discuss@lists.lustre.org
> Subject: Re: Lustre 2.11 lnet troubleshooting
> 
> Update:
> 
> Joe pointed out "lnetctl set routing 1".  After invoking that on the router 
> node, the compute node reports the route as up:
> 
> [root@ulna66:lustre-211]# lnetctl route show -v
> route:
>- net: o2ib100
>  gateway: 192.168.128.4@o2ib33
>  hop: -1
>  priority: 0
>  state: up
> 
> Does this replace the lnet module parameter "forwarding"?
> 
> Olaf P. Faaland
> Livermore Computing
> 
> 
> 
> From: lustre-discuss  on behalf of 
> Faaland, Olaf P. 
> Sent: Tuesday, April 17, 2018 4:34:22 PM
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] Lustre 2.11 lnet troubleshooting
> 
> Hi,
> 
> I've got a cluster running 2.11 with 2 routers and 68  compute nodes.  It's 
> the first time I've used a post-multi-rail version of Lustre.
> 
> The problem I'm trying to troubleshoot is that my sample compute node 
> (ulna66) seems to think the router I configured (ulna4) is down, and so an 
> attempt to ping outside the cluster results in failure and "no route to XXX" 
> on the console.  I can lctl ping the router from the compute node and 
> vice-versa.   Forwarding is enabled on the router node via modprobe argument.
> 
> lnetctl route show reports that the route is down.  Where I'm stuck is 
> figuring out what in userspace (e.g. lnetctl or lctl) can tell me why.
> 
> The compute node's lnet configuration is:
> 
> [root@ulna66:lustre-211]# cat /etc/lnet.conf
> ip2nets:
>  - net-spec: o2ib33
>interfaces:
> 0: hsi0
>ip-range:
> 0: 192.168.128.*
> route:
>- net: o2ib100
>  gateway: 192.168.128.4@o2ib33
> 
> After I start lnet, systemctl reports success and the state is as follows:
> 
> [root@ulna66:lustre-211]# lnetctl net show
> net:
>- net type: lo
>  local NI(s):
>- nid: 0@lo
>  status: up
>- net type: o2ib33
>  local NI(s):
>- nid: 192.168.128.66@o2ib33
>  status: up
>  interfaces:
>  0: hsi0
> 
> [root@ulna66:lustre-211]# lnetctl peer show --verbose
> peer:
>- primary nid: 192.168.128.4@o2ib33
>  Multi-Rail: False
>  peer ni:
>- nid: 192.168.128.4@o2ib33
>  state: up
>  max_ni_tx_credits: 8
>  available_tx_credits: 8
>  min_tx_credits: 7
>  tx_q_num_of_buf: 0
>  available_rtr_credits: 8
>  min_rtr_credits: 8
>  refcount: 4
>  statistics:
>  send_count: 2
>  recv_count: 2
>  drop_count: 0
> 
> [root@ulna66:lustre-211]# lnetctl route show --verbose
> route:
>- net: o2ib100
>  gateway: 192.168.128.4@o2ib33
>  hop: -1
>  priority: 0
>  state: down
> 
> I can instrument the code, but I figure there must be someplace available to 
> normal users to look, that I'm unaware of.
> 
> thanks,
> 
> Olaf P. Faaland
> Livermore Computing
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] corrupt FID on zfs?

2018-04-09 Thread Dilger, Andreas
On Apr 9, 2018, at 02:10, Stu Midgley  wrote:
> 
> Afternoon
> 
> We have copied off all the files from an OST (lfs find identifies no files on 
> the OST) but the OST still has some left over files
> 
> eg.
> 
> 9.6G  O/0/d22/1277942
> 
> when I get the FID of this file using zfsobj2fid it appears to get a corrupt 
> FID
> 
> [0x20a48:0x1e86e:0x1]
> 
> which then returns
> 
> bad FID format '[0x20a48:0x1e86e:0x1]', should be [seq:oid:ver] (e.g. 
> [0x20400:0x2:0x0])
> 
> fid2path: error on FID [0x20a48:0x1e86e:0x1]: Invalid argument
> 
> when I check it with lfs fid2path

Try it with the last field as 0x0, like "[0x20a48:0x1e86e:0x0]".
On the OST, we use the last field to store the stripe index for the file,
so that LFSCK can reconstruct the file layout even if the MDT inode is
corrupted.
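
For example (the mount point is a placeholder):

  lfs fid2path /mnt/<fsname> "[0x20a48:0x1e86e:0x0]"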

> WTF?
> 
> Checking a few OST's this isn't isolated.  I've seen a few different 
> corruptions eg.
> 
> [0x20a48:0x1e86e:0x7]
> [0x20a48:0x1e684:0x3]
> 
> 
> Extra: quite a few files under the O/0/ directory didn't have trusted.fid 
> set... which seemed strange.

That is not unusual, since the parent (MDT inode) FID is only stored into the
object if it is modified by a client, or if an LFSCK layout check is run.

> So a few questions.  
> How did the FID type get corrupt?
> How did this file get orphaned?
> 
> I had to modify zfsobj2fid  to work with a mounted snapshot of the ZFS volume
> 
> # diff ../zfsobj2fid /sbin/zfsobj2fid
> 38c38
> < p = subprocess.Popen(["zdb", "-O", "-vvv", sys.argv[1], sys.argv[2]],
> ---
> > p = subprocess.Popen(["zdb", "-e", "-vvv", sys.argv[1], sys.argv[2]],

It would be great if you could submit this as a patch to Gerrit.


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] bad performance with Lustre/ZFS on NVMe SSD

2018-04-09 Thread Dilger, Andreas
On Apr 6, 2018, at 23:04, Riccardo Veraldi  
wrote:
> 
> So I've been struggling for months with this low performance on Lustre/ZFS.
> 
> Looking for hints.
> 
> 3 OSSes, RHEL 7.4, Lustre 2.10.3 and ZFS 0.7.6
> 
> each OSS has one  OST raidz
> 
>   pool: drpffb-ost01
>  state: ONLINE
>   scan: none requested
>   trim: completed on Fri Apr  6 21:53:04 2018 (after 0h3m)
> config:
> 
> NAME  STATE READ WRITE CKSUM
> drpffb-ost01  ONLINE   0 0 0
>   raidz1-0ONLINE   0 0 0
> nvme0n1   ONLINE   0 0 0
> nvme1n1   ONLINE   0 0 0
> nvme2n1   ONLINE   0 0 0
> nvme3n1   ONLINE   0 0 0
> nvme4n1   ONLINE   0 0 0
> nvme5n1   ONLINE   0 0 0
> 
> while the raidz without Lustre performs well at 6GB/s (1GB/s per disk),
> with Lustre on top of it performance is really poor.
> Most of all it is not stable at all and goes up and down between
> 1.5GB/s and 6GB/s. I tested with obdfilter-survey.
> LNET is ok and working at 6GB/s (using infiniband FDR)
> 
> What could be the cause of OST performance going up and down like a
> roller coaster ?

Riccardo,
to take a step back for a minute, have you tested all of the devices
individually, and also concurrently with some low-level tool like
sgpdd or vdbench?  After that is known to be working, have you tested
with obdfilter-survey locally on the OSS, then remotely on the client(s)
so that we can isolate where the bottleneck is being hit.
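
For the local OSS test, a sketch of an obdfilter-survey invocation (the target
name is an example; adjust object counts and thread counts to your setup):

  targets="drpffb-OST0001" case=disk size=16384 nobjlo=1 nobjhi=4 \
      thrlo=4 thrhi=32 obdfilter-survey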

Cheers, Andreas


> for reference here are few considerations:
> 
> filesystem parameters:
> 
> zfs set mountpoint=none drpffb-ost01
> zfs set sync=disabled drpffb-ost01
> zfs set atime=off drpffb-ost01
> zfs set redundant_metadata=most drpffb-ost01
> zfs set xattr=sa drpffb-ost01
> zfs set recordsize=1M drpffb-ost01
> 
> NVMe SSD are  4KB/sector
> 
> ashift=12
> 
> 
> ZFS module parameters
> 
> options zfs zfs_prefetch_disable=1
> options zfs zfs_txg_history=120
> options zfs metaslab_debug_unload=1
> #
> options zfs zfs_vdev_scheduler=deadline
> options zfs zfs_vdev_async_write_active_min_dirty_percent=20
> #
> options zfs zfs_vdev_scrub_min_active=48
> options zfs zfs_vdev_scrub_max_active=128
> #options zfs zfs_vdev_sync_write_min_active=64
> #options zfs zfs_vdev_sync_write_max_active=128
> #
> options zfs zfs_vdev_sync_write_min_active=8
> options zfs zfs_vdev_sync_write_max_active=32
> options zfs zfs_vdev_sync_read_min_active=8
> options zfs zfs_vdev_sync_read_max_active=32
> options zfs zfs_vdev_async_read_min_active=8
> options zfs zfs_vdev_async_read_max_active=32
> options zfs zfs_top_maxinflight=320
> options zfs zfs_txg_timeout=30
> options zfs zfs_dirty_data_max_percent=40
> options zfs zfs_vdev_scheduler=deadline
> options zfs zfs_vdev_async_write_min_active=8
> options zfs zfs_vdev_async_write_max_active=32
> 
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] latest kernel version supported by Lustre ?

2018-04-08 Thread Dilger, Andreas
What version of Lustre?  I think 2.11 clients work with something like 4.8? 
kernels, while 2.10 works with 4.4?  Sorry, I can't check the specifics right 
now. 

If you need a specific kernel, the best thing to do is try the configure/build 
step for Lustre with that kernel, and then check Jira/Gerrit for tickets for 
each build failure you hit. 
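
A sketch of that configure step for a patchless client build against a
specific kernel tree (the kernel source path is an example):

  ./configure --disable-server --with-linux=/usr/src/kernels/4.x.y-n.el7.elrepo.x86_64
  make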

It may be that there are some unlanded patches that can get you a running 
client. 

Cheers, Andreas

> On Apr 7, 2018, at 09:48, Riccardo Veraldi  
> wrote:
> 
> Hello,
> 
> if I would like to use a kernel 4.* from elrepo on RHEL 7.4 for the Lustre
> OSSes, what is the latest kernel 4 version supported by Lustre?
> 
> thank you
> 
> 
> Rick
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Upgrade to 2.11: unrecognized mount option

2018-04-06 Thread Dilger, Andreas
On Apr 6, 2018, at 06:27, Thomas Roth  wrote:
> 
> Hi all,
> 
> (don't know if it isn't a bit early to complain yet, but)
> I have upgraded an OSS and MDS from 2.10.2 to 2.11.0, just installing the 
> downloaded rpms - no issues here, except when mounting the MDS:
> 
> > LDISKFS-fs (drbd0): Unrecognized mount option 
> > "context="unconfined_u:object_r:user_tmp_t:s0"" or missing value
> 
> This mount option is visible also by 'tunefs.lustre --dryrun', so I followed 
> a tip on this list from last May and did
> 
> > tunefs.lustre --mountfsoptions="user_xattr,errors=remount-ro" /dev/drbd0
> 
> = keeping the rest of the mount options. Afterwards the mount worked.
> 
> 
> I checked, I formatted this MDS with
> 
> > mkfs.lustre --reformat --mgs --mdt --fsname=hebetest --index=0
> --servicenode=10.20.1.198@o2ib5 --servicenode=10.20.1.199@o2ib5
> --mgsnode=10.20.1.198@o2ib5 --mgsnode=10.20.1.199@o2ib5
> --mkfsoptions="-E stride=4,stripe-width=20 -O flex_bg,mmp,uninit_bg" 
> /dev/drbd0
> 
> 
> Just the defaults here?
> Where did the unknown mount option come from, and what does it mean anyway?

I suspect it's automatically added by SELinux, but I couldn't tell you
where or why.  Hopefully one of the people more familiar with SELinux
can answer, and it can be handled properly in the future.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] DoM question

2018-04-06 Thread Dilger, Andreas
On Apr 5, 2018, at 03:01, Martin BALVERS  wrote:
> 
> I have a question about the new Data on MDT feature.
> The default dom_stripesize is 1M, does this mean that smaller files will
> also consume 1M on the MDT ?
> 
> I was thinking of using this for my home dirs, but there are a lot of
> smaller files there, so maybe dom_stripesize=64k would be better.

The dom_stripesize setting is the MAXIMUM size that can be stored on the MDT.
It does not affect the amount of data allocated to the file, which will be in
units of 4096-byte blocks.

The intent is that, depending on how much space is available on the MDT and
your file size distribution, you set a PFL layout for the filesystem that puts
the first component on the MDT, and the rest of the file on the OSTs, like:

  lfs setstripe -E1M -L mdt -E64M -c1 -S1M -E8G -c4 -E-1 -c-1 -S4M /mnt/testfs

or whatever is appropriate for your file size mix.  This alleviates the need
for most users to set a layout for their files, while having good performance
for a wide variety of use cases.

Since most sites have file size distribution like 90% of files are below 1MB
and use only 5% of space, while 90% of space is used by only 5% of very large
files, it isn't a big deal that every file has the first 1MB on the MDT.  You
would only save 10% of space on the MDT by optimizing the remaining files to
not store the first 1MB there.

Note that if you are using DoM and FLR, you probably want to format your MDT
with non-default parameters (e.g. 256KB/inode, "-i 262144", if using ldiskfs),
or it will normally be 50% filled with inodes by default and only have a limited
amount of space for data (< 1.5KB/inode), directories, logs, etc.

For ZFS this is less of a concern, since it dynamically allocates blocks and
inodes, but it is still more likely to fill with data depending on your 
workload.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] /proc/fs/lustre/llite/*/stats

2018-04-06 Thread Dilger, Andreas
On Apr 5, 2018, at 19:44, Faaland, Olaf P.  wrote:
> 
> Hi,
> 
> I have a couple of questions about these stats.  If these are documented 
> somewhere, by all means point me to them.  What I found in the operations 
> manual and on the web did not answer my questions.
> 
> What do
> 
> read_bytes25673 samples [bytes] 1 3366225 145121869
> write_bytes   13641 samples [bytes] 1 3366225 468230469
> 
> mean in more detail?  I understand that the last three values are 
> MIN/MAX/SUM, and that their units are bytes, and that they reflect activity 
> since the file system was mounted or since the stats were last cleared.  But 
> more specifically:
> 
> samples:  Is this the number of requests issued to servers, e.g. RPC issued 
> with opcode OST_READ?  

No, these stats in the llite.*.stats file are "llite level" stats (i.e. they 
relate to the VFS operations).  If you want to get RPC-level stats you need to 
look at osc.*.stats.

> So if the user called read() 200 times on the same 1K file, which didn't ever 
> change and remained cached by the lustre client, and all the data was fetched 
> in a single RPC in the first place, then samples would be 1?  
> 
> And in that case, would the sum be 1K rather than 200K?

Simple testing shows that the read_bytes line has the number of read() syscalls 
and the total number of bytes read by the syscall (not the data read from the 
OST), even though both reads are from cache:

# lctl set_param llite.*.stats=clear
llite.testfs-880007524000.stats=clear
# dd if=/dev/zero of=/mnt/testfs/ff bs=1M count=1
1048576 bytes (1.0 MB) copied, 0.00220207 s, 476 MB/s
# dd of=/dev/null if=/mnt/testfs/ff bs=1k count=1k
1048576 bytes (1.0 MB) copied, 0.00197065 s, 532 MB/s
# dd of=/dev/null if=/mnt/testfs/ff bs=1k count=1k
1048576 bytes (1.0 MB) copied, 0.00188529 s, 556 MB/s
# lctl get_param llite.*.stats
llite.testfs-880007524000.stats=
snapshot_time 1523008010.817348638 secs.nsecs
read_bytes2048 samples [bytes] 1024 1024 2097152
write_bytes   1 samples [bytes] 1048576 1048576 1048576
open  3 samples [regs]
close 3 samples [regs]
seek  2 samples [regs]
truncate  1 samples [regs]
getxattr  1 samples [regs]
removexattr   1 samples [regs]
inode_permission  7 samples [regs]

Checking the OSC-level stats shows that there was a single write RPC of 1MB, 
and no read RPC at all, since the data remains in the client cache.

# lfs getstripe -i /mnt/testfs/ff
2
# lctl get_param osc.testfs-OST0002*.stats
osc.testfs-OST0002-osc-880007524000.stats=
snapshot_time 1523008200.913698356 secs.nsecs
req_waittime  83 samples [usec] 119 2461 51353 41125171
req_active83 samples [reqs] 1 1 83 83
ldlm_extent_enqueue   1 samples [reqs] 1 1 1 1
write_bytes   1 samples [bytes] 1048576 1048576 1048576 
1099511627776
ost_write 1 samples [usec] 2461 2461 2461 6056521
ost_connect   1 samples [usec] 280 280 280 78400
ost_punch 1 samples [usec] 291 291 291 84681
ost_statfs1 samples [usec] 119 119 119 14161
obd_ping  78 samples [usec] 164 1352 46717 29485783

Similarly, the ost.OSS.ost_io.stats file on the OSS will show the RPC stats 
handled by the whole server, while obdfilter.testfs-OST0002.stats will show the 
RPCs handled by this target, and osd-*.testfs-OST0002.brw_stats will show how 
the write was sent to disk (it will not show any read).  If a read is processed 
from the OSS read cache, it will appear at the ost_io and obdfilter level, but 
not at the osd-* level, since there was not actually any IO to disk.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Static lfs?

2018-03-24 Thread Dilger, Andreas
The "lfs" found in the build tree is a script wrapper generated by libtool to 
set up LD_LIBRARY_PATH and call the actual binary, which is in 
lustre/utils/.libs/lt-lfs or something similar.  I'm not much for libtool, but 
I figured out this much a few weeks ago when I wanted to run strace on lfs.

This is needed because the lfs binary is dynamically linked to libraries in the 
build tree, which are needed for it to run directly from the build tree, but 
would not be found by "ld" in their current locations otherwise.

If you run "ldd" on the real binary, it will tell you which libraries are 
needed. You can copy the requisite paths to the new system library directories 
to be able to run the actual binary there, and just ignore the lfs libtool 
wrapper script.
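
For example, from the top of the build tree (path as described above):

  ldd lustre/utils/.libs/lt-lfs     # list the shared libraries the real binary needs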

Cheers, Andreas

On Mar 23, 2018, at 15:59, Patrick Farrell wrote:


Another off list note pointing out that lfs is likely a script now.  So here's 
the bitter end:


Ah, it looks like you're correct.  There's still an lfs.c but it no longer 
generates the "lfs" executable as it previously - Instead there's a lengthy and 
complex script named "lfs" which is not invoked by "make", but only during the 
install process.  That generates the lfs binary that is actually installed...

Uck.  Well, I found where it squirrels away the real binary when executed.

Run the script lustre/utils/lfs in your build dir, and it will start lfs.  Quit 
it, and you will find the actual lfs binary in lustre/utils/.libs/lt-lfs

Maybe this particular bit of build tooling would be clearer if it didn't try to 
pretend it didn't exist by aping the binary without actually being it?

Thanks to John Bauer for help with this.



From: lustre-discuss on behalf of Patrick Farrell
Sent: Friday, March 23, 2018 3:17:14 PM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Static lfs?


Ah, interesting – I got a question off list about this, but I thought I’d reply 
here.



‘ldd’ on the lfs binary says “not a dynamic executable”.



So it seems I’m confused (never was much for compilers and linkers).  Here are 
the errors I get trying to run it on another node:
./lfs: line 202: cd: /home/build/paf/[……..]/lustre/utils: No such file or 
directory

gcc: error: lfs.o: No such file or directory

gcc: error: lfs_project.o: No such file or directory

gcc: error: ./.libs/liblustreapi.so: No such file or directory

gcc: error: ../../lnet/utils/lnetconfig/.libs/liblnetconfig.so



From: lustre-discuss on behalf of Patrick Farrell
Date: Friday, March 23, 2018 at 3:03 PM
To: "lustre-discuss@lists.lustre.org" 
>
Subject: [lustre-discuss] Static lfs?



Good afternoon,



I’ve got a developer question that perhaps someone has some insight on.  After 
some recent (a few months ago now) changes to make the Lustre libraries and 
utilities build dynamically linked rather than statically linked, I’ve got a 
problem.  If I build an lfs binary just by doing “make”, the resultant binary 
looks for various libraries in the build directories and cannot be run on any 
system other than the one it was built on (well, I guess without replicating 
the build directory structure).  When doing make rpms and installing the RPMs, 
it works fine.  The problem is “make rpms” takes ~5 minutes, as opposed to ~1 
second for “make” in /utils.  (I assume “make install” works too, but I 
explicitly need to test on nodes other than the one where I’m doing the build, 
so that’s not an option.)



Does anyone have any insight on a way around this for a developer?  Either some 
tweak I can make locally to get static builds again, or some fix to make that 
would let the dynamically linked binary from “make” have correct library paths? 
 (To be completely clear: The dynamically linked binary from “make” looks for 
libraries in the locations where they are built, regardless of whether or not 
they’re already installed in the normal system library locations.)



Regards,

Patrick Farrell



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Mixed size OST's

2018-03-16 Thread Dilger, Andreas
On Mar 15, 2018, at 09:48, Steve Thompson  wrote:
> 
> Lustre newbie here (1 month). Lustre 2.10.3, CentOS 7.4, ZFS 0.7.5. All 
> networking is 10 GbE.
> 
> I am building a test Lustre filesystem. So far, I have two OSS's, each with 
> 30 disks of 2 TB each, all in a single zpool per OSS. Everything works well, 
> and was suprisingly easy to build. Thus, two OST's of 60 TB each. File types 
> are comprised of home directories. Clients number about 225 HPC systems 
> (about 2400 cores).
> 
> In about a month, I will have a third OSS available, and about a month after 
> that, a fourth. Each of these two systems has 48 disks of 4 TB each. I am 
> looking for advice on how best to configure this. If I go with one OST per 
> system (one zpool comprising 8 x 6 RAIDZ2 vdevs), I will have a lustre f/s 
> comprised of two 60 TB OST's and two 192 TB OST's (minus RAIDZ2 overhead). 
> This is obviously a big mismatch between OST sizes. I have not encountered 
> any discussion of the effect of mixing disparate OST sizes. I could instead 
> format two 96 TB OST's on each system (two zpools of 4 x 6 RAIDZ2 vdevs), or 
> three 64 TB OST's, and so on. More OST's means more striping possibilities, 
> but less vdev's per zpool impacts ZFS performance negatively. More OST's per 
> OSS does not help with network bandwidth to the OSS. How would you go about 
> this?

This is a little bit tricky.  Lustre itself can handle different OST sizes,
as it will run in "QOS allocator" mode (essentially "Quantity of Space", the
full "Quality of Service" was not implemented).  This balances file allocation
across OSTs based on percentage of free space, at the expense of performance
being lower, as only the two new OSTs would be used for 192/252 ~= 75%
of the files, since it isn't possible to *also* use all the OSTs evenly at the
same time (assuming that network speed is your bottleneck, and not disk speed).

For home directory usage this may not be a significant issue. This performance
imbalance would balance out as the larger OSTs became more full, and would not
be seen when files are striped across all OSTs.
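
If you want to keep an eye on how that balancing behaves in practice, something
like the following is a reasonable starting point (a sketch only; the mount point
is arbitrary, and on newer releases the allocator tunables live under lod.* on
the MDS rather than lov.*):

  # per-OST usage, from any client, to see how unevenly full the OSTs are
  lfs df -h /mnt/lustre

  # on the MDS: free-space imbalance (percent) at which allocation switches
  # from round-robin to space-weighted QOS mode, and the free-space weighting
  lctl get_param lov.*.qos_threshold_rr
  lctl get_param lov.*.qos_prio_free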

I also thought about creating 3x OSTs per new OSS, so they would all be about
the same size and allocated equally.  That means the new OSS nodes would see
about 3x as much IO traffic as the old ones, especially for files striped over
all OSTs.  The drawback here is that the performance imbalance would stay
forever, so in the long run I don't think this is as good as just having a
single larger OST.  This will also become less of a factor as more OSTs are
added to the filesystem and/or you eventually upgrade the initial OSTs to
have larger disks and/or more VDEVs.


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Hole Identification in file

2018-03-10 Thread Dilger, Andreas
The FIEMAP ioctl is implemented slightly differently for multi-stripe files in 
Lustre - it essentially does the FIEMAP separately for each stripe so that it 
is easier to see whether the file is fragmented. See the filefrag program in 
the Lustre-patched e2fsprogs.
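
For example (the file name is arbitrary; this assumes the filefrag from the
Lustre-patched e2fsprogs is installed on a client):

  filefrag -v /mnt/lustre/somefile

The -v output lists the extents for each stripe/object separately, so holes and
fragmentation can be seen per OST object rather than only for the file as a whole.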

Cheers, Andreas

On Mar 9, 2018, at 06:01, lokesh jaliminche 
<lokesh.jalimin...@gmail.com<mailto:lokesh.jalimin...@gmail.com>> wrote:

Ahh got it!!
Meanwhile, I found one implicit method to find holes on the file i.e. using 
fiemap ioctl.
I have tested it and works fine

Logs :
=
dd if=/dev/urandom  of=sparse_file seek=10 bs=1M  count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0969376 s, 10.8 MB/s
[lokesh]# dd if=/dev/urandom  of=sparse_file seek=20 bs=1M  count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.087068 s, 12.0 MB/s
[lokesh]# dd if=/dev/urandom  of=sparse_file seek=30 bs=1M  count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0918352 s, 11.4 MB/s
[lokesh]# dd if=/dev/urandom  of=sparse_file seek=40 bs=1M  count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.946796 s, 11.1 MB/s
[lokesh]# dd if=/dev/urandom  of=sparse_file seek=50 bs=1M  count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.978465 s, 10.7 MB/s
[lokesh]# ./a.out sparse_file
Extents in file "sparse_file":4
Extents returned: 4
Logical: ###[10485760]Ext length: ###[1048576]Physical: ###[104666759168]flags: 2147483648
Logical: ###[20971520]Ext length: ###[1048576]Physical: ###[104677244928]flags: 2147483648
Logical: ###[31457280]Ext length: ###[1048576]Physical: ###[104687730688]flags: 2147483648
Logical: ###[41943040]Ext length: ###[20971520]Physical: ###[104689827840]flags: 2147483649

On Fri, Mar 9, 2018 at 2:12 PM, Dilger, Andreas 
<andreas.dil...@intel.com<mailto:andreas.dil...@intel.com>> wrote:
On Mar 6, 2018, at 00:12, lokesh jaliminche 
<lokesh.jalimin...@gmail.com<mailto:lokesh.jalimin...@gmail.com>> wrote:
>
> Hi,
>
> Does lustre support SEEK_HOLE/SEEK_DATA flag for lseek?
>
> I did some experiment with the below program to find out if lustre supports 
> SEEK_HOLE and SEEK_DATA flag. I found that lustre always returns the end of 
> the file for SEEK_HOLE and 0 for SEEK_DATA and as per the man page this is 
> the simplest implementation that file system can have(If they dont want to 
> support these flags). So just wanted to confirm.

Lustre does not directly support the SEEK_HOLE/SEEK_DATA interface currently.
What you are seeing is the default/minimum implementation provided by the
lseek interface.

Implementing these seek options might be quite complex for a Lustre file
because of multiple stripes in the layout.  You would need to locate holes
in the stripes, and then map the hole offset on the object to the file offset.

If this is something you are interested to implement I would be happy to
discuss it with you further.  Probably the best place to start is to file
an LU ticket with details, and then interested parties can discuss the
implementation there.

Cheers, Andreas

> Program :
> 
> #define _GNU_SOURCE   /* needed for SEEK_HOLE/SEEK_DATA */
> #include <stdio.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/types.h>
> #define SEEK_OFF 10485760
> int main()
> {
> int fd;
> char buffer[80];
> int i = 0;
> static char message[]="Hello world";
> fd=open("myfile", O_RDWR);
> if (fd == -1)   /* give up if the open failed */
> return 1;
> printf("creating hole by writing at each of %d strides\n", SEEK_OFF);
> for (i = 1; i < 10; i++)
> {
> int seek_off = i * SEEK_OFF;
> int sz;
> printf("seek_off %ld\n", lseek(fd, seek_off, SEEK_SET));
> sz = write(fd,message,sizeof(message));
> printf("write size = %d\n", sz);
> printf("String : %s\n", message);
> }
> printf("Demonstrating SEEK_HOLE and SEEK_DATA %d \n", SEEK_OFF);
> int start_off = 0;
> lseek(fd, 0, SEEK_SET);
> printf("after SEEK_HOLE start_off %ld\n", lseek(fd, 0, SEEK_HOLE));
> printf("after SEEK_DATA start_off %ld\n", lseek(fd, start_off, 
> SEEK_DATA));
> printf("after SEEK_HOLE start_off %ld\n", lseek(fd, 10485760, SEEK_HOLE));
> printf("after SEEK_DATA start_off %ld\n", lseek(fd, (10485760 *2 ), 
> SEEK_DATA));
> close(fd);
> }
>
> output:
> =
> after SEEK_HOLE start_off 94372142
> after SEEK_DATA start_off 0
> after SEEK_HOLE start_off 94372142
> after SEEK_DATA start_off 0
>
> Regards,
> Lokesh
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
>

Re: [lustre-discuss] Hole Identification in file

2018-03-09 Thread Dilger, Andreas
On Mar 6, 2018, at 00:12, lokesh jaliminche  wrote:
> 
> Hi,
> 
> Does lustre support SEEK_HOLE/SEEK_DATA flag for lseek?
> 
> I did some experiment with the below program to find out if lustre supports 
> SEEK_HOLE and SEEK_DATA flag. I found that lustre always returns the end of 
> the file for SEEK_HOLE and 0 for SEEK_DATA and as per the man page this is 
> the simplest implementation that file system can have(If they dont want to 
> support these flags). So just wanted to confirm. 

Lustre does not directly support the SEEK_HOLE/SEEK_DATA interface currently.
What you are seeing is the default/minimum implementation provided by the
lseek interface.

Implementing these seek options might be quite complex for a Lustre file
because of multiple stripes in the layout.  You would need to locate holes
in the stripes, and then map the hole offset on the object to the file offset.

If this is something you are interested to implement I would be happy to
discuss it with you further.  Probably the best place to start is to file
an LU ticket with details, and then interested parties can discuss the
implementation there.

Cheers, Andreas

> Program :
> 
> #define _GNU_SOURCE   /* needed for SEEK_HOLE/SEEK_DATA */
> #include <stdio.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/types.h>
> #define SEEK_OFF 10485760
> int main()
> {
> int fd; 
> char buffer[80];
> int i = 0;
> static char message[]="Hello world";
> fd=open("myfile", O_RDWR);
> if (fd == -1)   /* give up if the open failed */
> return 1;
> printf("creating hole by writing at each of %d strides\n", SEEK_OFF);
> for (i = 1; i < 10; i++)
> {   
> int seek_off = i * SEEK_OFF;
> int sz; 
> printf("seek_off %ld\n", lseek(fd, seek_off, SEEK_SET));
> sz = write(fd,message,sizeof(message));
> printf("write size = %d\n", sz);
> printf("String : %s\n", message);
> }   
> printf("Demonstrating SEEK_HOLE and SEEK_DATA %d \n", SEEK_OFF);
> int start_off = 0;
> lseek(fd, 0, SEEK_SET);
> printf("after SEEK_HOLE start_off %ld\n", lseek(fd, 0, SEEK_HOLE));
> printf("after SEEK_DATA start_off %ld\n", lseek(fd, start_off, 
> SEEK_DATA));
> printf("after SEEK_HOLE start_off %ld\n", lseek(fd, 10485760, SEEK_HOLE));
> printf("after SEEK_DATA start_off %ld\n", lseek(fd, (10485760 *2 ), 
> SEEK_DATA));
> close(fd);
> }
> 
> output:
> =
> after SEEK_HOLE start_off 94372142
> after SEEK_DATA start_off 0
> after SEEK_HOLE start_off 94372142
> after SEEK_DATA start_off 0
> 
> Regards,
> Lokesh
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client 2.10.3 with 2.1 server

2018-02-27 Thread Dilger, Andreas
On Feb 27, 2018, at 16:19, Raj Ayyampalayam  wrote:
> 
> We are using a lustre 2.1 server with 2.5 client.
> 
> Can the latest 2.10.3 client can be used with the 2.1 server?
> I figured I would ask the list before I start installing the client on a test 
> node.

I don't believe this is possible, due to changes in the protocol.  In any case, 
we haven't tested the 2.1 code in many years.

Very likely your "2.1" server is really a vendor port with thousands of 
patches, so you might consider to ask the vendor, in case they've tested this.  
If not, then I'd strongly recommend to upgrade to a newer release on the server.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre mount in heterogeneous net environment

2018-02-27 Thread Dilger, Andreas
On Feb 27, 2018, at 13:08, Ms. Megan Larko  wrote:
> 
> Hello List!
> 
> We have some 2.7.18 lustre servers using TCP.  Through some dual-homed Lustre 
> LNet routes we desire to connect some Mellanox (mlx4) InfiniBand Lustre 2.7.0 
> clients.  

Is there any reason to be running 2.7.0 clients?  Those are missing a huge 
number of fixes compared to newer clients.  Better to run matching 2.7.18 
clients, or 2.10.3.

Cheers, Andreas

> The "lctl ping" command works from both the server co-located MGS/MDS and 
> from the client.
> The mount of the TCP lustre server share from the IB client starts and then 
> shortly thereafter fails with "Input/output error ... Is the MGS running?"
> 
> The Lustre MDS at approximate 20 min. intervals from client mount request 
> /var/log/messages reports:
> Lustre: MGS: Client  (at A.B.C.D@o2ib) reconnecting 
> 
> The IB client mount command:
> mount -t lustre C.D.E.F@tcp0:/lustre /mnt/lustre
> 
> Waits about a minute then returns:
> mount.lustre C.D.E.F@tcp0:/lustre at /mnt/lustre failed:  Input/output error
> Is the MGS running?.
> 
> The IB client /var/log/messages file contains:
> Lustre: client.c:19349:ptlrpc_expire_one_request(()) @@@ Request sent has 
> timed out for slow reply .. -->MGCC.D.E.F@tcp was lost; in progress 
> operations using this service will fail
> LustreError: 15c-8: MGCC.D.E.F@tcp: The configuration from log 
> 'lustre-client' failed (-5)  This may be the result of communication errors 
> between this node and the MGS, a bad configuration, or other errors.  See the 
> syslog for more information.
> Lustre: MGCC.D.E.F@tcp: Connection restored to MGS (at C.D.E.F@tcp)
> Lustre: Unmounted lustre-client
> LustreError: 22939:0:(obd_mount.c:lustre_fill_super()) Unable to mount (-5)
> 
> We have not (yet) set any non-default values on the Lustre File System.
> *  Server: Lustre 2.7.18  CentOS Linux release 7.3.1611 (Core)  kernel 
> 3.10.0-514.2.2.el7_lustre.x86_64   The server is ethernet; no IB.
> 
> *  Client: Lustre-2.7.0  RHEL 6.8  kernel 2.6.32-696.3.2.el6.x86_64The 
> client uses Mellanox InfiniBand mlx4.
> 
> The mount point does exist on the client.   The firewall is not an issue; 
> checked.  SELinux is disabled.
> 
> NOTE: The server does serve the same /lustre file system to other TCP Lustre 
> clients.
> The client does mount other /lustre_mnt from other IB servers.
> 
> The info on 
> http://wiki.lustre.org/Mounting_a_Lustre_File_System_on_Client_Nodes 
> describes the situation exceedingly similar to ours.   I'm not sure what 
> Lustre settings to check if I have not explicitly set any to be different 
> than the default value.
> 
> Any hints would be genuinely appreciated.
> Cheers,
> megan
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] What does your (next) MDT look like?

2018-02-23 Thread Dilger, Andreas
I definitely think this would be interesting.  It needs some testing, and very 
likely a bit of work to get inline_data to work together with Lustre, but it 
makes sense now that DoM is available.  With the change to 1KB inode size in 
ldiskfs MDTs, this should allow files up to about 700 bytes to be stored inside 
the inode (or larger if the inode size was increased).
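
As a rough illustration of how this would be used from the client side once DoM
is enabled (the directory name and sizes are just examples, and this needs a
2.11+ client and server):

  # store the first 64KB of each new file in this directory on the MDT,
  # with anything beyond that striped normally across the OSTs
  lfs setstripe -E 64K -L mdt -E -1 -S 1M /mnt/lustre/smallfiles

Whether that first component then lands in the inode itself (inline_data) or in
regular MDT blocks would be an ldiskfs-level detail underneath this layout.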

Cheers, Andreas

> On Feb 23, 2018, at 07:58, Ben Evans <bev...@cray.com> wrote:
> 
> Slightly left-field question for MDTs, will we enable data on inode for
> really tiny files in ldiskfs?
> 
> -Ben
> 
> On 2/22/18, 2:02 PM, "lustre-discuss on behalf of Dilger, Andreas"
> <lustre-discuss-boun...@lists.lustre.org on behalf of
> andreas.dil...@intel.com> wrote:
> 
>> On Feb 6, 2018, at 10:32, E.S. Rosenberg <esr+lus...@mail.hebrew.edu>
>> wrote:
>>> 
>>> Hello fellow Lustre users :)
>>> 
>>> Since I didn't want to take the "size of MDT, inode count, inode size"
>>> thread too far off-topic I'm starting a new thread.
>>> 
>>> I'm curious how many people are using SSD MDTs?
>>> Also how practical is such a thing in a 2.11.x Data On MDT scenario?
>>> Is using some type of mix between HDD and SSD storage for MDTs
>>> practical?
>>> Does SSD vs HDD have an effect as far as ldiskfs vs zfs?
>> 
>> It is worthwhile to mention that using DoM is going to be a lot easier
>> with ZFS in a "fluid" usage environment than it will be with ldiskfs.
>> The ZFS MDTs do not have pre-allocated inode/data separation, so enabling
>> DoM will just mean you can put fewer inodes on the MDT if you put more
>> data there.  With ldiskfs you have to decide this ratio at format time.
>> The drawback is that ZFS is somewhat slower for metadata than ldiskfs,
>> though it has improved in 2.10 significantly.
>> 
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Principal Architect
>> Intel Corporation
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] What does your (next) MDT look like?

2018-02-22 Thread Dilger, Andreas
On Feb 6, 2018, at 10:32, E.S. Rosenberg  wrote:
> 
> Hello fellow Lustre users :)
> 
> Since I didn't want to take the "size of MDT, inode count, inode size" thread 
> too far off-topic I'm starting a new thread.
> 
> I'm curious how many people are using SSD MDTs?
> Also how practical is such a thing in a 2.11.x Data On MDT scenario?
> Is using some type of mix between HDD and SSD storage for MDTs practical?
> Does SSD vs HDD have an effect as far as ldiskfs vs zfs?

It is worthwhile to mention that using DoM is going to be a lot easier with ZFS 
in a "fluid" usage environment than it will be with ldiskfs.  The ZFS MDTs do 
not have pre-allocated inode/data separation, so enabling DoM will just mean 
you can put fewer inodes on the MDT if you put more data there.  With ldiskfs 
you have to decide this ratio at format time.  The drawback is that ZFS is 
somewhat slower for metadata than ldiskfs, though it has improved in 2.10 
significantly.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] size of MDT, inode count, inode size

2018-02-04 Thread Dilger, Andreas
On Feb 4, 2018, at 13:10, E.S. Rosenberg <esr+lus...@mail.hebrew.edu> wrote:
> On Sat, Feb 3, 2018 at 4:45 AM, Dilger, Andreas <andreas.dil...@intel.com> 
> wrote:
>> On Jan 26, 2018, at 07:56, Thomas Roth <t.r...@gsi.de> wrote:
>> >
>> > Hmm, option-testing leads to more confusion:
>> >
>> > With this 922GB-sdb1 I do
>> >
>> > mkfs.lustre --reformat --mgs --mdt ... /dev/sdb1
>> >
>> > The output of the command says
>> >
>> >   Permanent disk data:
>> > Target: test0:MDT
>> > ...
>> >
>> > device size = 944137MB
>> > formatting backing filesystem ldiskfs on /dev/sdb1
>> >   target name   test0:MDT
>> >   4k blocks 241699072
>> > options   -J size=4096 -I 1024 -i 2560 -q -O 
>> > dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
>> > lazy_journal_init -F
>> >
>> > mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT  -J size=4096 -I 1024 -i 
>> > 2560 -q -O 
>> > dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
>> > lazy_journal_init -F /dev/sdb1 241699072
>> 
>> The default options have to be conservative, as we don't know in advance how 
>> a filesystem will be used.  It may be that some sites will have lots of hard 
>> links or long filenames (which consume directory space == blocks, but not 
>> inodes), or they will have widely-striped files (which also consume xattr 
>> blocks).  The 2KB/inode ratio includes the space for the inode itself (512B 
>> in 2.7.x 1024B in 2.10), at least one directory entry (~64 bytes), some 
>> fixed overhead for the journal (up to 4GB on the MDT), and Lustre-internal 
>> overhead (OI entry = ~64 bytes), ChangeLog, etc.
>> 
>> If you have a better idea of space usage at your site, you can specify 
>> different parameters.
>> 
>> > Mount this as ldiskfs, gives 369 M inodes.
>> >
>> > One would assume that specifying one / some of the mke2fs-options here in 
>> > the mkfs.lustre-command will change nothing.
>> >
>> > However,
>> >
>> > mkfs.lustre --reformat --mgs --mdt ... --mkfsoptions="-I 1024" /dev/sdb1
>> >
>> > says
>> >
>> > device size = 944137MB
>> > formatting backing filesystem ldiskfs on /dev/sdb1
>> >   target name   test0:MDT
>> >   4k blocks 241699072
>> >   options   -I 1024 -J size=4096 -i 1536 -q -O 
>> > dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
>> > lazy_journal_init -F
>> >
>> > mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT -I 1024 -J size=4096 -i 1536 
>> > -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
>> > lazy_journal_init -F /dev/sdb1 241699072
>> >
>> > and the mounted devices now has 615 M inodes.
>> >
>> > So, whatever makes the calculation for the "-i / bytes-per-inode" value 
>> > becomes ineffective if I specify the inode size by hand?
>> 
>> This is a bit surprising.  I agree that specifying the same inode size value 
>> as the default should not affect the calculation for the bytes-per-inode 
>> ratio.
>> 
>> > How many bytes-per-inode do I need?
>> >
>> > This ratio, is it what the manual specifies as "one inode created for each 
>> > 2kB of LUN" ?
>> 
>> That was true with 512B inodes, but with the increase to 1024B inodes in 
>> 2.10 (to allow for PFL file layouts, since they are larger) the inode ratio 
>> has also gone up by 512B, to 2560B/inode.
> 
> Does this mean that someone who updates their servers from 2.x to 2.10 will 
> not be able to use PFL since the MDT was formatted in a way that can't 
> support it? (in our case formatted under Lustre 2.5 currently running 2.8)

It will be possible to use PFL layouts with older MDTs, but there may be a
performance impact if the MDTs are HDD based because a multi-component PFL
layout is unlikely to fit into the 512-byte inode, so they will allocate an
extra xattr block for each PFL file.  For SSD-based MDTs the extra seek is
not likely to impact performance significantly, but for HDD-based MDTs this
extra seek for accessing every file will reduce the metadata performance.

If you formatted the MDT filesystem for a larger default stripe count (e.g.
use "mkfs.lustre ... --stripe-count-hint=8" or more) then you will already
have 1024-byte inodes, and this is a non-issue.
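
A quick way to check what a given MDT was actually formatted with (assuming an
ldiskfs MDT; substitute your real MDT device and run this on the MDS):

  dumpe2fs -h /dev/mdtdev | grep -E "Inode size|Inode count"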

That said, the overall impact to your applications may be minimal if you do
not

Re: [lustre-discuss] restrict client access to lustre

2018-02-03 Thread Dilger, Andreas
On Jan 30, 2018, at 01:39, Ekaterina Popova  wrote:
> 
> Hello!
> 
> I would be very appreciated if you cleared things up to me.
> 
> If we use NFS we can export policies to restrict NFS access to volumes to 
> clients that match specific parameters. Can I do it on Lustre? Are there any 
> built-in mechanisms in Lustre filesystem for client access restriction?
> 
> Thank you very much for your assistance in advance!

Since Lustre 2.9 it is possible to use the "nodemap" feature to limit
the access of client nodes with specific NIDs.

If you want stronger authentication than just the client addresses,
then you can also use Shared Secret Key or Kerberos to identify the
clients from their crypto key or Kerberos ticket.  Unidentified clients
can be blocked from accessing the filesystem.
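
As a very rough sketch of the nodemap commands involved (the group name and NID
range here are made up, this is run on the MGS, and the manual's nodemap chapter
covers the full set of properties):

  # create a policy group and map a range of client NIDs into it
  lctl nodemap_add remote_site
  lctl nodemap_add_range --name remote_site --range 192.168.1.[2-254]@tcp

  # treat those clients as untrusted: squash root and remap their IDs
  lctl nodemap_modify --name remote_site --property trusted --value 0
  lctl nodemap_modify --name remote_site --property admin --value 0

  # turn on nodemap enforcement filesystem-wide
  lctl nodemap_activate 1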

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] size of MDT, inode count, inode size

2018-02-02 Thread Dilger, Andreas
On Jan 26, 2018, at 07:56, Thomas Roth  wrote:
> 
> Hmm, option-testing leads to more confusion:
> 
> With this 922GB-sdb1 I do
> 
> mkfs.lustre --reformat --mgs --mdt ... /dev/sdb1
> 
> The output of the command says
> 
>   Permanent disk data:
> Target: test0:MDT
> ...
> 
> device size = 944137MB
> formatting backing filesystem ldiskfs on /dev/sdb1
>   target name   test0:MDT
>   4k blocks 241699072
> options   -J size=4096 -I 1024 -i 2560 -q -O 
> dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
> lazy_journal_init -F
> 
> mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT  -J size=4096 -I 1024 -i 2560 
> -q -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
> lazy_journal_init -F /dev/sdb1 241699072

The default options have to be conservative, as we don't know in advance how a 
filesystem will be used.  It may be that some sites will have lots of hard 
links or long filenames (which consume directory space == blocks, but not 
inodes), or they will have widely-striped files (which also consume xattr 
blocks).  The 2KB/inode ratio includes the space for the inode itself (512B in 
2.7.x, 1024B in 2.10), at least one directory entry (~64 bytes), some fixed 
overhead for the journal (up to 4GB on the MDT), and Lustre-internal overhead 
(OI entry = ~64 bytes), ChangeLog, etc.

If you have a better idea of space usage at your site, you can specify 
different parameters.
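
For example, if you expect mostly single-striped files with short names, a format
along these lines would give roughly 25% more inodes than the default 2560
bytes-per-inode ratio (just a sketch - check the numbers against your expected
file count before reformatting anything):

  mkfs.lustre --reformat --mgs --mdt --fsname=test0 --index=0 \
      --mkfsoptions="-I 1024 -i 2048" /dev/sdb1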

> Mount this as ldiskfs, gives 369 M inodes.
> 
> One would assume that specifying one / some of the mke2fs-options here in the 
> mkfs.lustre-command will change nothing.
> 
> However,
> 
> mkfs.lustre --reformat --mgs --mdt ... --mkfsoptions="-I 1024" /dev/sdb1
> 
> says
> 
> device size = 944137MB
> formatting backing filesystem ldiskfs on /dev/sdb1
>   target name   test0:MDT
>   4k blocks 241699072
>   options   -I 1024 -J size=4096 -i 1536 -q -O 
> dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
> lazy_journal_init -F
> 
> mkfs_cmd = mke2fs -j -b 4096 -L test0:MDT -I 1024 -J size=4096 -i 1536 -q 
> -O dirdata,uninit_bg,^extents,mmp,dir_nlink,quota,huge_file,flex_bg -E 
> lazy_journal_init -F /dev/sdb1 241699072
> 
> and the mounted devices now has 615 M inodes.
> 
> So, whatever makes the calculation for the "-i / bytes-per-inode" value 
> becomes ineffective if I specify the inode size by hand?

This is a bit surprising.  I agree that specifying the same inode size value as 
the default should not affect the calculation for the bytes-per-inode ratio.

> How many bytes-per-inode do I need?
> 
> This ratio, is it what the manual specifies as "one inode created for each 
> 2kB of LUN" ?

That was true with 512B inodes, but with the increase to 1024B inodes in 2.10 
(to allow for PFL file layouts, since they are larger) the inode ratio has also 
gone up by 512B, to 2560B/inode.

> Perhaps the raw size of an MDT device should better be such that it leads to 
> "-I 1024 -i 2048"?

Yes, that is probably reasonable, since the larger inode also means that there 
is less chance of external xattr blocks being allocated.

Note that with ZFS there is no need to specify the inode ratio at all.  It will 
dynamically allocate inode blocks as needed, along with directory blocks, OI 
tables, etc., until the filesystem is full.

Cheers, Andreas

> On 01/26/2018 03:10 PM, Thomas Roth wrote:
>> Hi all,
>> what is the relation between raw device size and size of a formatted MDT? 
>> Size of inodes + free space = raw size?
>> The example:
>> MDT device has 922 GB in /proc/partions.
>> Formatted under Lustre 2.5.3 with default values for mkfs.lustre resulted in 
>> a 'df -h' MDT of 692G and more importantly 462M inodes.
>> So, the space used for inodes + the 'df -h' output add up to the raw size:
>>  462M inodes * 0.5kB/inode + 692 GB = 922 GB
>> On that system there are now 330M files, more than 70% of the available 
>> inodes.
>> 'df -h' says '692G  191G  456G  30% /srv/mds0'
>> What do I need the remaining 450G for? (Or the ~400G left once all the 
>> inodes are eaten?)
>> Should the format command not be tuned towards more inodes?
>> Btw, on a Lustre 2.10.2 MDT I get 369M inodes and 550 G space (with a 922G 
>> raw device): inode size is now 1024.
>> However, according to the manual and various Jira/Ludocs the size should be 
>> 2k nowadays?
>> Actually, the command within mkfs.lustre reads
>> mke2fs -j -b 4096 -L test0:MDT  -J size=4096 -I 1024 -i 2560  -F 
>> /dev/sdb 241699072
>> -i 2560 ?
>> Cheers,
>> Thomas
> 
> -
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org

Re: [lustre-discuss] undelete

2018-01-24 Thread Dilger, Andreas
On Jan 24, 2018, at 13:59, E.S. Rosenberg <e...@cs.huji.ac.il> wrote:
> On Wed, Jan 24, 2018 at 1:05 PM, Dilger, Andreas <andreas.dil...@intel.com> 
> wrote:
>> On Jan 22, 2018, at 19:03, E.S. Rosenberg <esr+lus...@mail.hebrew.edu> wrote:
>> >
>> > Dragging the old discussion back up
>> > First of thanks for all the replies last time!
>> >
>> > Last time in the end we didn't need to recover but now another user made
>> > a bigger mistake and we do need to recover data.
>> 
>> Sounds like it is time for backups and/or snapshots to avoid these issues in 
>> the future.  If you don't have space for a full filesystem backup, doing 
>> daily backups of the MDT simplifies such data recovery significantly.  
>> Preferably the backup is done from a snapshot using "dd" or "e2image", but 
>> even without a snapshot it is better to do a backup from the raw device than 
>> not at all.
> 
> Yeah I think our next Lustre system is going to be ZFS based so we should 
> have at least 1 snapshot at all times (more than that will probably be too 
> expensive).
> OTOH this whole saga is also excellent user education that will genuinely 
> drive the point home that they should only store reproducible data on Lustre 
> which as defined by us is scratch and not backed up. 
> 
>> > I have shut down our Lustre filesystem and am going to do some simulations
>> > on a test system trying various undelete tools.
>> >
>> > autopsy (sleuthkit) on the metadata shows that at least the structure is
>> > still there and hopefully we'll be able to recover more.
>> 
>> You will need to be able to recover the file layout from the deleted MDT 
>> inode (which existing ext4 recovery tools might help with), including the 
>> "lov" xattr, which is typically stored inside the inode itself unless the 
>> file was widely striped.
>> 
>> Secondly, you will also need to recover the matching OST inodes/objects that 
>> were deleted.  There may be deleted entries in the OST object directories 
>> (O/0/d*/) that tell you which inodes the objects were using.  Failing that, 
>> you may be able to tell from the "fid" xattr of deleted inodes which object 
>> they were.  Using the Lustre debugfs "stat " command may help on the 
>> OST.
>> 
>> You would need to undelete all of the objects in a multi-stripe file for 
>> that to be very useful.
>> 
>> > Has anyone ever done true recovery of Lustre or is it all just theoretical
>> > knowledge at the moment?
>> >
>> > What are the consequences of say undeleting data on OSTs that is then not
>> > referenced on the MDS? Could I cause corruption of the whole filesystem by
>> > doing stuff like that?
>> 
>> As long as you are not corrupting the actual OST or MDT filesystems by 
>> undeleting an inode whose blocks were reallocated to another file, it won't 
>> harm things.  At worst it would mean OST objects that are not reachable 
>> through the MDT namespace.  Running an lfsck namespace scan (2.7+) would 
>> link such OST objects into the $MOUNT/.lustre/lost+found directory if they 
>> are not referenced from any MDT inode.
>> 
>> > (As far as the files themselves go they are most likely all single striped
>> > since that is our default and we are pre PFL so that should be easier I
>> > think).
>> 
>> That definitely simplifies things significantly.
> 
> Some of what I wrote before was due to my hope to do in-place recovery and 
> make stuff 'visible' again on lustre.
> I actually ran into a different interesting issue: it seems extundelete 
> balks at huge ext filesystems (33T); it considers some of the superblock 
> values to be out-of-domain (a quick look at the source suggests to me that 
> they assumed INT32, but 32T is also the limit of ext3).

This seems like it wouldn't be too hard for you to fix?

> ext4magic returns the error 2133571465 from e2fsprogs, which according to the 
> source maps to EXT2_ET_CANT_USE_LEGACY_BITMAPS; not sure what to make of that.
> And bringing up the rear is ext3grep, which doesn't know xattrs and therefore 
> stops.

It would be possible for you to update these tools to support the new features.
Look at the e2fsprogs git history for when EXT2_ET_CANT_USE_LEGACY_BITMAPS was
added and IIRC it needs to add a flag to the ext2fs_open() code, and possibly
some use of wrapper functions if it is accessing bitmaps.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] undelete

2018-01-24 Thread Dilger, Andreas
On Jan 22, 2018, at 19:03, E.S. Rosenberg <esr+lus...@mail.hebrew.edu> wrote:
> 
> Dragging the old discussion back up
> First of thanks for all the replies last time!
> 
> Last time in the end we didn't need to recover but now another user made a 
> bigger mistake and we do need to recover data.

Sounds like it is time for backups and/or snapshots to avoid these issues in 
the future.  If you don't have space for a full filesystem backup, doing daily 
backups of the MDT simplifies such data recovery significantly.  Preferably the 
backup is done from a snapshot using "dd" or "e2image", but even without a 
snapshot it is better to do a backup from the raw device than not at all.
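
A minimal sketch of such an MDT backup, assuming an LVM-backed ldiskfs MDT
(device and volume group names are illustrative only):

  # take a consistent point-in-time snapshot of the MDT device
  lvcreate --snapshot --size 10G --name mdt_snap /dev/vg_mdt/mdt

  # save a metadata-only image (much smaller than a raw copy) ...
  e2image -Q /dev/vg_mdt/mdt_snap /backup/mdt.e2i

  # ... or a full raw copy if space permits
  dd if=/dev/vg_mdt/mdt_snap of=/backup/mdt.img bs=4M

  lvremove -f /dev/vg_mdt/mdt_snap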

> I have shut down our Lustre filesystem and am going to do some simulations on 
> a test system trying various undelete tools.
> 
> autopsy (sleuthkit) on the metadata shows that at least the structure is 
> still there and hopefully we'll be able to recover more.

You will need to be able to recover the file layout from the deleted MDT inode 
(which existing ext4 recovery tools might help with), including the "lov" 
xattr, which is typically stored inside the inode itself unless the file was 
widely striped.

Secondly, you will also need to recover the matching OST inodes/objects that 
were deleted.  There may be deleted entries in the OST object directories 
(O/0/d*/) that tell you which inodes the objects were using.  Failing that, you 
may be able to tell from the "fid" xattr of deleted inodes which object they 
were.  Using the Lustre debugfs "stat " command may help on the OST.
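
For example, something like this on the (unmounted or snapshotted) OST device,
where the object path follows the O/0/d*/ pattern described above and the object
number itself is just a placeholder:

  debugfs -c -R "stat O/0/d5/12345" /dev/ost_device

The inode dump includes the in-inode extended attributes, so the "fid" xattr
should be visible there.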

You would need to undelete all of the objects in a multi-stripe file for that 
to be very useful.

> Has anyone ever done true recovery of Lustre or is it all just theoretical 
> knowledge at the moment?
> 
> What are the consequences of say undeleting data on OSTs that is then not 
> referenced on the MDS? Could I cause corruption of the whole filesystem by 
> doing stuff like that?

As long as you are not corrupting the actual OST or MDT filesystems by 
undeleting an inode whose blocks were reallocated to another file, it won't 
harm things.  At worst it would mean OST objects that are not reachable through 
the MDT namespace.  Running an lfsck namespace scan (2.7+) would link such OST 
objects into the $MOUNT/.lustre/lost+found directory if they are not referenced 
from any MDT inode.
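
Something along these lines would trigger that scan (run on the MDS, with the
filesystem name adjusted to yours; a sketch only):

  lctl lfsck_start -M testfs-MDT0000 -t namespace
  lctl get_param mdd.testfs-MDT0000.lfsck_namespace    # check progress/status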

> (As far as the files themselves go they are most likely all single striped 
> since that is our default and we are pre PFL so that should be easier I 
> think).

That definitely simplifies things significantly.

Cheers, Andreas

> On Thu, May 4, 2017 at 2:21 AM, Dilger, Andreas <andreas.dil...@intel.com> 
> wrote:
> On Apr 27, 2017, at 05:43, E.S. Rosenberg <esr+lus...@mail.hebrew.edu> wrote:
> >
> > A user just rm'd a big archive of theirs on lustre, any way to recover it 
> > before it gets destroyed by other writes?
> 
> Just noticed this email.
> 
> In some cases, an immediate power-off followed by some ext4 recovery tools 
> (e.g. ext3grep) might get you some data back, but that is very uncertain.
> 
> With ZFS MDT/OST filesystems (or to a lesser extent LVM) it is possible to 
> create periodic snapshots of the filesystems for recovery purposes.  ZFS 
> handles this fairly well performance wise, LVM much less so.  With Lustre 
> 2.10 there are new tools to manage the ZFS snapshots and allow mounting them 
> as a separate Lustre filesystem.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Moving MGT to new host

2018-01-19 Thread Dilger, Andreas
On Jan 18, 2018, at 09:00, Michael Watters  wrote:
> 
> What is the proper way to move an MGT from one server to another?  I
> tried moving the disk between VMs and remounting it however the NID for
> the target never comes up.  Is there a way to add a new NID using a
> secondary IP with lnetctl?  I don't see a command to do so listed in the
> man page.

There is "lctl replace_nids"?

The manual also mentions:

38.18.4. Examples

Change the MGS's NID address. (This should be done on each target disk,
since they should all contact the same MGS.)

tunefs.lustre --erase-param --mgsnode=<new_nid> --writeconf /dev/sda
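
As a sketch of the replace_nids route (the filesystem name and NID below are
placeholders; the targets should be stopped and only the MGS mounted when this is
run, so check the lctl-replace_nids documentation first):

  # on the MGS node, for each target whose server NID has changed:
  lctl replace_nids testfs-MDT0000 192.168.1.10@tcp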

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lnet shutdown issue while cleint reboot

2018-01-03 Thread Dilger, Andreas
Please run "lctl get_param version" on a client to find the version currently 
running. 

The problem is that if you don't have libyaml-devel installed at build time, then 
lnetctl will not be built. As of 2.10.2 the libyaml-devel library is required 
so that lnetctl is always built. 

Cheers, Andreas

> On Jan 3, 2018, at 08:41, Parag Khuraswar  wrote:
> 
> Hi Jones,
> 
> I installed this version 2 months back. But now I am facing this client 
> reboot issue. Lustre has been setup and working properly.
> 
> Regards,
> Parag
> 
> 
> 
> -Original Message-
> From: Jones, Peter A [mailto:peter.a.jo...@intel.com] 
> Sent: Wednesday, January , 2018 8:31 PM
> To: Parag Khuraswar; 'Arman Khalatyan'
> Cc: 'Lustre discussion'
> Subject: Re: [lustre-discuss] lnet shutdown issue while cleint reboot
> 
> Ok, so that is the correct place and you meant git rather than GitHub. So 
> what tag are you building against? As mentioned previously 2.20 is not a 
> version of Lustre so perhaps this is a typo too?
> 
> 
> 
> 
>> On 2018-01-03, 6:54 AM, "lustre-discuss on behalf of Parag Khuraswar" 
>> > para...@citilindia.com> wrote:
>> 
>> Hi Jones,
>> 
>> I cloned from here-
>> 
>> git clone git://git.hpdd.intel.com/fs/lustre-release.git
>> 
>> Regards,
>> Parag
>> 
>> 
>> 
>> -Original Message-
>> From: Jones, Peter A [mailto:peter.a.jo...@intel.com]
>> Sent: Wednesday, January , 2018 7:52 PM
>> To: Parag Khuraswar; 'Arman Khalatyan'
>> Cc: 'Lustre discussion'
>> Subject: Re: [lustre-discuss] lnet shutdown issue while cleint reboot
>> 
>> What location on GitHub? Do you mean the IML repo?
>> 
>> 
>> 
>> 
>>> On 2018-01-03, 6:04 AM, "lustre-discuss on behalf of Parag Khuraswar" 
>>> >> para...@citilindia.com> wrote:
>>> 
>>> I cloned from github. Lnetctl is there on lustre servers but on clients 
>>> only lctl is available.
>>> 
>>> Regards,
>>> Parag
>>> 
>>> 
>>> 
>>> -Original Message-
>>> From: Arman Khalatyan [mailto:arm2...@gmail.com]
>>> Sent: Wednesday, January , 2018 7:06 PM
>>> To: Parag Khuraswar
>>> Cc: Lustre discussion
>>> Subject: Re: [lustre-discuss] lnet shutdown issue while cleint reboot
>>> 
>>> Strange, Is it some custom lustre ? the 2.20 is not yet there:
>>> http://lustre.org/download/
>>> lnetctl is inside since 2.10.x
>>> 
>>> 
 On Wed, Jan 3, 2018 at 12:46 PM, Parag Khuraswar  
 wrote:
 Hi,
 
> In my version of lustre on client nodes the 'lnetctl' command is not 
 available.
 
 Regards,
 Parag
 
 
 
 -Original Message-
 From: Arman Khalatyan [mailto:arm2...@gmail.com]
 Sent: Wednesday, January , 2018 4:54 PM
 To: Parag Khuraswar
 Cc: Lustre discussion
 Subject: Re: [lustre-discuss] lnet shutdown issue while cleint 
 reboot
 
 hi,
 Try this before reboot:
 umount /lustre
 service lnet stop
 lnetctl lnet unconfigure
 lustre_rmmod
 then reboot
 On Centos 7.4 it works.
 
 Cheers,
 Arman.
 
 
> On Wed, Jan 3, 2018 at 10:39 AM, Parag Khuraswar  
> wrote:
> Hi,
> 
> 
> 
> I am using lustre 2.20.1 on RHEL 7.3. On the lustre client nodes 
> when I shut down I get the attached error and the nodes don't shut down.
> 
> The procedure I follow to shut down the nodes is:
> 
> 1)  Unmount all Lustre file systems,
> 
> 2)  Stop lnet service ( lnet service stops successfully.),
> 
> 3)  Unload lustre module,
> 
> 
> 
> After performing the above steps and rebooting, the node gets stuck; the error is 
> attached.
> 
> 
> 
> Regards,
> 
> Parag
> 
> 
> 
> 
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
 
>>> 
>>> ___
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> 
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client in a container

2018-01-03 Thread Dilger, Andreas
On Dec 31, 2017, at 01:50, David Cohen  wrote:
> 
> Patrick,
> Thanks for you response.
> I am looking for a way to migrate from a 1.8.9 system to 2.10.2, stable enough to 
> run for the several weeks or more that it might take.

Note that there is no longer direct support for upgrading from 1.8 to 2.10.  

That said, are you upgrading the filesystem in place, or are you copying the 
data from the 1.8.9 filesystem to the 2.10.2 filesystem?  In the latter case, 
the upgrade compatibility doesn't really matter.  What you need is a client 
that can mount both server versions at the same time.

Unfortunately, no 2.x clients can mount the 1.8.x server filesystem directly, 
so that does limit your options.  There was a time of interoperability with 1.8 
clients being able to mount 2.1-ish servers, but that doesn't really help you.  
You could upgrade the 1.8 servers to 2.1 or later, and then mount both 
filesystems with a 2.5-ish client, or upgrade the servers to 2.5.

Cheers, Andreas

> On Sun, Dec 31, 2017 at 12:12 AM, Patrick Farrell  wrote:
> David,
> 
> I have no direct experience trying this, but I would imagine not - Lustre is 
> a kernel module (actually a set of kernel modules), so unless the container 
> tech you're using allows loading multiple different versions of *kernel 
> modules*, this is likely impossible.  My limited understanding of container 
> tech on Linux suggests that this would be impossible, containers allow 
> userspace separation but there is only one kernel/set of modules/drivers.
> 
> I don't know of any way to run multiple client versions on the same node.
> 
> The other question is *why* do you want to run multiple client versions on 
> one node...?  Clients are usually interoperable across a pretty generous set 
> of server versions.
> 
> - Patrick
> 
> 
> From: lustre-discuss  on behalf of 
> David Cohen 
> Sent: Saturday, December 30, 2017 11:45:15 AM
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] Lustre Client in a container
>  
> Hi,
> Is it possible to run Lustre client in a container?
> The goal is to run two different client versions on the same node, can it be 
> done?
> 
> David
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Building client packages on Fedora 20

2017-12-26 Thread Dilger, Andreas
Could you please submit a patch for this so that it is fixed for the next user, 
and for your future builds. Please see 
https://wiki.hpdd.intel.com/display/PUB/Using+Gerrit for details.

Cheers, Andreas

On Dec 26, 2017, at 12:42, Michael Watters 
> wrote:


In case anybody is curious I was able to build packages after fixing the 
kernel_module_package rpm macro as follows.

diff -u kmodtool.orig kmodtool
--- kmodtool.orig   2017-12-21 13:04:21.343437273 -0500
+++ kmodtool2017-12-26 13:34:15.076952532 -0500
@@ -114,7 +114,7 @@

 if [ "no" != "$kmp_nobuildreqs" ]
 then
-echo "BuildRequires: kernel${dashvariant}-devel-%{_target_cpu} = ${verrel}"
+echo "BuildRequires: kernel${dashvariant}-devel"
 fi

 if [ "" != "$kmp_override_preamble" ]

This macro is located in the /usr/lib/rpm/redhat/kmodtool file.


On 12/22/2017 07:56 AM, Michael Watters wrote:

Has anybody attempted to build lustre client rpms on Fedora 20?  I am
able to compile the lustre v2_9_59_0 branch however the make rpms
command fails with an error as follows.

Processing files: lustre-client-2.9.59-1.fc20.x86_64
error: File not found: 
/tmp/rpmbuild-lustre-root-nrGADExI/BUILDROOT/lustre-2.9.59-1.x86_64/etc/init.d/lsvcgss

RPM build errors:
File not found: 
/tmp/rpmbuild-lustre-root-nrGADExI/BUILDROOT/lustre-2.9.59-1.x86_64/etc/init.d/lsvcgss

Patching the lustre.spec.in file to avoid this error then results in a
different error as shown below.

RPM build errors:
Installed (but unpackaged) file(s) found:
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client-tests/fs/llog_test.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/fid.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/fld.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/lmv.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/lov.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/lustre.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/mdc.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/mgc.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/obdclass.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/obdecho.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/osc.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/fs/ptlrpc.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/net/ko2iblnd.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/net/ksocklnd.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/net/libcfs.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/net/lnet.ko
   /lib/modules/3.19.8-100.fc20.x86_64/extra/lustre-client/net/lnet_selftest.ko
make: *** [rpms] Error 1

Here are the commands I used to build the packages.

git checkout v2_9_59_0
sh ./autogen.sh
./configure --disable-server --enable-client --with-linux=/lib/modules/`uname 
-r`/build
make rpms

Any suggestions on how to fix this?  I know that Fedora 20 is EOL
however I would still like to be able to build a lustre client package
if possible.





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Announce: Lustre Systems Administration Guide

2017-12-15 Thread Dilger, Andreas
On Dec 13, 2017, at 09:02, DEGREMONT Aurelien <aurelien.degrem...@cea.fr> wrote:
> 
> Hello
> 
> My recommendation would be to go for something like Documentation-as-a-code.
> It is now very easy to deploy an infrastructure to automatically generate 
> nice looking HTML and PDF versions of a rst or markdown documentation thanks 
> to readthedocs.io

Maybe I'm missing something, but it seems like the Sphinx markup is just a 
different form of inline text markup compared to DocBook XML that the current 
manual is using?  Also, we already host the manual repo in Git, and 
automatically build PDF, HTML, and ePub formats.

My main thought about moving to a wiki format is that this makes it easier for 
users to contribute to the manual, since the text could be edited "in place" in 
a web browser.  Hopefully that would reduce the barrier to entry, and more 
people would update the manual and improve the places where it is outdated.

> readthedocs.io uses Sphinx as backend, likewise kernel.org 
> (https://www.kernel.org/doc/html/v4.11/doc-guide/sphinx.html)
> 
> A github project could be easy plugged to readthedocs, but you could use your 
> own hooks.

We started with Doxygen markup for the code (I don't know why it wasn't the 
same as the kernel), but it would probably be possible to write a script to 
convert the existing code markup to Sphinx if we wanted.

> If you use GitHub to store your source code, you can easily edit the 
> documentation source code and create a pull request. I know this is not in 
> line with Gerrit usage, but this is an interesting workflow.

Could you please explain?  It isn't clear to me how this is significantly 
different from Git+Gerrit?  It is also possible to edit source files directly 
in Gerrit, though it isn't clear to me if it is possible to _start_ editing a 
file and submit a patch vs. being able to edit an existing change.  In either 
case, one still needs to use the inline markup language.  Does Github have a 
WYSIWYG editor for the manual?

> I really think we should use such technologies, which produce very 
> nice looking documentation that can be easily exported, versioned, and 
> coupled with a git repository. See how kernel.org uses it.

I'm not necessarily against this, but it would be a considerable amount of work 
to convert the existing manual, and there would have to be a clear benefit.  Do 
people think that the Sphinx markup is much easier to use than the DocBook XML? 
I don't think we use many unusual features in the XML.

I guess the other important factor would be whether there are people interested 
to do this work?  I contribute to the existing manual when I get a chance, and 
don't find the XML very hard to use, but if converting to a new format would 
convince more people to contribute then this is something we should consider.

Cheers, Andreas

> MediaWiki is not a good fit for a doc manual like the Lustre one. Not easy to 
> browse, not easy to export.
> 
> 
> My 2 cents
> 
> Aurélien
> 
> Le 17/11/2017 à 23:03, Dilger, Andreas a écrit :
>> On Nov 16, 2017, at 22:41, Cowe, Malcolm J <malcolm.j.c...@intel.com> wrote:
>>> I am pleased to announce the availability of a new systems administration 
>>> guide for the Lustre file system, which has been published to 
>>> wiki.lustre.org. The content can be accessed directly from the front page 
>>> of the wiki, or from the following URL:
>>>  http://wiki.lustre.org/Category:Lustre_Systems_Administration
>>>  The guide is intended to provide comprehensive instructions for the 
>>> installation and configuration of production-ready Lustre storage clusters. 
>>> Topics covered:
>>> • Introduction to Lustre
>>> • Lustre File System Components
>>> • Lustre Software Installation
>>> • Lustre Networking (LNet)
>>> • LNet Router Configuration
>>> • Lustre Object Storage Devices (OSDs)
>>> • Creating Lustre File System Services
>>> • Mounting a Lustre File System on Client Nodes
>>> • Starting and Stopping Lustre Services
>>> • Lustre High Availability
>>>  Refer to the front page of the guide for the complete table of contents.
>> Malcolm,
>> thanks so much for your work on this.  It is definitely improving the
>> state of the documentation available today.
>> 
>> I was wondering if people have an opinion on whether we should remove
>> some/all of the administration content from the Lustre Operations Manual,
>> and make that more of a reference manual that contains details of
>> commands, architecture, features, etc. as a second-level reference from
>> the wiki admin guide?
>> 
>> For that matter, should we export the XML Manual

Re: [lustre-discuss] Can Linux FS-Cache/CacheFS run on Lustre 2.10.x

2017-12-12 Thread Dilger, Andreas
On Dec 11, 2017, at 20:31, forrest.wc.l...@dell.com wrote:
> 
> Dear All:
>  
> The best practice of DGX-1 on storage for DL, requires 4x SSDs to be local 
> cache backend to improve IO performance by using Linux cacheFS. 
> http://docs.nvidia.com/deeplearning/dgx/pdf/Best-Practices.pdf
> 6.1.1. Internal Storage
> The first storage consideration is storage within the DGX-1 itself. For the 
> best possible performance, a NFS read cache has been included in the DGX-1 
> appliance using the Linux cacheFS capability. It uses four SSD’s in a RAID-0 
> group. The drives are connected to a dedicated hardware RAID controller.
>  
> I looked Redhat Linux, the manual shows the FS-Cache is a persistent local 
> cache that can be used by file systems to take data retrieved from over the 
> network and cache it on local disk. This helps minimize network traffic for 
> users accessing data from a file system mounted over the network (for 
> example, NFS).
>  
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-fscache
>   
>  
> Can Linux FS-Cache/CacheFS run on Lustre 2.10.x, or will the Persistent 
> Client-side Cache in Lustre 2.12 (to be released next year) have the same 
> function as FS-Cache on NFS?

It is not possible to use FSCache/CacheFS with Lustre today (at least there is 
no public patch that I know of that does this).

The Persistent Client Cache feature will provide equivalent functionality for 
Lustre as CacheFS/FSCache.  We have been discussing whether to use 
CacheFS/FSCache for Lustre instead of the dedicated PCC code, but there are 
significant differences in the architecture (file vs. block based cache), and 
we are leaning toward the dedicated PCC code as providing better functionality 
for Lustre, as well as the flexibility to modify it to suit our needs.

If you are interested in this, it would be interesting/useful if you are able 
to test out the patch and run the NVidia benchmarks to compare Lustre+PCC vs. 
NFS+CacheFS.  I doubt SSDs+RAID-0 makes sense, vs. having a single PCI NVMe 
device like P3700 or similar.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] BAD CHECKSUM

2017-12-09 Thread Dilger, Andreas
Based on the messages on the client, this isn’t related to mmap() or writes 
done by the client, since the data has the same checksum from before it was 
sent and after it got the checksum error returned from the server. That means 
the pages did not change on the client.

Possible causes include the client network card, server network card, memory, 
or possibly the OFED driver?  It could of course be something in Lustre/LNet, 
though we haven’t had any reports of anything similar.

When the checksum code was first written, it was motivated by a faulty Ethernet 
NIC that had TCP checksum offload, but bad onboard cache, and the data was 
corrupted when copied onto the NIC but the TCP checksum was computed on the bad 
data and the checksum was “correct” when received by the server, so it didn’t 
cause TCP resends.

Are you seeing this on multiple servers?  The client log only shows one server, 
while the server log shows multiple clients.  If it is only happening on one 
server it might point to hardware.

Did you also upgrade the kernel and OFED at the same time as Lustre? You could 
try building Lustre 2.10.1 on the old 2.9.0 kernel and OFED to see if that 
works properly.

Cheers, Andreas

On Dec 9, 2017, at 11:09, Hans Henrik Happe <ha...@nbi.dk<mailto:ha...@nbi.dk>> 
wrote:



On 09-12-2017 18:57, Hans Henrik Happe wrote:
On 07-12-2017 21:36, Dilger, Andreas wrote:
On Dec 7, 2017, at 10:37, Hans Henrik Happe <ha...@nbi.dk<mailto:ha...@nbi.dk>> 
wrote:
Hi,

Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
overwriting memory while being DMA'ed to network?

After upgrading to 2.10.1 on the server side we started seeing this from
a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
errors. We have not yet established whether the application is doing
things correctly.
If applications are using mmap IO it is possible for the page to become 
inconsistent after the checksum has been computed.  However, mmap IO is
normally detected by the client and no message should be printed.

There isn't anything that the application needs to do, since the client will 
resend the data if there is a checksum error, but the resends do slow down the 
IO.  If the inconsistency is on the client, there is no cause for concern 
(though it would be good to figure out the root cause).

It would be interesting to see what the exact error message is, since that will 
say whether the data became inconsistent on the client, or over the network.  
If the inconsistency is over the network or on the server, then that may point 
to hardware issues.
I've attached logs from a server and a client.

There was a cut n' paste error in the first set of files. This should be
better.

Looks like a something goes wrong over the network.

Cheers,
Hans Henrik



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] BAD CHECKSUM

2017-12-07 Thread Dilger, Andreas
On Dec 7, 2017, at 10:37, Hans Henrik Happe  wrote:
> 
> Hi,
> 
> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
> overwriting memory while being DMA'ed to network?
> 
> After upgrading to 2.10.1 on the server side we started seeing this from
> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
> errors. We have not yet established whether the application is doing
> things correctly.

If applications are using mmap IO it is possible for the page to become 
inconsistent after the checksum has been computed.  However, mmap IO is
normally detected by the client and no message should be printed.

There isn't anything that the application needs to do, since the client will 
resend the data if there is a checksum error, but the resends do slow down the 
IO.  If the inconsistency is on the client, there is no cause for concern 
(though it would be good to figure out the root cause).

It would be interesting to see what the exact error message is, since that will 
say whether the data became inconsistent on the client, or over the network.  
If the inconsistency is over the network or on the server, then that may point 
to hardware issues.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.1 + RHEL7 Lock Callback Timer Expired

2017-12-06 Thread Dilger, Andreas
On Nov 29, 2017, at 04:34, Charles A Taylor  wrote:
> 
> 
> We have a genomics pipeline app (supernova) that fails consistently due to 
> the client being evicted on the OSSs with a  “lock callback timer expired”.  
> I doubled “nlm_enqueue_min” across the cluster but then the timer simply 
> expired after 200s rather than 100s so I don’t think that is the answer.   
> The syslog/dmesg on the client shows no signs of distress and it is a 
> “bigmem” machine with 1TB of RAM.  

Hi Charles,
I haven't seen much action on this email, so I thought I'd ask a few questions 
to see what is unusual in your system/application that might be causing 
problems.

The problem appears to be the client being unresponsive to server requests, or 
possibly a network problem, since the client is complaining about a number of 
different servers at the same time.

Firstly, is it only the "bigmem" client that is having problems, or does that 
happen with other clients?  Does this application run on other clients with 
less RAM?  I'm just wondering if there is something Lustre is doing that takes 
too long when there is a large amount of RAM (e.g. managing DLM locks, pages in 
RAM, etc.)?

I see that you are using MOFED on the client, and in-kernel OFED on the 
servers.  I'm not a network person, but this could be a source of problems at 
the network level.  You might consider to enable LNet error logging to the 
console, to see if there are connection problems reported:

client# lctl set_param printk=+neterror

If this problem is seen on a regular basis, you could try changing a few Lustre 
tunables to see if that reduces/eliminates the problem.  Firstly, reducing the 
amount of data and locks that Lustre will cache may help avoid the time it is 
spending on requests from the servers:

client# lctl set_param llite.*.max_cached_mb=128G
client# lctl set_param ldlm.namespaces.*.lru_size=1024

If it is not a network problem, but rather a "client is busy" problem that is 
helped by reducing the RAM/locks usage, it would be useful to make "perf trace" 
flame graphs to see where the CPU is used:

http://www.brendangregg.com/flamegraphs.html
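
A minimal sketch of capturing one while the workload is running (this assumes the
FlameGraph scripts from that page are checked out in the current directory):

client# perf record -F 99 -a -g -- sleep 60     # sample all CPUs for 60s while the problem occurs
client# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > lustre-client.svg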

Cheers, Andreas

> The eviction appears to come while the application is processing a large 
> number (~300) of data “chunks” (i.e. files) which occur in pairs.
> 
> -rw-r--r-- 1 chasman ufhpc 24 Nov 28 23:31 
> ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_bcs
> -rw-r--r-- 1 chasman ufhpc 34M Nov 28 23:31 
> ./Tdtest915/ASSEMBLER_CS/_ASSEMBLER/_ASM_SN/SHARD_ASM/fork0/join/files/chunk233.sedge_asm
> 
> I assume the 24-byte file is metadata (an index or some such) and the 34M 
> file is the actual data but I’m just guessing since I’m completely unfamiliar 
> with the application.  
> 
> The write error is,
> 
> #define ENOTCONN 107 /* Transport endpoint is not connected */
> 
> which occurs after the OSS eviction.  This was reproducible under 2.5.3.90 as 
> well.  We hoped that upgrading to 2.10.1 would resolve the issue but it has 
> not.  
> 
> This is the first application (in 10 years) we have encountered that 
> consistently and reliably fails when run over Lustre.  I’m not sure at this 
> point whether this is a bug or tuning issue.
> If others have encountered and overcome something like this, we’d be grateful 
> to hear from you.
> 
> Regards,
> 
> Charles Taylor
> UF Research Computing
> 
> OSS:
> --
> Nov 28 23:41:41 ufrcoss28 kernel: LustreError: 
> 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer 
> expired after 201s: evicing client at 10.13.136.74@o2ib  ns: 
> filter-testfs-OST002e_UUID lock: 880041717400/0x9bd23c8dc69323a1 lrc: 
> 3/0,0 mode: PW/PW res: [0x7ef2:0x0:0x0].0x0 rrc: 3 type: EXT 
> [0->18446744073709551615] (req 4096->1802239) flags: 0x6400010020 nid: 
> 10.13.136.74@o2ib remote: 0xe54f26957f2ac591 expref: 45 pid: 6836 timeout: 
> 6488120506 lvb_type: 0
> 
> Client:
> ———
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: 
> testfs-OST002e-osc-88c053fe3800: operation ost_write to node 
> 10.13.136.30@o2ib failed: rc = -107
> Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002e-osc-88c053fe3800: 
> Connection to testfs-OST002e (at 10.13.136.30@o2ib) was lost; in progress 
> operations using this service will wait for recovery to complete
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 167-0: 
> testfs-OST002e-osc-88c053fe3800: This client was evicted by 
> testfs-OST002e; in progress operations using this service will fail.
> Nov 28 23:41:42 s5a-s23 kernel: LustreError: 11-0: 
> testfs-OST002c-osc-88c053fe3800: operation ost_punch to node 
> 10.13.136.30@o2ib failed: rc = -107
> Nov 28 23:41:42 s5a-s23 kernel: Lustre: testfs-OST002c-osc-88c053fe3800: 
> Connection to testfs-OST002c (at 10.13.136.30@o2ib) was lost; in progress 
> operations using this service will wait for recovery to complete
> Nov 28 23:41:42 s5a-s23 

Re: [lustre-discuss] Recompiling client from the source doesnot contain lnetctl

2017-11-30 Thread Dilger, Andreas
You should also check the config.log to see if it properly detected libyaml 
being installed and enabled “USE_DLC” for the build:

configure:35728: checking for yaml_parser_initialize in -lyaml
configure:35791: result: yes
configure:35801: checking whether to enable dlc
configure:35815: result: yes
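
For example, a quick check from the top of the build tree:

$ grep -A1 -E "yaml_parser_initialize|whether to enable dlc" config.log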

Cheers, Andreas

On Nov 29, 2017, at 05:28, Arman Khalatyan <arm2...@gmail.com> wrote:

even in the extracted source code the lnetctl does not compile.
Running make in the utils folder produces wirecheck, lst and
routerstat, but not lnetctl.
After running "make lnetctl" in the utils folder
/tmp/lustre-2.10.2_RC1/lnet/utils

it produces the executable.


On Wed, Nov 29, 2017 at 11:52 AM, Arman Khalatyan <arm2...@gmail.com> wrote:
Hi Andreas,
I just checked the yaml-devel it is installed:
yum list installed | grep yaml
libyaml.x86_64 0.1.4-11.el7_0  @base
libyaml-devel.x86_64   0.1.4-11.el7_0  @base

and still no success:
rpm -qpl rpmbuild/RPMS/x86_64/*.rpm| grep lnetctl
/usr/share/man/man8/lnetctl.8.gz
/usr/src/debug/lustre-2.10.2_RC1/lnet/include/lnet/lnetctl.h

are there any other dependencies ?

Thanks,
Arman.

On Wed, Nov 29, 2017 at 6:46 AM, Dilger, Andreas <andreas.dil...@intel.com> wrote:
On Nov 28, 2017, at 07:58, Arman Khalatyan <arm2...@gmail.com> wrote:

Hello,
I would like to recompile the client from the rpm-source but looks
like the packaging on the jenkins is wrong:

1) wget 
https://build.hpdd.intel.com/job/lustre-b2_10/arch=x86_64,build_type=client,distro=el7,ib_stack=inkernel/lastSuccessfulBuild/artifact/artifacts/SRPMS/lustre-2.10.2_RC1-1.src.rpm
2) rpmbuild --rebuild --without servers lustre-2.10.2_RC1-1.src.rpm
after the successful build the rpms doesn't contain the lnetctl but
the help only
3) cd /root/rpmbuild/RPMS/x86_64
4) rpm -qpl ./*.rpm| grep lnetctl
/usr/share/man/man8/lnetctl.8.gz
/usr/src/debug/lustre-2.10.2_RC1/lnet/include/lnet/lnetctl.h

The   lustre-client-2.10.2_RC1-1.el7.x86_64.rpm on the jenkins
contains the lnetctl
Maybe I should add more options to rebuild the client + lnetctl?

You need to have libyaml-devel installed on your build node.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] billions of 50k files

2017-11-29 Thread Dilger, Andreas
On Nov 29, 2017, at 15:31, Brian Andrus  wrote:
> 
> All,
> 
> I have always seen lustre as a good solution for large files and not the best 
> for many small files.
> Recently, I have seen a request for a small lustre system (2 OSSes, 1 MDS) 
> that would be for billions of files that average 50k-100k.

This is about 75TB of usable capacity per billion files.  Are you looking at 
HDD or SSD storage?  RAID or mirror?  What kind of client load, and how much 
does this system need to scale in the future?

> It seems to me, that for this to be 'of worth', the block sizes on disks need 
> to be small, but even then, with tcp overhead and inode limitations, it may 
> still not perform all that well (compared to larger files).

Even though Lustre does 1MB or 4MB RPCs, it only allocates as much space on the 
OSTs as needed for the file data.  This means 4KB blocks with ldiskfs, and 
variable (power-of-two) blocksize on ZFS (64KB or 128KB blocks by default). You 
could constrain ZFS to smaller blocks if needed (e.g. recordsize=32k), or 
enable ZFS compression to try and fit the data into smaller blocks (depends 
whether your data is compressible or not).
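
For example (a sketch only; the pool/dataset names below are placeholders and the
values depend on your data), the OST datasets could be tuned with:

oss# zfs set recordsize=32K ostpool/ost0       # cap the block size used for new files
oss# zfs set compression=lz4 ostpool/ost0      # pack compressible data into fewer blocks
oss# zfs get recordsize,compression ostpool/ost0

Note that recordsize only affects files written after the change.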

The drawback is that every Lustre file currently needs an MDT inode (1KB+) and 
an OST inode, so Lustre isn't the most efficient for small files.

> Am I off here? Have there been some developments in lustre that help this 
> scenario (beyond small files being stored on the MDT directly)?

The Data-on-MDT feature (DoM) has landed for 2.11, which seems like it would 
suit your workload well, since it only needs a single MDT inode for small 
files, and reduces the overhead when accessing the file.  DoM will still be a 
couple of months before that is released, though you could start testing now if 
you were interested.  Currently DoM is intended to be used together with OSTs, 
but if there is a demand we could look into what is needed to run an MDT-only 
filesystem configuration (some checks in the code that prevent the filesystem 
becoming available before at least one OST is mounted would need to be removed).
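
As a rough illustration (syntax from the current 2.11 development patches, so treat
it as a sketch, and the 64KB threshold is just an example), small files could be
kept entirely on the MDT with a directory layout like:

client# lfs setstripe -E 64K -L mdt -E -1 -c 1 /mnt/lustre/smallfiles
# files up to 64KB live on the MDT only; larger files spill over to a
# single-stripe OST component for the remainder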

That said, you could also just set up a single NFS server with ZFS to handle 
the 75TB * N of storage, unless you need highly concurrent access to the files. 
 This would probably be acceptable if you don't need to scale too much (in 
capacity or performance), and don't have a large number of clients connecting.

One of the other features we're currently investigating (not sure how much 
interest there is yet) is to be able to "import" an existing ext4 or ZFS 
filesystem into Lustre as MDT (with DoM), and be able to grow horizontally 
by adding more MDTs or OSTs.  Some work is already being done that will 
facilitate this in 2.11 (DoM, and OI Scrub for ZFS), but more would be needed 
for this to work.  That would potentially allow you to start with a ZFS or ext4 
NFS server, and then migrate to Lustre if you need to scale it up.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre compilation error

2017-11-29 Thread Dilger, Andreas
Rick,
Would you be able to open a ticket for this, and possibly submit a patch to fix 
the build?

Cheers, Andreas

On Nov 29, 2017, at 14:18, Mohr Jr, Richard Frank (Rick Mohr) wrote:


On Oct 18, 2017, at 9:44 AM, parag_k wrote:


I got the source from github.

My configure line is-

./configure --disable-client 
--with-kernel-source-header=/usr/src/kernels/3.10.0-514.el7.x86_64/ 
--with-o2ib=/usr/src/ofa_kernel/default/


Are you still running into this issue?  If so, try adding "--enable-server" and 
removing "--disable-client".  I was building lustre 2.10.1 today, and I 
initially had both "--disable-client" and "--enable-server" in my configuration 
line.  When I did that, I got error messages like these:

make[3]: *** No rule to make target `fld.ko', needed by `all-am'.  Stop.

When I removed the "--disable-client" option, the error went away.
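
In other words, a configure line along these lines (reusing the kernel and OFED
paths quoted above, which are specific to that system) should avoid the
missing-target errors:

./configure --enable-server \
    --with-kernel-source-header=/usr/src/kernels/3.10.0-514.el7.x86_64/ \
    --with-o2ib=/usr/src/ofa_kernel/default/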

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.1 + RHEL7 Page Allocation Failures

2017-11-29 Thread Dilger, Andreas
In particular, see the patch https://review.whamcloud.com/30164

LU-10133 o2iblnd: fall back to vmalloc for mlx4/mlx5

If a large QP is allocated with kmalloc(), but fails due to memory
fragmentation, fall back to vmalloc() to handle the allocation.
This is done in the upstream kernel, but was only fixed in mlx4
in the RHEL7.3 kernel, and neither mlx4 or mlx5 in the RHEL6 kernel.
Also fix mlx5 for SLES12 kernels.

Test-Parameters: trivial
Signed-off-by: Andreas Dilger
Change-Id: Ie74800edd27bf4c3210724079cbebbae532d1318

On Nov 29, 2017, at 06:09, Jones, Peter A  wrote:
> 
> Charles
> 
> That ticket is completely open so you do have access to everything. As I 
> understand it the options are to either use the latest MOFED update rather 
> than relying on the in-kernel OFED (which I believe is the advise usually 
> provided by Mellanox anyway) or else apply the kernel patch Andreas has 
> created that is referenced in the ticket.
> 
> Peter
> 
> On 2017-11-29, 2:50 AM, "lustre-discuss on behalf of Charles A Taylor" 
>  wrote:
> 
>> 
>> Hi All,
>> 
>> We recently upgraded from Lustre 2.5.3.90 on EL6 to 2.10.1 on EL7 (details 
>> below) but have hit what looks like LU-10133 (order 8 page allocation 
>> failures).
>> 
>> We don’t have access to look at the JIRA ticket in more detail but from what 
>> we can tell the the fix is to change from vmalloc() to vmalloc_array() in 
>> the mlx4 drivers.  However, the vmalloc_array() infrastructure is in an 
>> upstream (far upstream) kernel so I’m not sure when we’ll see that fix.
>> 
>> While this may not be a Lustre issue directly, I know we can’t be the only 
>> Lustre site running 2.10.1 over IB on Mellanox ConnectX-3 HCAs.  So far we 
>> have tried increasing vm.min_free_kbytes to 8GB but that does not help.  
>> Zone_reclaim_mode is disabled (for other reasons that may not be valid under 
>> EL7) but order 8 chunks get depleted on both NUMA nodes so I’m not sure that 
>> is the answer either (though we have not tried it yet).
>> 
>> [root@ufrcmds1 ~]# cat /proc/buddyinfo 
>> Node 0, zone  DMA  1  0  0  0  2  1  1  
>> 0  1  1  3 
>> Node 0, zoneDMA32   1554  13496  11481   5108150  0  0  
>> 0  0  0  0 
>> Node 0, zone   Normal 114119 208080  78468  35679   6215690  0  
>> 0  0  0  0 
>> Node 1, zone   Normal  81295 184795 106942  38818   4485293   1653  
>> 0  0  0  0 
>> 
>> I’m wondering if other sites are hitting this and, if so, what are you doing 
>> to work around the issue on your OSSs.  
>> 
>> Regards,
>> 
>> Charles Taylor
>> UF Research Computing
>> 
>> 
>> Some Details:
>> ---
>> OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
>> Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
>> Clients: ~1400 (still running 2.5.3.90 but we are in the process of 
>> upgrading)
>> Servers: 10 HA OSS pairs (20 OSSs)
>>128 GB RAM
>>6 OSTs (8+2 RAID-6) per OSS 
>>Mellanox ConnectX-3 IB/VPI HCAs 
>>RedHat Native IB Stack (i.e. not MOFED)
>>mlx4_core driver:
>>   filename:   
>> /lib/modules/3.10.0-693.2.2.el7_lustre.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
>>   version:2.2-1
>>   license:Dual BSD/GPL
>>   description:Mellanox ConnectX HCA low-level driver
>>   author: Roland Dreier
>>   rhelversion:7.4
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Recompiling client from the source doesnot contain lnetctl

2017-11-28 Thread Dilger, Andreas
On Nov 28, 2017, at 07:58, Arman Khalatyan  wrote:
> 
> Hello,
> I would like to recompile the client from the rpm-source but looks
> like the packaging on the jenkins is wrong:
> 
> 1) wget 
> https://build.hpdd.intel.com/job/lustre-b2_10/arch=x86_64,build_type=client,distro=el7,ib_stack=inkernel/lastSuccessfulBuild/artifact/artifacts/SRPMS/lustre-2.10.2_RC1-1.src.rpm
> 2) rpmbuild --rebuild --without servers lustre-2.10.2_RC1-1.src.rpm
> after the successful build the rpms doesn't contain the lnetctl but
> the help only
> 3) cd /root/rpmbuild/RPMS/x86_64
> 4) rpm -qpl ./*.rpm| grep lnetctl
> /usr/share/man/man8/lnetctl.8.gz
> /usr/src/debug/lustre-2.10.2_RC1/lnet/include/lnet/lnetctl.h
> 
> The   lustre-client-2.10.2_RC1-1.el7.x86_64.rpm on the jenkins
> contains the lnetctl
> Maybe I should add more options to rebuild the client + lnetctl?

You need to have libyaml-devel installed on your build node.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Announce: Lustre Systems Administration Guide

2017-11-28 Thread Dilger, Andreas
On Nov 17, 2017, at 20:20, Stu Midgley <sdm...@gmail.com> wrote:
> 
> Thank you both for the documentation.  I know how hard it is to maintain. 
> 
> I've asked that all my admin staff to read it - even if some of it doesn't 
> directly apply to our environment.
> 
> What we would like is well organised, comprehensive, accurate and up to date 
> documentation.  Most of the time when I dive into the manual, or other online 
> material, I find it isn't quite right (paths slightly wrong or outdated, 
> etc.). 

The manual is open to contributions if you find problems therein.  Please see:

https://wiki.hpdd.intel.com/display/PUB/Making+changes+to+the+Lustre+Manual+source

> I also have difficulty finding all the information I want in a single 
> location and in a logical fashion.  These aren't new issues and blight all 
> documentation, but having the definitive source in a wiki might open it up to 
> more transparency, greater use and thus, ultimately, being kept up to date, 
> even if its by others outside Intel.

I'd be thrilled if there were contributors to the manual outside of Intel.  
IMHO, users who are not intimately familiar with Lustre are the best people to 
know when the manual isn't clear or is missing information.  I personally don't 
read the manual very often, though I do reference it on occasion.  When I find 
something wrong or outdated, I submit a patch, and it generally is landed 
quickly.

> I'd also like a section where people can post their experiences and 
> solutions.  For example, in recent times, we have battled bad interactions 
> with ZFS+lustre which led to poor performance and ZFS corruption.  While we 
> have now tuned both lustre and zfs and the bugs have mostly been fixed, the 
> learnings, troubleshooting methods, etc. should be preserved and might assist 
> others in diagnosing tricky problems in the future.

Stack overflow for Lustre?  I've been wondering about some kind of Q&A forum 
for Lustre for a while.  This would be a great project to propose to OpenSFS to 
be hosted on the lustre.org site (Intel does not manage that site).  I suspect 
there are numerous engines available for this already, and it just needs 
someone interested and/or knowledgeable enough to pick an engine and get it 
installed there.

Cheers, Andreas

> On Sat, Nov 18, 2017 at 6:03 AM, Dilger, Andreas <andreas.dil...@intel.com> 
> wrote:
> On Nov 16, 2017, at 22:41, Cowe, Malcolm J <malcolm.j.c...@intel.com> wrote:
> >
> > I am pleased to announce the availability of a new systems administration 
> > guide for the Lustre file system, which has been published to 
> > wiki.lustre.org. The content can be accessed directly from the front page 
> > of the wiki, or from the following URL:
> >
> > http://wiki.lustre.org/Category:Lustre_Systems_Administration
> >
> > The guide is intended to provide comprehensive instructions for the 
> > installation and configuration of production-ready Lustre storage clusters. 
> > Topics covered:
> >
> >   • Introduction to Lustre
> >   • Lustre File System Components
> >   • Lustre Software Installation
> >   • Lustre Networking (LNet)
> >   • LNet Router Configuration
> >   • Lustre Object Storage Devices (OSDs)
> >   • Creating Lustre File System Services
> >   • Mounting a Lustre File System on Client Nodes
> >   • Starting and Stopping Lustre Services
> >   • Lustre High Availability
> >
> > Refer to the front page of the guide for the complete table of contents.
> 
> Malcolm,
> thanks so much for your work on this.  It is definitely improving the
> state of the documentation available today.
> 
> I was wondering if people have an opinion on whether we should remove
> some/all of the administration content from the Lustre Operations Manual,
> and make that more of a reference manual that contains details of
> commands, architecture, features, etc. as a second-level reference from
> the wiki admin guide?
> 
> For that matter, should we export the XML Manual into the wiki and
> leave it there?  We'd have to make sure that the wiki is being indexed
> by Google for easier searching before we could do that.
> 
> Cheers, Andreas
> 
> > In addition, for people who are new to Lustre, there is a high-level 
> > introduction to Lustre concepts, available as a PDF download:
> >
> > http://wiki.lustre.org/images/6/64/LustreArchitecture-v4.pdf
> >
> >
> > Malcolm Cowe

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre and Elasticsearch

2017-11-26 Thread Dilger, Andreas
The flock locks (and regular LDLM locks for Lustre metadata and data extents) 
are reconstructed from client state if the MDS or OSS crash.

Cheers, Andreas

On Nov 26, 2017, at 21:03, John Bent <johnb...@gmail.com> wrote:

How does the lock manager avoid disk IO?  Locks don’t survive MDS0 failure?

On Nov 26, 2017, at 8:29 PM, Dilger, Andreas <andreas.dil...@intel.com> wrote:

The flock functionality only affects applications that are actually using it. 
It does not add any overhead for applications that do not use flock.

There are two flock options:

 - localflock, which only keeps locking on the local client node and is 
sufficient for applications that only run on a single node
- flock, which adds locking between applications on different clients mounted 
with this option. This is if you have a distributed application that is running 
on multiple clients that controls its file access via flock (e.g. 
Producer/consumer).

The overhead itself depends on how much the application is actually using 
flock. The lock manager is on MDS0, and uses Lustre RPCs (which can run at 
100k/s or higher), and does not involve any disk IO.

Cheers, Andreas

On Nov 26, 2017, at 12:03, E.S. Rosenberg <esr+lus...@mail.hebrew.edu> wrote:

Hi Torsten,
Thanks that worked!

Do you or anyone on the list know if/how flock affects Lustre performance?

Thanks again,
Eli

On Tue, Nov 21, 2017 at 9:18 AM, Torsten Harenberg <torsten.harenb...@cern.ch> wrote:
Hi Eli,

Am 21.11.17 um 01:26 schrieb E.S. Rosenberg:
> So I was wondering would this issue be solved by Lustre bindings for
> Java or is this a way of locking that isn't supported by Lustre?

I know nothing about Elastic Search, but have you tried to mount Lustre
with "flock" in the mount options?

Cheers

 Torsten

--
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<>  <>
<> Dr. Torsten Harenberg 
torsten.harenb...@cern.ch  <>
<> Bergische Universitaet   <>
<> Fakutät 4 - PhysikTel.: +49 (0)202 
439-3521  <>
<> Gaussstr. 20  Fax : +49 (0)202 
439-2811  <>
<> 42097 Wuppertal   @CERN: Bat. 1-1-049<>
<>  <>
<><><><><><><>< Of course it runs NetBSD http://www.netbsd.org ><>

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre and Elasticsearch

2017-11-26 Thread Dilger, Andreas
The flock functionality only affects applications that are actually using it. 
It does not add any overhead for applications that do not use flock.

There are two flock options:

 - localflock, which only keeps locking on the local client node and is 
sufficient for applications that only run on a single node
- flock, which adds locking between applications on different clients mounted 
with this option. This is if you have a distributed application that is running 
on multiple clients that controls its file access via flock (e.g. 
Producer/consumer).

The overhead itself depends on how much the application is actually using 
flock. The lock manager is on MDS0, and uses Lustre RPCs (which can run at 
100k/s or higher), and does not involve any disk IO.
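
For example (the MGS NID, filesystem name and mount point below are placeholders):

client# mount -t lustre -o flock mgsnode@o2ib:/fsname /mnt/lustre       # cluster-wide coherent flock
client# mount -t lustre -o localflock mgsnode@o2ib:/fsname /mnt/lustre  # single-client-only flock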

Cheers, Andreas

On Nov 26, 2017, at 12:03, E.S. Rosenberg <esr+lus...@mail.hebrew.edu> wrote:

Hi Torsten,
Thanks that worked!

Do you or anyone on the list know if/how flock affects Lustre performance?

Thanks again,
Eli

On Tue, Nov 21, 2017 at 9:18 AM, Torsten Harenberg <torsten.harenb...@cern.ch> wrote:
Hi Eli,

Am 21.11.17 um 01:26 schrieb E.S. Rosenberg:
> So I was wondering would this issue be solved by Lustre bindings for
> Java or is this a way of locking that isn't supported by Lustre?

I know nothing about Elastic Search, but have you tried to mount Lustre
with "flock" in the mount options?

Cheers

 Torsten

--
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<>  <>
<> Dr. Torsten Harenberg 
torsten.harenb...@cern.ch  <>
<> Bergische Universitaet   <>
<> Fakutät 4 - PhysikTel.: +49 (0)202 
439-3521  <>
<> Gaussstr. 20  Fax : +49 (0)202 
439-2811  <>
<> 42097 Wuppertal   @CERN: Bat. 1-1-049<>
<>  <>
<><><><><><><>< Of course it runs NetBSD http://www.netbsd.org ><>

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfs_migrate rsync vs. lfs migrate and layout swap

2017-11-25 Thread Dilger, Andreas
You should be able to push using SSH, which I'd imagine would not be blocked?  
It is possible to also fetch patches via http and git protocol, but I don't 
think we allow unauthenticated pushes. 

Cheers, Andreas

> On Nov 25, 2017, at 15:01, Daniel Kobras <kob...@linux.de> wrote:
> 
> Hi!
> 
> 
>> Am 20.11.2017 um 00:01 schrieb Dilger, Andreas <andreas.dil...@intel.com>:
>> 
>> It would be interesting to strace your rsync vs. "lfs migrate" read/write 
>> patterns so that the copy method of "lfs migrate" can be improved to match 
>> rsync. Since they are both userspace copy actions they should be about the 
>> same performance. It may be that "lfs migrate" is using O_DIRECT to minimize 
>> client cache pollution (I don't have the code handy to check right now).  In 
>> the future we could use "copyfile()" to avoid this as well. 
> 
> lfs migrate indeed uses O_DIRECT for reading the source file. A few tests on 
> a system running 2.10.1 yielded a 10x higher throughput with a modified lfs 
> migrate that simply dropped the O_DIRECT flag. I’ve filed 
> https://jira.hpdd.intel.com/browse/LU-10278 about it. (A simple patch to make 
> O_DIRECT optional is ready, but I still need to charm the gods of the 
> firewall to let me push to Gerrit.)
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfs_migrate rsync vs. lfs migrate and layout swap

2017-11-24 Thread Dilger, Andreas
On Nov 24, 2017, at 06:55, Dauchy, Nathan (ARC-TNC)[CSRA, LLC] 
<nathan.dau...@nasa.gov> wrote:
> 
> For those following along and interested in another difference between the 
> two migrate methods...
> 
> lfs migrate layout swap is (apparently) able to handle files with multiple 
> links, whereas the rsync method bails out.

There is a patch to allow rsync to handle hard-linked files, but I'm not sure 
if it has landed yet.  The layout swap method is of course preferred.


> ____________
> From: Dilger, Andreas [andreas.dil...@intel.com]
> Sent: Sunday, November 19, 2017 4:01 PM
> To: Dauchy, Nathan (ARC-TNC)[CSRA, LLC]
> Cc: lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] lfs_migrate rsync vs. lfs migrate and layout 
> swap
> 
> It would be interesting to strace your rsync vs. "lfs migrate" read/write 
> patterns so that the copy method of "lfs migrate" can be improved to match 
> rsync. Since they are both userspace copy actions they should be about the 
> same performance. It may be that "lfs migrate" is using O_DIRECT to minimize 
> client cache pollution (I don't have the code handy to check right now).  In 
> the future we could use "copyfile()" to avoid this as well.
> 
> The main benefit of migrate is that it keeps the open file handles and inode 
> number on the MDS. Using rsync is just a copy+rename, which is why it is not 
> safe for in-use files.
> 
> There is no need to clean up volatile files, they are essentially 
> open-unlinked files, so they clean up automatically if the program or client 
> crash.
> 
> Cheers, Andreas
> 
>> On Nov 19, 2017, at 11:31, Dauchy, Nathan (ARC-TNC)[CSRA, LLC] 
>> <nathan.dau...@nasa.gov> wrote:
>> 
>> Greetings,
>> 
>> I'm trying to clarify and confirm the differences between lfs_migrate's use 
>> of rsync vs. "lfs migrate".  This is in regards to performance, 
>> checksumming, and interrupts.  Relevant code changes that introduced the two 
>> methods are here:
>> https://jira.hpdd.intel.com/browse/LU-2445
>> https://review.whamcloud.com/#/c/5620/
>> 
>> The quick testing I have done is with a 8GB file with stripe count of 4, and 
>> included the patch to lfs_migrate from:
>> https://review.whamcloud.com/#/c/20621/
>> (and client cache was dropped between each test)
>> 
>> $ time ./lfs_migrate -y bigfile
>> real1m13.643s
>> 
>> $ time ./lfs_migrate -y -s bigfile
>> real1m13.194s
>> 
>> $ time ./lfs_migrate -y -f bigfile
>> real0m31.791s
>> 
>> $ time ./lfs_migrate -y -f -s bigfile
>> real0m28.020s
>> 
>> * Performance:  The migrate runs faster when forcing rsync (assuming 
>> multiple stripes).  There is also minimal performance benefit to skipping 
>> the checksum with the rsync method.  Interestingly, performance with "lfs 
>> migrate" as the backend is barely effected (and within the noise when I ran 
>> multiple tests) by the choice of checksumming or not.  So, my question is 
>> whether there is some serialization going on with the layout swap method 
>> which causes it to be slower?
>> 
>> * Checksums:  In reading the migrate code in lfs.c, it is not obvious to me 
>> that there is any checksumming done at all for "lfs migrate".  That would 
>> explain why there is minimal performance difference.  How is data integrity 
>> ensured with this method?  Does the file data version somehow capture the 
>> checksum too?
>> 
>> * Interrupts:  If the rsync method is interrupted (kill -9, or client 
>> reboot) then a ".tmp.XX" file is left.  This is reasonably easy to 
>> search for and clean up.  With the lfs migrate layout swap method, what 
>> happens to the "volatile file" and it's objects?  Is an lfsck required in 
>> order to clean up the objects?
>> 
>> At this point, the "old" method seems preferable.  Are there other benefits 
>> to using the lfs migrate layout swap method that I'm missing?
>> 
>> Thanks for any clarifications or other suggestions!
>> 
>> -Nathan
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] strange time of reading for large file

2017-11-24 Thread Dilger, Andreas
On Nov 23, 2017, at 11:52, Rosana Guimaraes Ribeiro 
 wrote:
> 
> Hi,
> 
> I have some doubts about Lustre, I already sent my issues to forums but no 
> one answer me.
> In our application, during the performance testing on lustre 2.4.2 we got 
> times of reading and writing to test I/O operations with a file of almost 
> 400GB. 
> Running this application a lot of times, consecutively, we see that in write 
> operations, I/O time remains in the same range, but in read operations there is a 
> huge difference in time.

I would really suggest to upgrade to a newer version of Lustre.  There have
definitely been a lot of IO performance improvements since 2.4.2.

> As you can see below:
> Write time [sec]:
> 325.77
> 318.80
> 325.44
> 458.54
> 316.89
> 327.75
> 344.90
> 340.34
> 383.57
> 316.35
> Read time [sec]:
> 570.48
> 601.11
> 447.14
> 406.39
> 480.44
> 5824.40
> 299.40
> 293.54
> 1049.93
> 4190.47
> We ran on the single client with 1 process and tested on same infrastructure 
> (hardware and network).
> Could you explain why the read time varies so much? What kind of problem 
> might be occurring?
> 
> Regards,
> Rosana
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ost-survey hangs on lustre-2.10.0 client when using different size values

2017-11-24 Thread Dilger, Andreas
On Nov 23, 2017, at 18:02, Jae-Hyuck Kwak  wrote:
> 
> Hi, I'm newbie on lustre.
> 
> I am using Lustre-2.10.0. When I use ost-survey with default -s value, 
> it works well. But when I change the -s value, it hangs at the read step.
> (see below)
> 
> ost-survey seems to change max_cached_mb to 256 * system page size 
> in MB which is 16 in our lustre environment.
> 
> I changed this value to a larger value and it works well.
> 
> I think minimum max_cached_mb value for ost-survey has something wrong.
> 
> Do you have any comments or something?

It would be useful to get stack traces and/or console messages from the
client and server after it hangs.  Best would be to file a new ticket in
Jira.

Cheers, Andreas

> 
> [root@cn11 ~]# ost-survey /lustre
> /usr/bin/ost-survey: 11/24/17 OST speed survey on /lustre from 
> 10.0.0.111@o2ib1
> Number of Active OST devices : 8
> Page Size is 4096
> write index 0 done.
> write index 1 done.
> write index 2 done.
> write index 3 done.
> write index 4 done.
> write index 5 done.
> write index 6 done.
> write index 7 done.
> read index 0 done.
> read index 1 done.
> read index 2 done.
> read index 3 done.
> read index 4 done.
> read index 5 done.
> read index 6 done.
> read index 7 done.
> Worst  Read OST indx: 0 speed: 544.158868
> Best   Read OST indx: 7 speed: 745.733589
> Read Average: 642.827346 +/- 63.038560 MB/s
> Worst  Write OST indx: 2 speed: 165.359455
> Best   Write OST indx: 0 speed: 547.385382
> Write Average: 284.413980 +/- 118.452906 MB/s
> Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
> 
> 0 544.159   547.3850.055  0.055
> 1 597.003   245.3470.050  0.122
> 2 622.987   165.3590.048  0.181
> 3 648.340   172.6480.046  0.174
> 4 730.477   384.7880.041  0.078
> 5 607.521   218.6560.049  0.137
> 6 646.398   262.8120.046  0.114
> 7 745.734   278.3170.040  0.108
> [root@cn11 ~]# ost-survey -s 10 /lustre
> /usr/bin/ost-survey: 11/24/17 OST speed survey on /lustre from 
> 10.0.0.111@o2ib1
> Number of Active OST devices : 8
> Page Size is 4096
> write index 0 done.
> write index 1 done.
> write index 2 done.
> write index 3 done.
> write index 4 done.
> write index 5 done.
> write index 6 done.
> write index 7 done.
> read index 0 done.
> read index 1 done.
> read index 2 done.
> read index 3 done.
> read index 4 done.
> read index 5 done.
> read index 6 done.
> read index 7 done.
> Worst  Read OST indx: 4 speed: 323.487301
> Best   Read OST indx: 3 speed: 425.770117
> Read Average: 378.171698 +/- 32.609314 MB/s
> Worst  Write OST indx: 5 speed: 142.140286
> Best   Write OST indx: 0 speed: 361.154509
> Write Average: 248.073472 +/- 75.279234 MB/s
> Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
> 
> 0 335.843   361.1550.030  0.028
> 1 386.369   244.2610.026  0.041
> 2 396.778   214.6150.025  0.047
> 3 425.770   158.5090.023  0.063
> 4 323.487   330.9270.031  0.030
> 5 364.589   142.1400.027  0.070
> 6 386.113   314.5920.026  0.032
> 7 406.425   218.3880.025  0.046
> [root@cn11 ~]# ost-survey -s 100 /lustre
> /usr/bin/ost-survey: 11/24/17 OST speed survey on /lustre from 
> 10.0.0.111@o2ib1
> Number of Active OST devices : 8
> Page Size is 4096
> write index 0 done.
> write index 1 done.
> write index 2 done.
> write index 3 done.
> write index 4 done.
> write index 5 done.
> write index 6 done.
> write index 7 done.
> (hang)
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfs_migrate rsync vs. lfs migrate and layout swap

2017-11-19 Thread Dilger, Andreas
It would be interesting to strace your rsync vs. "lfs migrate" read/write 
patterns so that the copy method of "lfs migrate" can be improved to match 
rsync. Since they are both userspace copy actions they should be about the same 
performance. It may be that "lfs migrate" is using O_DIRECT to minimize client 
cache pollution (I don't have the code handy to check right now).  In the 
future we could use "copyfile()" to avoid this as well. 
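
For example, something along these lines would capture both paths for comparison
(the options mirror the test runs quoted below; adjust as needed):

client$ strace -f -tt -e trace=open,openat,read,write -o migrate.strace \
        lfs migrate -c 4 bigfile
client$ strace -f -tt -e trace=open,openat,read,write -o rsync.strace \
        lfs_migrate -y -f bigfile
# compare the I/O sizes and any O_DIRECT flags on open/openat in the two traces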

The main benefit of migrate is that it keeps the open file handles and inode 
number on the MDS. Using rsync is just a copy+rename, which is why it is not 
safe for in-use files. 

There is no need to clean up volatile files, they are essentially open-unlinked 
files, so they clean up automatically if the program or client crash. 

Cheers, Andreas

> On Nov 19, 2017, at 11:31, Dauchy, Nathan (ARC-TNC)[CSRA, LLC] 
>  wrote:
> 
> Greetings,
> 
> I'm trying to clarify and confirm the differences between lfs_migrate's use 
> of rsync vs. "lfs migrate".  This is in regards to performance, checksumming, 
> and interrupts.  Relevant code changes that introduced the two methods are 
> here:
> https://jira.hpdd.intel.com/browse/LU-2445
> https://review.whamcloud.com/#/c/5620/
> 
> The quick testing I have done is with a 8GB file with stripe count of 4, and 
> included the patch to lfs_migrate from:
> https://review.whamcloud.com/#/c/20621/
> (and client cache was dropped between each test)
> 
> $ time ./lfs_migrate -y bigfile
> real1m13.643s
> 
> $ time ./lfs_migrate -y -s bigfile
> real1m13.194s
> 
> $ time ./lfs_migrate -y -f bigfile
> real0m31.791s
> 
> $ time ./lfs_migrate -y -f -s bigfile
> real0m28.020s
> 
> * Performance:  The migrate runs faster when forcing rsync (assuming multiple 
> stripes).  There is also minimal performance benefit to skipping the checksum 
> with the rsync method.  Interestingly, performance with "lfs migrate" as the 
> backend is barely effected (and within the noise when I ran multiple tests) 
> by the choice of checksumming or not.  So, my question is whether there is 
> some serialization going on with the layout swap method which causes it to be 
> slower?
> 
> * Checksums:  In reading the migrate code in lfs.c, it is not obvious to me 
> that there is any checksumming done at all for "lfs migrate".  That would 
> explain why there is minimal performance difference.  How is data integrity 
> ensured with this method?  Does the file data version somehow capture the 
> checksum too?
> 
> * Interrupts:  If the rsync method is interrupted (kill -9, or client reboot) 
> then a ".tmp.XX" file is left.  This is reasonably easy to search for and 
> clean up.  With the lfs migrate layout swap method, what happens to the 
> "volatile file" and it's objects?  Is an lfsck required in order to clean up 
> the objects?
> 
> At this point, the "old" method seems preferable.  Are there other benefits 
> to using the lfs migrate layout swap method that I'm missing?
> 
> Thanks for any clarifications or other suggestions!
> 
> -Nathan
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Announce: Lustre Systems Administration Guide

2017-11-17 Thread Dilger, Andreas
On Nov 16, 2017, at 22:41, Cowe, Malcolm J  wrote:
> 
> I am pleased to announce the availability of a new systems administration 
> guide for the Lustre file system, which has been published to 
> wiki.lustre.org. The content can be accessed directly from the front page of 
> the wiki, or from the following URL:
>  
> http://wiki.lustre.org/Category:Lustre_Systems_Administration
>  
> The guide is intended to provide comprehensive instructions for the 
> installation and configuration of production-ready Lustre storage clusters. 
> Topics covered:
>  
>   • Introduction to Lustre
>   • Lustre File System Components
>   • Lustre Software Installation
>   • Lustre Networking (LNet)
>   • LNet Router Configuration
>   • Lustre Object Storage Devices (OSDs)
>   • Creating Lustre File System Services
>   • Mounting a Lustre File System on Client Nodes
>   • Starting and Stopping Lustre Services
>   • Lustre High Availability
>  
> Refer to the front page of the guide for the complete table of contents.

Malcolm,
thanks so much for your work on this.  It is definitely improving the
state of the documentation available today.

I was wondering if people have an opinion on whether we should remove
some/all of the administration content from the Lustre Operations Manual,
and make that more of a reference manual that contains details of
commands, architecture, features, etc. as a second-level reference from
the wiki admin guide?

For that matter, should we export the XML Manual into the wiki and
leave it there?  We'd have to make sure that the wiki is being indexed
by Google for easier searching before we could do that.

Cheers, Andreas

> In addition, for people who are new to Lustre, there is a high-level 
> introduction to Lustre concepts, available as a PDF download:
>  
> http://wiki.lustre.org/images/6/64/LustreArchitecture-v4.pdf
>  
>  
> Malcolm Cowe
> High Performance Data Division
>  
> Intel Corporation | www.intel.com
>  
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Dependency errors with Lustre 2.10.1 packages

2017-11-16 Thread Dilger, Andreas
On Nov 15, 2017, at 11:40, Michael Watters  wrote:
> 
> I am attempting to install lustre packages on a new OST node running
> CentOS 7.4.1708 and it appears that there is a broken dependency in the
> rpm packages.  Attempting to install the lustre package results in an
> error as shown below.
> 
> [root@lustre-ost03 ~]# yum install lustre
> Loaded plugins: fastestmirror, versionlock
> Loading mirror speeds from cached hostfile
> Resolving Dependencies
> --> Running transaction check
> ---> Package lustre.x86_64 0:2.10.1-1.el7 will be installed
> --> Processing Dependency: kmod-lustre = 2.10.1 for package: 
> lustre-2.10.1-1.el7.x86_64
> --> Processing Dependency: lustre-osd for package: lustre-2.10.1-1.el7.x86_64
> --> Processing Dependency: lustre-osd-mount for package: 
> lustre-2.10.1-1.el7.x86_64
> --> Processing Dependency: libyaml-0.so.2()(64bit) for package: 
> lustre-2.10.1-1.el7.x86_64
> --> Running transaction check
> ---> Package kmod-lustre.x86_64 0:2.10.1-1.el7 will be installed
> ---> Package kmod-lustre-osd-ldiskfs.x86_64 0:2.10.1-1.el7 will be installed
> --> Processing Dependency: ldiskfsprogs >= 1.42.7.wc1 for package: 
> kmod-lustre-osd-ldiskfs-2.10.1-1.el7.x86_64
> ---> Package libyaml.x86_64 0:0.1.4-11.el7_0 will be installed
> ---> Package lustre-osd-ldiskfs-mount.x86_64 0:2.10.1-1.el7 will be installed
> --> Finished Dependency Resolution
> Error: Package: kmod-lustre-osd-ldiskfs-2.10.1-1.el7.x86_64 (lustre)
>Requires: ldiskfsprogs >= 1.42.7.wc1
>  You could try using --skip-broken to work around the problem
> 
> I've checked the repos and don't see a package for ldiskfsprogs at all. 
> Does anybody know how to resolve this?

You should install e2fsprogs-1.42.13.wc6 to provide the ldiskfsprogs 
dependency.  That is our hook to install a Lustre-aware version of e2fsprogs, 
since there are features not available in the vanilla e2fsprogs.
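
For example, assuming the Lustre-patched e2fsprogs RPMs for el7 have already been
downloaded into the current directory (exact package names may differ):

[root@lustre-ost03 ~]# yum localinstall e2fsprogs-1.42.13.wc6-*.rpm \
    e2fsprogs-libs-1.42.13.wc6-*.rpm libcom_err-1.42.13.wc6-*.rpm libss-1.42.13.wc6-*.rpm
[root@lustre-ost03 ~]# yum install lustre    # the ldiskfsprogs dependency is now satisfied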

The good news is that several of the Lustre features are being merged into 
upstream ext4/e2fsprogs (large_dir and xattr_inode landed, dirdata under 
review) so there may be a day when we can use vanilla e2fsprogs, but that day 
isn't here yet.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] how to set max_pages_per_rpc (I have done something wrong and need help)

2017-11-15 Thread Dilger, Andreas
On Nov 15, 2017, at 12:56, Harald van Pee <p...@hiskp.uni-bonn.de> wrote:
> 
> Hello Andreas,
> 
> thanks for your information, now I have the feeling I'm not completely lost.
> With erasing configuration parameters do you mean the
> writeconf procedure? (chapter 14.4)

Yes.

> Or is it possible to erase the unknown parameter?

You could try "lctl conf_param -d <parameter_name>" to delete the parameter.
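
On the MGS that would look something like the following (the parameter name below
is a guess based on the log messages above; use whatever name actually appears in
the hiskp3-client configuration log, and repeat for each OST that was touched):

mgs# lctl conf_param -d hiskp3-OST0000-osc.osc.max_pages_per_rpc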

Cheers, Andreas

> On Wednesday 15 November 2017 20:37:20 Dilger, Andreas wrote:
>> The problem that Lustre clients fail to mount when they get an unknown
>> parameter is fixed in newer Lustre releases (2.9+) via patch
>> https://review.whamcloud.com/21112 .
>> 
>> The current maintenance release is 2.10.1.
>> 
>> You could also work around this by erasing the configuration parameters
>> (see Lustre manual).
>> 
>> Cheers, Andreas
>> 
>> On Nov 15, 2017, at 09:26, Harald van Pee <p...@hiskp.uni-bonn.de> wrote:
>> 
>> Here are more informations:
>> 
>> if I try to mount the filesystem on the client I get similar messages as
>> from the failing conf_param command. It seems one have to remove this
>> failed configuration but how?
>> Here the syslog output on the client:
>> 
>> kernel: [ 4203.506437] LustreError: 3698:0:
>> (obd_mount.c:1340:lustre_fill_super()) Unable to mount  (-2)
>> kernel: [ 5028.547095] LustreError: 3830:0:
>> (obd_config.c:1202:class_process_config()) no device for:
>> hiskp3-OST-osc- 880416680800
>> kernel: [ 5028.547105] LustreError: 3830:0:
>> (obd_config.c:1606:class_config_llog_handler()) MGC192.168.128.200@o2ib:
>> cfg command failed: rc = -22
>> kernel: [ 5028.547112] Lustre:cmd=cf00f 0:hiskp3-OST-osc
>> 1:osc.max_pages_per_rpc=256
>> kernel: [ 5028.547112]
>> kernel: [ 5028.547156] LustreError: 15b-f: MGC192.168.128.200@o2ib: The
>> configuration from log 'hiskp3-client'failed from the MGS (-22).  Make sure
>> this client and the MGS are running compatible versions of Lustre.
>> kernel: [ 5028.547407] LustreError: 1680:0:(lov_obd.c:946:lov_cleanup())
>> hiskp3-clilov-880416680800: lov tgt 1 not cleaned! deathrow=0, lovrc=1
>> kernel: [ 5028.547415] LustreError: 1680:0:(lov_obd.c:946:lov_cleanup())
>> Skipped 3 previous similar messages
>> kernel: [ 5028.550906] Lustre: Unmounted hiskp3-client
>> kernel: [ 5028.551407] LustreError: 3815:0:
>> (obd_mount.c:1340:lustre_fill_super()) Unable to mount  (-22)
>> 
>> 
>> 
>> On Wednesday 15 November 2017 16:06:29 Harald van Pee wrote:
>> Dear all,
>> 
>> I want to set max_pages_per_rpc to 64 instead of 256
>> lustre mgs/mdt version 2.53
>> lustre oss version 2.53
>> lustre client 2.6
>> 
>> on client I have done:
>> lctl get_param osc.hiskp3-OST*.max_pages_per_rpc
>> osc.hiskp3-OST0001-osc-88105dba4800.max_pages_per_rpc=256
>> osc.hiskp3-OST0002-osc-88105dba4800.max_pages_per_rpc=256
>> osc.hiskp3-OST0003-osc-88105dba4800.max_pages_per_rpc=256
>> osc.hiskp3-OST0004-osc-88105dba4800.max_pages_per_rpc=256
>> lctl set_param osc.hiskp3-OST*.max_pages_per_rpc=64
>> 
>> this works, but after remount I get again 256 therefore I want to make it
>> permant with
>> lctl conf_param hiskp3-OST*.osc.max_pages_per_rpc=64
>> 
>> But I get the message, that this command have to be given on mdt
>> unfortunately I go to our combined mgs/mdt and get
>> 
>> Lustre: Setting parameter hiskp3-OST-osc.osc.max_pages_per_rpc in log
>> hiskp3-client
>> LustreError: 956:0:(obd_config.c:1221:class_process_config()) no device
>> for: hiskp3-OST-osc-MDT
>> LustreError: 956:0:(obd_config.c:1591:class_config_llog_handler())
>> MGC192.168.128.200@o2ib: cfg command failed: rc = -22
>> Lustre:cmd=cf00f 0:hiskp3-OST-osc-MDT
>> 1:osc.max_pages_per_rpc=64
>> 
>> than I can not mount client and want to go back
>> lctl set_param osc.hiskp3-OST*.max_pages_per_rpc=64
>> 
>> Lustre: Modifying parameter hiskp3-OST-osc.osc.max_pages_per_rpc in log
>> hiskp3-client
>> Lustre: Skipped 1 previous similar message
>> LustreError: 966:0:(obd_config.c:1221:class_process_config()) no device
>> for: hiskp3-OST-osc-MDT
>> LustreError: 966:0:(obd_config.c:1591:class_config_llog_handler())
>> MGC192.168.128.200@o2ib: cfg command failed: rc = -22
>> Lustre:cmd=cf00f 0:hiskp3-OST-osc-MDT
>> 1:osc.max_pages_per_rpc=256
>> 
obviously what I have done was completely wrong and I can no longer mount a
client, mounted clients are working.

Re: [lustre-discuss] how to set max_pages_per_rpc (I have done something wrong and need help)

2017-11-15 Thread Dilger, Andreas
The problem that Lustre clients fail to mount when they get an unknown 
parameter is fixed in newer Lustre releases (2.9+) via patch 
https://review.whamcloud.com/21112 .

The current maintenance release is 2.10.1.

You could also work around this by erasing the configuration parameters (see 
Lustre manual).
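
Roughly, the writeconf procedure looks like the sketch below (device paths are
placeholders; all clients and servers must be stopped first, and the MGS/MDT is
regenerated before the OSTs):

mds# umount /mnt/mdt                         # after all clients and OSTs are stopped
mds# tunefs.lustre --writeconf /dev/mdtdev
oss# tunefs.lustre --writeconf /dev/ostdev   # repeat for every OST device
mds# mount -t lustre /dev/mdtdev /mnt/mdt    # remount the MDT first, then OSTs, then clients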

Cheers, Andreas

On Nov 15, 2017, at 09:26, Harald van Pee <p...@hiskp.uni-bonn.de> wrote:

Here are more informations:

if I try to mount the filesystem on the client I get similar messages as from
the failing conf_param command. It seems one have to remove this failed
configuration but how?
Here the syslog output on the client:

kernel: [ 4203.506437] LustreError: 3698:0:
(obd_mount.c:1340:lustre_fill_super()) Unable to mount  (-2)
kernel: [ 5028.547095] LustreError: 3830:0:
(obd_config.c:1202:class_process_config()) no device for: hiskp3-OST-osc-
880416680800
kernel: [ 5028.547105] LustreError: 3830:0:
(obd_config.c:1606:class_config_llog_handler()) MGC192.168.128.200@o2ib: cfg
command failed: rc = -22
kernel: [ 5028.547112] Lustre:cmd=cf00f 0:hiskp3-OST-osc
1:osc.max_pages_per_rpc=256
kernel: [ 5028.547112]
kernel: [ 5028.547156] LustreError: 15b-f: MGC192.168.128.200@o2ib: The
configuration from log 'hiskp3-client'failed from the MGS (-22).  Make sure
this client and the MGS are running compatible versions of Lustre.
kernel: [ 5028.547407] LustreError: 1680:0:(lov_obd.c:946:lov_cleanup())
hiskp3-clilov-880416680800: lov tgt 1 not cleaned! deathrow=0, lovrc=1
kernel: [ 5028.547415] LustreError: 1680:0:(lov_obd.c:946:lov_cleanup())
Skipped 3 previous similar messages
kernel: [ 5028.550906] Lustre: Unmounted hiskp3-client
kernel: [ 5028.551407] LustreError: 3815:0:
(obd_mount.c:1340:lustre_fill_super()) Unable to mount  (-22)



On Wednesday 15 November 2017 16:06:29 Harald van Pee wrote:
Dear all,

I want to set max_pages_per_rpc to 64 instead of 256
lustre mgs/mdt version 2.53
lustre oss version 2.53
lustre client 2.6

on client I have done:
lctl get_param osc.hiskp3-OST*.max_pages_per_rpc
osc.hiskp3-OST0001-osc-88105dba4800.max_pages_per_rpc=256
osc.hiskp3-OST0002-osc-88105dba4800.max_pages_per_rpc=256
osc.hiskp3-OST0003-osc-88105dba4800.max_pages_per_rpc=256
osc.hiskp3-OST0004-osc-88105dba4800.max_pages_per_rpc=256
lctl set_param osc.hiskp3-OST*.max_pages_per_rpc=64

this works, but after remount I get again 256 therefore I want to make it
permant with
lctl conf_param hiskp3-OST*.osc.max_pages_per_rpc=64

But I get the message, that this command have to be given on mdt
unfortunately I go to our combined mgs/mdt and get

Lustre: Setting parameter hiskp3-OST-osc.osc.max_pages_per_rpc in log
hiskp3-client
LustreError: 956:0:(obd_config.c:1221:class_process_config()) no device
for: hiskp3-OST-osc-MDT
LustreError: 956:0:(obd_config.c:1591:class_config_llog_handler())
MGC192.168.128.200@o2ib: cfg command failed: rc = -22
Lustre:cmd=cf00f 0:hiskp3-OST-osc-MDT
1:osc.max_pages_per_rpc=64

then I can not mount a client and want to go back
lctl set_param osc.hiskp3-OST*.max_pages_per_rpc=64

Lustre: Modifying parameter hiskp3-OST-osc.osc.max_pages_per_rpc in log
hiskp3-client
Lustre: Skipped 1 previous similar message
LustreError: 966:0:(obd_config.c:1221:class_process_config()) no device
for: hiskp3-OST-osc-MDT
LustreError: 966:0:(obd_config.c:1591:class_config_llog_handler())
MGC192.168.128.200@o2ib: cfg command failed: rc = -22
Lustre:cmd=cf00f 0:hiskp3-OST-osc-MDT
1:osc.max_pages_per_rpc=256

obviously what I have done was completely wrong and I can no longer mount a
client, mounted clients are working.
How can I get it back working?
hiskp3-MDT ist the label of the mgs/mdt but hiskp3-OST-osc-MDT
seems to be incorrect

What do I have to do to get the mgs/mdt working again?
It's our production cluster.
Any help is welcome

Best
Harald






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: p...@hiskp.uni-bonn.de
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lctl ping issue on 2.7.19

2017-11-11 Thread Dilger, Andreas
I'm no LNet expert, but I do know your o2iblnd module settings need to match on 
client and server (see /etc/modprobe.d/). 
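For example, a quick way to compare the values actually in effect (this only reads the
standard module parameter files under sysfs):

  # run on both client and server, then diff the two outputs
  grep . /sys/module/ko2iblnd/parameters/*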

There is work being done to make this more flexible, but it's not finished yet. 

Cheers, Andreas

> On Nov 11, 2017, at 08:37, john casu  wrote:
> 
> Have a bizarre issue, where I can lctl ping between clients and lctl ping 
> between servers
> but cannot ping between a client and a server.
> 
> on my server, I see the following error that seems to explain the issue:
> Nov 10 09:19:01 mds1 kernel: LNet: 
> 328:0:(o2iblnd_cb.c:2343:kiblnd_passive_connect()) Can't accept conn from 
> 10.55.10.11@o2ib (version 12): max_frags 256 too large (32 wanted)
> 
> Does anyone have any advice on how to simply resolve this?
> 
> I'm running 3.10.0-514.10.2.el7_lustre.x86_64 on the server (centos 7.2 base) 
> and
> 3.10.0-514.10.2.el7_lustre.x86_64 on the client (centos 7.3 base)
> also using built in IB drivers.
> 
> thanks,
> -john c.
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] obsolete Lustre file-joining feature

2017-11-09 Thread Dilger, Andreas
On Nov 9, 2017, at 14:05, teng wang  wrote:
> 
> Hi,
> 
> Seems like "lfs join" command is no longer supported in the version 
> 2.7.2.25. Is there any alternative for this feature? For example, is there 
> any user-level Lustre API for this feature?

The "file join" feature was only experimental at best, and has not been 
available for years.  In Lustre 2.10 the "composite layout" feature
implements similar (though not identical) functionality.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Does lustre 2.10 client support 2.5 server ?

2017-11-09 Thread Dilger, Andreas
On Nov 9, 2017, at 05:27, Andrew Elwell  wrote:
> 
>> My Lustre server is running the version 2.5 and I want to use 2.10 client.
>> Is this combination supported ? Is there anything that I need to be aware of
> 
> 2 of our storage appliances (sonnexion 1600 based) run 2.5.1, I've
> mounted this OK on infiniband clients fine with 2.10.0 and 2.10.1 OK,
> but a colleague has since had to downgrade some of our clients to
> 2.9.0 on OPA / KNL hosts as we were seeing strange issues (can't
> remember the ticket details)

If people are having problems like this, it would be useful to know the
details.  If you are using a non-Intel release, you should go through your
support provider, since they know the most details about what patches are
in their release.

> We do see the warnings at startup:
> Lustre: Server MGS version (2.5.1.0) is much older than client.
> Consider upgrading server (2.10.0)

This is a standard message if your client/server versions are more than 0.4 
releases apart.  While Lustre clients and servers negotiate the supported 
features between them at connection time, so there should be broad 
interoperability between releases, we only test between the latest major 
releases (e.g. 2.5.x and 2.7.y, or 2.7.y and 2.10.z), it isn't possible to test 
interoperability between every release.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 1 MDS and 1 OSS

2017-10-30 Thread Dilger, Andreas
On Oct 31, 2017, at 07:35, Andrew Elwell <andrew.elw...@gmail.com> wrote:
> 
> 
> 
> On 31 Oct. 2017 07:20, "Dilger, Andreas" <andreas.dil...@intel.com> wrote:
>> 
>> Having a larger MDT isn't bad if you plan future expansion.  That said, you 
>> would get better performance over FDR if you used SSDs for the MDT rather 
>> than HDDs (if you aren't already planning this), and for a single OSS you 
>> probably don't need the extra MDT capacity.  With both ldiskfs+LVM and ZFS 
>> you can also expand the MDT size in the future if you need more capacity.
> 
> Can someone with wiki editing rights summarise the advantages of different 
> hardware combinations? For example I remember Daniel @ NCI had some nice 
> comments about which components (MDS v OSS) benefited from faster cores over 
> thread count and where more RAM was important.
> 
> I feel this would be useful for people building small test systems and 
> comparing vendor responses for large tenders.

Everyone has wiki editing rights, you just need to register...

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 1 MDS and 1 OSS

2017-10-30 Thread Dilger, Andreas
On Oct 31, 2017, at 05:46, Mohr Jr, Richard Frank (Rick Mohr)  
wrote:
> 
>> On Oct 30, 2017, at 4:46 PM, Brian Andrus  wrote:
>> 
>> Someone please correct me if I am wrong, but that seems a bit large of an 
>> MDT. Of course drives these days are pretty good sized, so the extra is 
>> probably very inexpensive.
> 
> That probably depends on what the primary usage will be.  If the applications 
> create lots of small files (like some biomed programs), then a larger MDT 
> would result in more inodes allowing more Lustre files to be created.

With mirroring the MDT ends up as ~2.4TB (about 1.2B files for ldiskfs, 600M 
files for ZFS), which gives a minimum average file size of 120TB/1.2B = 100KB 
on the OSTs (200KB for ZFS).  That said, by default you won't be able to create 
so many files on the OSTs unless you reduce the inode ratio for ldiskfs at 
format time, or use ZFS (which doesn't have a fixed inode count, but uses twice 
as much space per inode on the MDT). 
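For example, a sketch of reducing the inode ratio at format time (fsname, index, device and
ratio are all hypothetical; -i is the ldiskfs bytes-per-inode value passed through to mke2fs):

  # one inode per 64KB of OST space instead of the larger default
  mkfs.lustre --ost --fsname=testfs --index=0 --mkfsoptions="-i 65536" /dev/sdX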

Having a larger MDT isn't bad if you plan future expansion.  That said, you 
would get better performance over FDR if you used SSDs for the MDT rather than 
HDDs (if you aren't already planning this), and for a single OSS you probably 
don't need the extra MDT capacity.  With both ldiskfs+LVM and ZFS you can also 
expand the MDT size in the future if you need more capacity.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre routing help needed

2017-10-30 Thread Dilger, Andreas
The 2.10 release added support for multi-rail LNet, which may potentially be 
causing problems here. I would suggest to install an older LNet version on your 
routers to match your client/server.

You may need to build your own RPMs for your new kernel, but can use 
--disable-server for configure to simplify things.
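For example, a rough sketch of a client-only RPM build (assuming the lustre-release source
tree is already unpacked and the kernel-devel package for your kernel is installed):

  cd lustre-release
  sh ./autogen.sh
  ./configure --disable-server
  make rpms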

Cheers, Andreas

On Oct 31, 2017, at 04:45, Kevin M. Hildebrand 
> wrote:

Thanks, I completely missed that.  Indeed the ko2iblnd parameters were 
different between the servers and the router.  I've updated the parameters on 
the router to match those on the server, and things haven't gotten any better.  
(The problem appears to be on the Ethernet side anyway, so you've probably 
helped me fix a problem I didn't know I had...)
I don't see much discussion about configuring lnet parameters for Ethernet 
networks, I assume that's using ksocklnd.  On that side, it appears that all of 
the ksocklnd parameters match between the router and clients.  Interesting that 
peer_timeout is 180, which is almost exactly when my client gets marked down on 
the router.

Server (and now router) ko2iblnd parameters:
peer_credits 8
peer_credits_hiw 4
credits 256
concurrent_sends 8
ntx 512
map_on_demand 0
fmr_pool_size 512
fmr_flush_trigger 384
fmr_cache 1

Client and router ksocklnd:
peer_timeout 180
peer_credits 8
keepalive 30
sock_timeout 50
credits 256
rx_buffer_size 0
tx_buffer_size 0
keepalive_idle 30
round_robin 1
sock_timeout 50

Thanks,
Kevin


On Mon, Oct 30, 2017 at 4:16 PM, Mohr Jr, Richard Frank (Rick Mohr) 
> wrote:

> On Oct 30, 2017, at 8:47 AM, Kevin M. Hildebrand 
> > wrote:
>
> All of the hosts (client, server, router) have the following in ko2iblnd.conf:
>
> alias ko2iblnd-opa ko2iblnd
> options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 
> concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
> fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
>
> install ko2iblnd /usr/sbin/ko2iblnd-probe

Those parameters will only get applied to omnipath interfaces (which you don’t 
have), so everything you have should just be running with default parameters.  
Since your lnet routers have a different version of lustre than your 
servers/clients, it might be possible that the default values for the ko2iblnd 
parameters are different between the two versions.  You can always check this 
by looking at the values in the files under /sys/module/ko2iblnd/parameters.  
It might be worthwhile to compare those values on the lnet routers to the 
values on the servers to see if maybe there is a difference that could affect 
the behavior.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 1 MDS and 1 OSS

2017-10-30 Thread Dilger, Andreas
First, to answer Amjad's question - the number of OSS nodes you have depends
on the capacity and performance you need.  For 120TB of total storage (assume 
30x4TB drives, or 20x6TB drives) a single OSS is definitely capable of 
handling this many drives.  I'd also assume you are using 10Gb Ethernet (~= 
1GB/s), which a single OSS should be able to saturate (at either 40MB/s or 
60MB/s per data drive with RAID-6 8+2 LUNs).  If you want more capacity or 
bandwidth, you can add more OSS nodes now or in the future.

As Ravi mentioned, with a single OSS and MDS, you will need to reboot the 
single server in case of failures instead of having automatic failover, but for 
some systems this is fine.

Finally, as for whether Lustre on a single MDS+OSS is better than running NFS 
on a single server, that depends mostly on the application workload.  NFS is 
easier to administer than Lustre, and will provide better small file 
performance than Lustre.  NFS also has the benefit that it works with every 
client available.

Interestingly, there are some workloads that users have reported to us where a 
single Lustre OSS will perform better than NFS, because Lustre does proper data 
locking/caching, while NFS has only close-to-open consistency semantics, and 
cannot cache data on the client for a long time.  Any workloads where there are 
multiple writers/readers to the same file will just not function properly with 
NFS.  Lustre will handle a large number of clients better than NFS.  For 
streaming IO loads, Lustre is better able to saturate the network (though for 
slower networks this doesn't really make much difference).  Lustre can drive 
faster networks (e.g. IB) much better with LNet than NFS with IPoIB.

And of course, if you think your performance/capacity needs will increase in 
the future, then Lustre can easily scale to virtually any size and performance 
you need, while NFS will not.

In general I wouldn't necessarily recommend Lustre for a single MDS+OSS 
installation, but this depends on your workload and future plans.

Cheers, Andreas

On Oct 30, 2017, at 15:59, E.S. Rosenberg  wrote:
> 
> Maybe someone can answer this in the context of this question, is there any 
> performance gain over classic filers when you are using only a single OSS?
> 
> On Mon, Oct 30, 2017 at 9:56 AM, Ravi Konila  wrote:
> Hi Majid
>  
> It is better to go for HA for both OSS and MDS. You would need 2 nos of MDS 
> and 2 nos of OSS (identical configuration).
> Also use latest Lustre 2.10.1 release.
>  
> Regards
> Ravi Konila
>  
>  
>> From: Amjad Syed
>> Sent: Monday, October 30, 2017 1:17 PM
>> To: lustre-discuss@lists.lustre.org
>> Subject: [lustre-discuss] 1 MDS and 1 OSS
>>  
>> Hello
>> We are in process in procuring one small Lustre filesystem giving us 120 TB  
>> of storage using Lustre 2.X.
>> The vendor has proposed only 1 MDS and 1 OSS as a solution.
>> The query we have is that is this configuration enough , or we need more OSS?
>> The MDS and OSS server are identical  with regards to RAM (64 GB) and  HDD 
>> (300GB)
>>  
>> Thanks
>> Majid

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-28 Thread Dilger, Andreas
Having L2ARC on disks has no benefit at all.  It only makes sense if the L2ARC 
devices are on much faster storage (i.e. SSDs/NVMe) than the rest of the pool.  
Otherwise, the data could just be read from the disks directly.
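If you do have a faster device to spare, adding it as L2ARC is a one-liner (pool and device
names are hypothetical):

  zpool add ost0pool cache /dev/nvme0n1   # the 'cache' vdev is the L2ARC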

Cheers, Andreas

On Oct 26, 2017, at 10:13, Mannthey, Keith  wrote:
> 
> I have seen both small and large OST work it just depends on what you want in 
> the system (Size/Performance/Manageability). Do benchmark both as they will 
> differ in overall performance some. 
> 
> L2arc read cache can help some workloads.  It takes multi reads for data to 
> be moved into the cache so standard benchmarking (IOR and other streaming 
> benchmarks) won't see much of a change.  
> 
> Thanks,
> Keith 
> 
> -Original Message-
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On 
> Behalf Of Thomas Roth
> Sent: Thursday, October 26, 2017 1:50 AM
> To: Lustre Discuss 
> Subject: Re: [lustre-discuss] ZFS-OST layout, number of OSTs
> 
> On the other hand if we gather three or four raidz2s into one zpool/OST, loss 
> of one raidz means loss of a 120-160TB OST.
> Around here, this is usually the deciding argument. (Even temporarily taking 
> down one OST for whatever repairs would take more data offline).
> 
> 
> How is the general experience with having an l2arc on additional disks?
> In my test attempts I did not see much benefit under Lustre.
> 
> With our type of hardware, we do not have room for one drive per (small) 
> zpool - if there were only one or two zpools per box, this would be possible.
> 
> Regards
> Thomas
> 
> On 10/24/2017 09:41 PM, Cory Spitz wrote:
>> It’s also worth noting that if you have small OSTs it’s much easier to bump 
>> into a full OST situation.   And specifically, if you singly stripe a file 
>> the file size is limited by the size of the OST.
>> 
>> -Cory
>> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre clients do not mount storage automatically

2017-10-24 Thread Dilger, Andreas
On Oct 24, 2017, at 07:31, Ravi Konila  wrote:
> 
> Hi Bob
>  
> I have done that. The line in fstab is
>  
> 192.168.0.50@o2ib:/lhome   /home   lustre   defaults,_netdev   0 0
>  
> If I add mount command in rc.local twice, it works..surprise..
>  
> my rc.local has
>  
> mount -t lustre 192.168.0.50@o2ib:/lhome /home
> mount -t lustre 192.168.0.50@o2ib:/lhome /home

You can add the mount option "retry=5" so that it will retry the
mount 5 times, in case the network is taking a long time to start.
The _netdev mount option _should_ handle this, not sure what the
problem is you are seeing.
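For example, the fstab entry quoted above with the retry option added (a sketch only):

  192.168.0.50@o2ib:/lhome   /home   lustre   defaults,_netdev,retry=5   0 0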

Cheers, Andreas

> Ravi Konila
> From: Bob Ball
> Sent: Tuesday, October 24, 2017 6:10 PM
> To: Ravi Konila ; Lustre Discuss
> Subject: Re: [lustre-discuss] Lustre clients do not mount storage 
> automatically
>  
> If mounting from /etc/fstab, try adding "_netdev" as a parameter.  This 
> forces the mount to wait until the network is ready.
> 
> bob
> 
> On 10/24/2017 5:58 AM, Ravi Konila wrote:
>> Hi
>> My lustre clients does not mount lustre storage automatically on reboot.
>> I tried by adding in fstab as well as in /etc/rc.local, but on reboot they 
>> don’t mount lustre file system .
>> I get the following error (/var/log/messages output)
>>  
>> Oct 24 15:17:40 headnode02 named[6296]: client 127.0.0.1#50559: error 
>> sending response: host unreachable
>> Oct 24 15:17:54 headnode02 kernel: LustreError: 15c-8: MGC192.168.0.50@o2ib: 
>> The configuration from log 'lhome-client' failed (-5). This may be the 
>> result of communication errors between this node and the MGS, a bad 
>> configuration, or other errors. See the syslog for more information.
>> Oct 24 15:17:54 headnode02 kernel: Lustre: Unmounted lhome-client
>> Oct 24 15:17:54 headnode02 kernel: LustreError: 
>> 8116:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount  (-5)
>>  
>> But later on if I mount manually with the command
>> mount -t lustre 192.168.0.50@o2ib:/lhome /home, it works fine.
>> My lustre version is 2.8 and RHEL 6.7
>>  
>> Any suggestions?
>>  
>> Regards
>>  
>> Ravi Konila
>> 
>> 
>> ___
>> lustre-discuss mailing list
>> 
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Linux users are not able to access lustre folders

2017-10-20 Thread Dilger, Andreas
On Oct 20, 2017, at 09:37, Ravi Bhat  wrote:
> 
> Hi Rick
> Thanks, I have created user (luser6) in client as well as in lustre servers. 
> But I get the same error as 
> No directory /home/luser6
> Logging in with home="/".
> 
> But now I can cd /home/luser6 manually and create files or folders.

Note that the "luser6" needs to have the same numeric UID/GID as on the clients.
Copying /etc/passwd to the MDS is the easiest way to do this (it isn't needed 
on the OSS nodes).  You probably do NOT want to copy /etc/shadow to the MDS, to 
prevent regular users from logging in there.

You can check that this is configured correctly by running on the MDS (as root):

  /usr/sbin/l_getidentity -d {uid}

to verify that "uid" is accessible from /etc/passwd (or NIS/LDAP/AD/whatever).

Cheers, Andreas

> Regards,
> Konila Ravi Bhat
> 
> On 20-Oct-2017 8:31 pm, "Mohr Jr, Richard Frank (Rick Mohr)"  
> wrote:
> 
> > On Oct 20, 2017, at 10:50 AM, Ravi Konila  wrote:
> >
> > Can you please guide me how do I do it, I mean install NIS on servers and 
> > clients?
> > Is it mandatory to setup NIS?
> >
> 
> NIS is not mandatory.  You just need a way to ensure that user accounts are 
> visible to the lustre servers.  You could also use LDAP or even just 
> /etc/passwd.   You’ll probably just want to choose whatever mechanism is used 
> on your other systems.
> 
> For the purposes of testing, you could always just create the luser1 locally 
> on each lustre server to see if things start to work.
> 
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Acceptable thresholds

2017-10-20 Thread Dilger, Andreas
The number of threads needed on the OSS nodes is largely a function of how much 
network and storage is attached.  If you have a large number of disks, and want 
to keep them all busy, then more threads are needed to keep the disks well fed 
compared to a system with a single slow LUN.

On very large OSSes this might be in the hundreds of IO threads.  With smaller 
OSSes it might only be 32-64 threads.

Run a benchmark like obdfilter-survey to see how many threads are needed to 
keep all the disks busy, then use the oss.ko module parameters to limit the OSS 
thread count.
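For example (a sketch; module and parameter names as documented in the Lustre manual, and the
count is only illustrative), in /etc/modprobe.d/lustre.conf on each OSS:

  options ost oss_num_threads=64

The I/O thread count can also be capped at runtime with something like
"lctl set_param ost.OSS.ost_io.threads_max=64".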

Cheers, Andreas

On Oct 19, 2017, at 13:26, Patrick Farrell  wrote:
> 
> Several processes per CPU core, probably?  It’s a lot.
> 
> But there’s a lot of environmental and configuration dependence here too.
> 
> Why not look at how many you have running currently when Lustre is set up and 
> set the limit to double that?  Watching process count isn’t a good way to 
> measure load anyway - it’s probably only good for watching for a fork-bomb 
> type thing, where process count goes runaway.  So why not configure to catch 
> that and otherwise don’t worry about it?
> 
> - Patrick
> 
>> From: lustre-discuss  on behalf of 
>> "E.S. Rosenberg" 
>> Date: Thursday, October 19, 2017 at 2:20 PM
>> To: "lustre-discuss@lists.lustre.org" 
>> Subject: [lustre-discuss] Acceptable thresholds
>> 
>> Hi,
>> This question is I guess not truly answerable because it is probably very 
>> specific for each environment etc. but I am still going to ask it to get a 
>> general idea.
>> 
>> We started testing monitoring using Zabbix, its' default 'too many 
>> processes' threshold is not very high, so I already raised it to 1024 but 
>> the Lustre servers are still well over even that count.
>> 
>> So what is a 'normal' process count for Lustre servers?
>> Should I assume X processes per client? What is X?
>> 
>> Thanks,
>> Eli

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre compilation error

2017-10-18 Thread Dilger, Andreas
On Oct 18, 2017, at 07:44, parag_k  wrote:
> 
> I got the source from github.

Lustre isn't hosted on GitHub (unless someone is cloning it there), so it isn't 
clear what you are compiling.

You should download sources from git://git.hpdd.intel.com/fs/lustre-release.git
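For example (the tag name is an assumption; list the available tags with "git tag"):

  git clone git://git.hpdd.intel.com/fs/lustre-release.git
  cd lustre-release
  git checkout 2.10.0   # or whichever branch/tag you need

and then build as usual (autogen.sh, configure, make rpms).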

Cheers, Andreas

> My configure line is-
> 
> ./configure --disable-client 
> --with-kernel-source-header=/usr/src/kernels/3.10.0-514.el7.x86_64/ 
> --with-o2ib=/usr/src/ofa_kernel/default/
> 
> There are two things I was trying to do.
> 
> 1)  Creating rpms from source. And error mailed below is while making 
> rpms.
> 
>  
> 
> 2)  Compiling from source which is mentioned in the attached guide.
> 
>  
> 
> I also tried by extracting the src rpm and getting tar.gz from there.
> 
>  
> 
> 
> 
> Regards,
> Parag
> 
>  Original message 
> From: Chris Horn 
> Date: 18/10/2017 10:31 am (GMT+05:30)
> To: Parag Khuraswar , 'Lustre User Discussion Mailing 
> List' 
> Subject: Re: [lustre-discuss] Lustre compilation error
> 
> It would be helpful if you provided more context. How did you acquire the 
> source? What was your configure line? Is there a set of build instructions 
> that you are following?
> 
>  
> 
> Chris Horn
> 
>  
> 
> From: lustre-discuss  on behalf of 
> Parag Khuraswar 
> Date: Tuesday, October 17, 2017 at 11:52 PM
> To: 'Lustre User Discussion Mailing List' 
> Subject: Re: [lustre-discuss] Lustre compilation error
> 
>  
> 
> Hi,
> 
>  
> 
> Does any one have any idea on below issue?
> 
>  
> 
> Regards,
> 
> Parag
> 
>  
> 
>  
> 
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On 
> Behalf Of Parag Khuraswar
> Sent: Tuesday, October , 2017 6:11 PM
> To: 'Lustre User Discussion Mailing List'
> Subject: [lustre-discuss] Lustre compilation error
> 
>  
> 
> Hi,
> 
>  
> 
> I am trying to make rpms from lustre 2.10.0 source. I get below error when I 
> run “make”
> 
>  
> 
> ==
> 
> make[4]: *** No rule to make target `fld.ko', needed by `all-am'.  Stop.
> 
> make[3]: *** [all-recursive] Error 1
> 
> make[2]: *** [all-recursive] Error 1
> 
> make[1]: *** [all] Error 2
> 
> error: Bad exit status from 
> /tmp/rpmbuild-lustre-root-Ssi5N0Xv/TMP/rpm-tmp.bKMjSO (%build)
> 
>  
> 
>  
> 
> RPM build errors:
> 
> Bad exit status from 
> /tmp/rpmbuild-lustre-root-Ssi5N0Xv/TMP/rpm-tmp.bKMjSO (%build)
> 
> make: *** [rpms] Error 1
> 
> ==
> 
>  
> 
> Regards,
> 
> Parag
> 
>  
> 
>  
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] common error during periods of slow client iO

2017-10-15 Thread Dilger, Andreas
On Oct 8, 2017, at 18:51, John White  wrote:
> 
> We get this error quite often when OSSs are providing very slow client IO and 
> I'd love to know what it means.  I'd imagine it's one of those situation 
> where Andreas says, "this is a symptom, not a cause"

Sorry, I'm not sure what this is.  At first glance I'd say some kind of network
error is causing problems during data transfer.  The "-125" error (-ECANCELED)
is returned in a few different parts in the code, but is most likely from
the LNet layer. 

> LustreError: 4043:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -125, desc 8803419d5000
> LustreError: 4041:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -125, desc 880174d7e400
> LustreError: 4041:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -125, desc 8806c5f32600
> LustreError: 4041:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -125, desc 8806c6717a00
> LustreError: 4041:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -125, desc 880297d33800
> LustreError: 4041:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -125, desc 880155e58000
> LustreError: 4044:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -5, desc 8806a8d33200
> LustreError: 4044:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -125, desc 8806c6715a00
> LustreError: 4044:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -5, desc 8801fd15cc00
> LustreError: 4044:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -5, desc 8806c6050a00
> LustreError: 4044:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -5, desc 8806b1044800
> LustreError: 4044:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -5, desc 8806ae9cd600
> LustreError: 4044:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -5, desc 8804d5011000
> LustreError: 4044:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -5, desc 880684182800
> LustreError: 4044:0:(events.c:452:server_bulk_callback()) event type 5, 
> status -5, desc 8806b103ee00
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Timestamp doesn't show nano seconds

2017-10-15 Thread Dilger, Andreas
On Oct 11, 2017, at 05:38, Biju C P  wrote:
> 
> Hi,
> 
> All the files created under lustre filesystem are not storing the nano 
> seconds in timestamp. Is it by design ? If so, what will be the reason not to 
> store the nano seconds timestamp ?
> 
> [root@localhost dir1]# stat testfile18.log 
> 
>   File: ‘testfile18.log’
> 
>   Size: 384   Blocks: 8  IO Block: 4194304 regular file
> 
> Device: 2c54f966h/743766374d  Inode: 144115206178471951  Links: 1
> 
> Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
> 
> Context: unconfined_u:object_r:admin_home_t:s0
> 
> Access: 2017-10-11 11:31:15.0 +
> 
> Modify: 2017-10-11 11:31:15.0 +
> 
> Change: 2017-10-11 11:31:15.0 +
> 
>  Birth: -


Lustre has not implemented sub-second timestamps, as the kernel and ext3
did not have support for this when it was first developed, and until now
no users have asked for it, as this would add overhead for updating
timestamps multiple times a second.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] exec start error for lustre-2.10.1_13_g2ee62fb

2017-10-13 Thread Dilger, Andreas
Could you please file a Jira ticket (and possibly a patch) to fix this, so it 
isn't forgotten. 
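For anyone else who hits this before a patch lands, the workaround Chris describes below is a
one-character edit (a sketch; keep a backup of the script):

  sed -i.bak '1s|^# !/bin/bash|#!/bin/bash|' /usr/sbin/lustre_routes_config
  systemctl restart lnet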

Cheers, Andreas

> On Oct 13, 2017, at 06:50, David Rackley  wrote:
> 
> That was it! Thanks for the help.
> 
> - Original Message -
> From: "Chris Horn" 
> To: "David Rackley" , lustre-discuss@lists.lustre.org
> Sent: Thursday, October 12, 2017 5:02:47 PM
> Subject: Re: [lustre-discuss] exec start error for lustre-2.10.1_13_g2ee62fb
> 
> Google suggests that this error message has been associated with a missing 
> “hashpling” in some cases. The lustre_routes_config script has “# 
> !/bin/bash”, and I wonder if that space before the “!” isn’t the culprit?
> 
> 
> 
> Just a guess. You might try to remove that space from the 
> lustre_routes_config script and try to restart lnet with systemctl.
> 
> 
> 
> Chris Horn
> 
> 
> 
> On 10/12/17, 3:39 PM, "lustre-discuss on behalf of David Rackley" 
>  wrote:
> 
> 
> 
>Greetings,
> 
> 
> 
>I have built lustre-2.10.1_13_g2ee62fb on 3.10.0-693.2.2.el7.x86_64 RHEL 
> Workstation release 7.4 (Maipo).
> 
> 
> 
>After installation of 
> kmod-lustre-client-2.10.1_13_g2ee62fb-1.el7.x86_64.rpm and 
> lustre-client-2.10.1_13_g2ee62fb-1.el7.x86_64.rpm the lnet startup fails. 
> 
> 
> 
>The error reported is:
> 
> 
> 
>-- Unit lnet.service has begun starting up.
> 
>Oct 12 13:21:53  kernel: libcfs: loading out-of-tree module taints kernel.
> 
>Oct 12 13:21:53  kernel: libcfs: module verification failed: signature 
> and/or required key missing - tainting kernel
> 
>Oct 12 13:21:53  kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 20, 
> npartitions: 1
> 
>Oct 12 13:21:53  kernel: alg: No test for adler32 (adler32-zlib)
> 
>Oct 12 13:21:53  kernel: alg: No test for crc32 (crc32-table)
> 
>Oct 12 13:21:54  kernel: LNet: Using FMR for registration
> 
>Oct 12 13:21:54 lctl[135556]: LNET configured
> 
>Oct 12 13:21:54  kernel: LNet: Added LNI 172.17.1.92@o2ib [8/256/0/180]
> 
>Oct 12 13:21:54  systemd[135576]: Failed at step EXEC spawning 
> /usr/sbin/lustre_routes_config: Exec format error
> 
>-- Subject: Process /usr/sbin/lustre_routes_config could not be executed
> 
>-- Defined-By: systemd
> 
>-- Support: 
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.freedesktop.org_mailman_listinfo_systemd-2Ddevel=DwIGaQ=lz9TcOasaINaaC3U7FbMev2lsutwpI4--09aP8Lu18s=YRHhjew1k3Uquj64NMeoZQ=7w3NFNrR4nh8bmAIIuotD49Y2GvHoyAo981ZUHbvgg0=kxN1qf1rXRld4fZDxONppI9l8fdxJMzBaBeyDdejaEM=
>  
> 
>-- 
> 
>-- The process /usr/sbin/lustre_routes_config could not be executed and 
> failed.
> 
>-- 
> 
>-- The error number returned by this process is 8.
> 
>Oct 12 13:21:54  systemd[1]: lnet.service: main process exited, 
> code=exited, status=203/EXEC
> 
>Oct 12 13:21:54  systemd[1]: Failed to start lnet management.
> 
>-- Subject: Unit lnet.service has failed
> 
>-- Defined-By: systemd
> 
>-- Support: 
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.freedesktop.org_mailman_listinfo_systemd-2Ddevel=DwIGaQ=lz9TcOasaINaaC3U7FbMev2lsutwpI4--09aP8Lu18s=YRHhjew1k3Uquj64NMeoZQ=7w3NFNrR4nh8bmAIIuotD49Y2GvHoyAo981ZUHbvgg0=kxN1qf1rXRld4fZDxONppI9l8fdxJMzBaBeyDdejaEM=
>  
> 
>-- 
> 
>-- Unit lnet.service has failed.
> 
>-- 
> 
>-- The result is failed.
> 
> 
> 
>Any ideas?
> 
>___
> 
>lustre-discuss mailing list
> 
>lustre-discuss@lists.lustre.org
> 
>
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.lustre.org_listinfo.cgi_lustre-2Ddiscuss-2Dlustre.org=DwIGaQ=lz9TcOasaINaaC3U7FbMev2lsutwpI4--09aP8Lu18s=YRHhjew1k3Uquj64NMeoZQ=7w3NFNrR4nh8bmAIIuotD49Y2GvHoyAo981ZUHbvgg0=opi5sgY77yi4mGI8B6DTWf48swoFn6Ifqkz1ThO763s=
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre shares

2017-10-04 Thread Dilger, Andreas
On Oct 4, 2017, at 07:23, Biju C P  wrote:
> I am very new to Lustre filesystem and my question may be very basic. Please 
> help me on this.
> 
> How do I create Lustre shares?

Could you please explain a bit more about what you want here? Do you want to 
have clients access subtrees of the filesystem or similar?

> Is there any API available to query all shares available on Lustre filesystem?

There aren't "shares" per-se, so no API is available for this.

> MGS/MDS can be configured to have multiple filesystems. What is the use case 
> ? When does the customer create multiple filesystems ?

Some (typically smaller) sites have multiple filesystems on the same server, 
mostly for reducing the cost of the server hardware while separating the 
filesystems for (partial) fault isolation or capacity/user management.

However, for larger filesystems, there is not much benefit for sharing the
same servers between multiple filesystems, since there are multiple servers
in a single filesystem to improve performance so it doesn't make sense to
split them up again.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre client can't moutn after configuring LNET with lnetctl

2017-09-30 Thread Dilger, Andreas
On Sep 29, 2017, at 07:17, Riccardo Veraldi <riccardo.vera...@cnaf.infn.it> 
wrote:
> 
> On 9/28/17 8:29 PM, Dilger, Andreas wrote:
>> Riccardo,
>> I'm not an LNet expert, but a number of LNet multi-rail fixes are landed or 
>> being worked on for Lustre 2.10.1.  You might try testing the current b2_10 
>> to see if that resolves your problems.
> You are right I might end up with that. Sorry but I did not understand
if 2.10.1 is officially out or if it is a release candidate.

2.10.1 isn't officially released because of a problem we were hitting with RHEL 
7.4 + OFED + DNE, but that has since been fixed.  In any case, the b2_10 branch 
will only get low-risk changes, and since we are at -RC1 I would expect it to 
be quite stable, and possibly better than what you are seeing now.

Conversely, if this _doesn't_ fix your problem, then it would be good to know 
about it.  We wouldn't hold up 2.10.1 for the fix I think, but it should go 
into 2.10.2 if possible.

Cheers, Andreas

>> 
>> On Sep 27, 2017, at 21:22, Riccardo Veraldi <riccardo.vera...@cnaf.infn.it> 
>> wrote:
>>> Hello.
>>> 
>>> I configure Multi-rail on my lustre environment.
>>> 
>>> MDS: 172.21.42.213@tcp
>>> OSS: 172.21.52.118@o2ib
>>>172.21.52.86@o2ib
>>> Client: 172.21.52.124@o2ib
>>>172.21.52.125@o2ib
>>> 
>>> 
>>> [root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
>>> nid                 refs  state  last  max  rtr  min  tx   min  queue
>>> 172.21.52.124@o2ib  1     NA     -1    128  128  128  128  128  0
>>> 172.21.52.125@o2ib  1     NA     -1    128  128  128  128  128  0
>>> 172.21.42.213@tcp   1     NA     -1    8    8    8    8    6    0
>>> 
>>> after configuring multi-rail I can see both infiniband interfaces peers on 
>>> the OSS and on the client side. 
>>> Anyway before multi-rail lustre client could mount the lustre FS without 
>>> problems.
>>> Now after multi-rail is set up the client cannot mount anymore the 
>>> filesystem.
>>> 
>>> When I mount lustre from the client (fstab entry):
>>> 
>>> 172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0
>>> 
>>> the file system cannot be mounted and I got these errors
>>> 
>>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842861] Lustre:
>>> 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
>>> failed due to network error: [sent 1506562126/real 1506562126] 
>>> req@8808326b2a00 x1579744801849904/t0(0)
>>> o400->
>>> drplu-OST0001-osc-88085d134800@172.21.52.86@o2ib:28/4
>>> lens
>>> 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ rc 0/-1
>>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842872] Lustre:
>>> drplu-OST0001-osc-88085d134800: Connection to drplu-OST0001 (at
>>> 172.21.52.86@o2ib) was lost; in progress operations using this service
>>> will wait for recovery to complete
>>> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.843306] Lustre:
>>> drplu-OST0001-osc-88085d134800: Connection restored to
>>> 172.21.52.86@o2ib (at 172.21.52.86@o2ib)
>>> 
>>> 
>>> the mount point appears and disappears every few seconds from "df"
>>> 
>>> I do not have a clue on how to fix. The multi rail capability is important 
>>> for me.
>>> 
>>> I have Lustre 2.10.0 both client side and server side.
>>> here is my lnet.conf on the lustre client side. The one OSS side is
>>> similar just swapped peers for o2ib net.
>>> 
>>> net:
>>>- net type: lo
>>>  local NI(s):
>>>- nid: 0@lo
>>>  status: up
>>>  statistics:
>>>  send_count: 0
>>>  recv_count: 0
>>>  drop_count: 0
>>>  tunables:
>>>  peer_timeout: 0
>>>  peer_credits: 0
>>>  peer_buffer_credits: 0
>>>  credits: 0
>>>  lnd tunables:
>>>  tcp bonding: 0
>>>  dev cpt: 0
>>>  CPT: "[0]"
>>>- net type: o2ib
>>>  local NI(s):
>>>- nid: 172.21.52.124@o2ib
>>>  status: up
>>>  interfaces:
>>>  0: ib0
>>>  statistics:
>>>  send_count: 7
>>>  recv_count: 7
>>>   

Re: [lustre-discuss] lustre client can't moutn after configuring LNET with lnetctl

2017-09-29 Thread Dilger, Andreas
Riccardo,
I'm not an LNet expert, but a number of LNet multi-rail fixes are landed or 
being worked on for Lustre 2.10.1.  You might try testing the current b2_10 to 
see if that resolves your problems.

Cheers, Andreas

On Sep 27, 2017, at 21:22, Riccardo Veraldi  
wrote:
> 
> Hello.
> 
> I configure Multi-rail on my lustre environment.
> 
> MDS: 172.21.42.213@tcp
> OSS: 172.21.52.118@o2ib
> 172.21.52.86@o2ib
> Client: 172.21.52.124@o2ib
> 172.21.52.125@o2ib
> 
>  
> [root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
> nid                 refs  state  last  max  rtr  min  tx   min  queue
> 172.21.52.124@o2ib  1     NA     -1    128  128  128  128  128  0
> 172.21.52.125@o2ib  1     NA     -1    128  128  128  128  128  0
> 172.21.42.213@tcp   1     NA     -1    8    8    8    8    6    0
> 
> after configuring multi-rail I can see both infiniband interfaces peers on 
> the OSS and on the client side. 
> Anyway before multi-rail lustre client could mount the lustre FS without 
> problems.
> Now after multi-rail is set up the client cannot mount anymore the filesystem.
> 
> When I mount lustre from the client (fstab entry):
> 
> 172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0
> 
> the file system cannot be mounted and I got these errors
> 
> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842861] Lustre:
> 2490:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
> failed due to network error: [sent 1506562126/real 1506562126] 
> req@8808326b2a00 x1579744801849904/t0(0)
> o400->
> drplu-OST0001-osc-88085d134800@172.21.52.86@o2ib:28/4
>  lens
> 224/224 e 0 to 1 dl 1506562133 ref 1 fl Rpc:eXN/0/ rc 0/-1
> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.842872] Lustre:
> drplu-OST0001-osc-88085d134800: Connection to drplu-OST0001 (at
> 172.21.52.86@o2ib) was lost; in progress operations using this service
> will wait for recovery to complete
> Sep 27 18:28:46 drp-tst-lu10 kernel: [  596.843306] Lustre:
> drplu-OST0001-osc-88085d134800: Connection restored to
> 172.21.52.86@o2ib (at 172.21.52.86@o2ib)
> 
> 
> the mount point appears and disappears every few seconds from "df"
> 
> I do not have a clue on how to fix. The multi rail capability is important 
> for me.
> 
> I have Lustre 2.10.0 both client side and server side.
> here is my lnet.conf on the lustre client side. The one OSS side is
> similar just swapped peers for o2ib net.
> 
> net:
> - net type: lo
>   local NI(s):
> - nid: 0@lo
>   status: up
>   statistics:
>   send_count: 0
>   recv_count: 0
>   drop_count: 0
>   tunables:
>   peer_timeout: 0
>   peer_credits: 0
>   peer_buffer_credits: 0
>   credits: 0
>   lnd tunables:
>   tcp bonding: 0
>   dev cpt: 0
>   CPT: "[0]"
> - net type: o2ib
>   local NI(s):
> - nid: 172.21.52.124@o2ib
>   status: up
>   interfaces:
>   0: ib0
>   statistics:
>   send_count: 7
>   recv_count: 7
>   drop_count: 0
>   tunables:
>   peer_timeout: 180
>   peer_credits: 128
>   peer_buffer_credits: 0
>   credits: 1024
>   lnd tunables:
>   peercredits_hiw: 64
>   map_on_demand: 32
>   concurrent_sends: 256
>   fmr_pool_size: 2048
>   fmr_flush_trigger: 512
>   fmr_cache: 1
>   ntx: 2048
>   conns_per_peer: 4
>   tcp bonding: 0
>   dev cpt: -1
>   CPT: "[0]"
> - nid: 172.21.52.125@o2ib
>   status: up
>   interfaces:
>   0: ib1
>   statistics:
>   send_count: 5
>   recv_count: 5
>   drop_count: 0
>   tunables:
>   peer_timeout: 180
>   peer_credits: 128
>   peer_buffer_credits: 0
>   credits: 1024
>   lnd tunables:
>   peercredits_hiw: 64
>   map_on_demand: 32
>   concurrent_sends: 256
>   fmr_pool_size: 2048
>   fmr_flush_trigger: 512
>   fmr_cache: 1
>   ntx: 2048
>   conns_per_peer: 4
>   tcp bonding: 0
>   dev cpt: -1
>   CPT: "[0]"
> - net type: tcp
>   local NI(s):
> - nid: 172.21.42.195@tcp
>   status: up
>   interfaces:
>   0: enp7s0f0
>   statistics:
>   send_count: 51
>   recv_count: 51
>   drop_count: 0
>   tunables:
>   peer_timeout: 180
>   peer_credits: 8
>   peer_buffer_credits: 0
>   credits: 256
>   lnd tunables:
>   tcp bonding: 0
>   

Re: [lustre-discuss] E5-2667 or E5-2697A for MDS

2017-09-28 Thread Dilger, Andreas
On Sep 28, 2017, at 04:54, forrest.wc.l...@dell.com wrote:
> 
> Hello :   
>  
> Our customer is going to configure Lustre FS, which will have a lot of small 
> files to be accessed.
>  
> We are to configure 1TB Memory for MDS.
>  
> Regarding to CPU configuration , can we propose E5-2667 or E5-2697A for MDS 
> for good performance ?
>  
> E5-2667 v4 : 3.2GHz, 8 Cores
> E5-2697A v4: 2.6GHz, 16 cores

There is a good presentation showing CPU speed vs. cores vs. MDS performance:

https://www.eofs.eu/_media/events/lad14/03_shuichi_ihara_lustre_metadata_lad14.pdf

Normally, higher GHz is good for the MDS, but if it reduces the number of
cores by half, it may not be worthwhile.  It also depends on whether your
workloads are mostly parallel (in which case more cores * GHz is better),
or more serial (in which case a higher GHz is better).

In this case, cores * GHz is 3.2GHz * 8 = 25.6GHz, and 2.6GHz * 16 = 41.6GHz,
so you would probably get better aggregate performance from the E5-2697A as
long as you have sufficient client parallelism to drive the system heavily.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Query filesystem names

2017-09-27 Thread Dilger, Andreas
On Sep 26, 2017, at 11:26, Biju C P  wrote:
> 
> Hi,
> 
> I need to query the list of filesystem names configured on the Management 
> server (MGS) from client. Is there any lustre API available to query this ?

There isn't a way to get this from the client, but on the MGS the filesystems 
can be listed:

$ lctl get_param -n mgs.MGS.filesystems
myth
testfs
lustre

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.10.0 CentOS6.9 ksoftirqd CPU load

2017-09-27 Thread Dilger, Andreas
On Sep 26, 2017, at 01:10, Hans Henrik Happe  wrote:
> 
> Hi,
> 
> Did anyone else experience CPU load from ksoftirqd after 'modprobe
> lustre'? On an otherwise idle node I see:
> 
>  PID USER  PR   NI VIRT  RES  SHR S %CPU  %MEM TIME+   COMMAND
>9 root  20   0 000 S 28.5  0.0  2:05.58 ksoftirqd/1
> 
> 
>   57 root  20   0 000 R 23.9  0.0  2:22.91 ksoftirqd/13
> 
> The sum of those two is about 50% CPU.
> 
> I have narrowed it down to the ptlrpc module. When I remove that, it stops.
> 
> I also tested the 2.10.1-RC1, which is the same.

If you can run "echo l > /proc/sysrq-trigger" it will report the processes
that are currently running on the CPUs of your system to the console (and
also /var/log/messages, if it can write everything in time).

You might need to do this several times to get a representative sample of
the ksoftirqd process stacks to see what they are doing that is consuming
so much CPU.

Alternately, "echo t > /proc/sysrq-trigger" will report the stacks of all
processes to the console (and /v/l/m), but there will be a lot of them,
and no better chance that it catches what ksoftirqd is doing 25% of the time.
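For example, a crude way to take a handful of samples in a row (as root; count and interval
are arbitrary):

  for i in $(seq 1 10); do echo l > /proc/sysrq-trigger; sleep 2; done

and then look through /var/log/messages (or the console) for the ksoftirqd stacks.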

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Wrong --index set for OST

2017-09-26 Thread Dilger, Andreas
On Sep 26, 2017, at 07:35, Ben Evans  wrote:
> 
> I'm guessing on the osts, but what you'd want to do is to find files that
> are striped to a single OST using "lfs getstripe".  You'll need one file
> per OST.
> 
> After that, you'll have to do something like iterate through the OSTs to
> find the right combo where an ls -l works for that file.  Keep track of
> what OST indexes map to what devices, because you'll be destroying them
> pretty constantly until you resolve all of them.

I don't think you need to iterate through the configuration each time,
which would take ages to do.  Rather, just do the "lfs getstripe" on a
few files, and then find which OSTs have object IDs (under the O/0/d*
directories) that match the required index.

Essentially, just make a NxN grid of "current index" vs "actual index"
and then start crossing out boxes when the "lfs getstripe" returns an
OST object that doesn't actually exist on the OST (assuming the LFSCK
run didn't mess that up too badly).
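For example, a sketch of the cross-check for one file (the file path, object ID and device
name are hypothetical):

  # on a client: note the OST index and object ID the MDT has recorded for the file
  lfs getstripe /terra/somefile
  # on each candidate OST device (ldiskfs): check whether that object really exists;
  # objects live under O/0/d(object_id % 32), so object 1234571 would be in d11
  debugfs -c -R "stat O/0/d11/1234571" /dev/sdX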

> Each time you change an OST index, you'll need to do tunefs.lustre
> --writeconf on *all* devices to make them register with the MGS again.
> 
> -Ben Evans
> 
> On 9/26/17, 1:08 AM, "lustre-discuss on behalf of rodger"
>  rod...@csag.uct.ac.za> wrote:
> 
>> Dear All,
>> 
>> Apologies for nagging on this!
>> 
>> Does anyone have any insight on assessing progress of the lfsck?
>> 
>> Does anyone have experience of fixing incorrect index values on OST?
>> 
>> Regards,
>> Rodger
>> 
>> On 25/09/2017 11:21, rodger wrote:
>>> Dear All,
>>> 
>>> I'm still struggling with this. I am running an lfsck -A at present.
>>> The 
>>> status update is reporting:
>>> 
>>> layout_mdts_init: 0
>>> layout_mdts_scanning-phase1: 1
>>> layout_mdts_scanning-phase2: 0
>>> layout_mdts_completed: 0
>>> layout_mdts_failed: 0
>>> layout_mdts_stopped: 0
>>> layout_mdts_paused: 0
>>> layout_mdts_crashed: 0
>>> layout_mdts_partial: 0
>>> layout_mdts_co-failed: 0
>>> layout_mdts_co-stopped: 0
>>> layout_mdts_co-paused: 0
>>> layout_mdts_unknown: 0
>>> layout_osts_init: 0
>>> layout_osts_scanning-phase1: 0
>>> layout_osts_scanning-phase2: 12
>>> layout_osts_completed: 0
>>> layout_osts_failed: 30
>>> layout_osts_stopped: 0
>>> layout_osts_paused: 0
>>> layout_osts_crashed: 0
>>> layout_osts_partial: 0
>>> layout_osts_co-failed: 0
>>> layout_osts_co-stopped: 0
>>> layout_osts_co-paused: 0
>>> layout_osts_unknown: 0
>>> layout_repaired: 82358851
>>> namespace_mdts_init: 0
>>> namespace_mdts_scanning-phase1: 1
>>> namespace_mdts_scanning-phase2: 0
>>> namespace_mdts_completed: 0
>>> namespace_mdts_failed: 0
>>> namespace_mdts_stopped: 0
>>> namespace_mdts_paused: 0
>>> namespace_mdts_crashed: 0
>>> namespace_mdts_partial: 0
>>> namespace_mdts_co-failed: 0
>>> namespace_mdts_co-stopped: 0
>>> namespace_mdts_co-paused: 0
>>> namespace_mdts_unknown: 0
>>> namespace_osts_init: 0
>>> namespace_osts_scanning-phase1: 0
>>> namespace_osts_scanning-phase2: 0
>>> namespace_osts_completed: 0
>>> namespace_osts_failed: 0
>>> namespace_osts_stopped: 0
>>> namespace_osts_paused: 0
>>> namespace_osts_crashed: 0
>>> namespace_osts_partial: 0
>>> namespace_osts_co-failed: 0
>>> namespace_osts_co-stopped: 0
>>> namespace_osts_co-paused: 0
>>> namespace_osts_unknown: 0
>>> namespace_repaired: 68265278
>>> 
>>> with the layout_repaired and namespace_repaired values ticking up at
>>> about 1 per second.
>>> 
>>> Is the layout_osts_failed value of 30 a concern?
>>> 
>>> Is there any way to know how far along it is?
>>> 
>>> I am also seeing many messages similar to the following in
>>> /var/log/messages on the mdt and oss with OST:
>>> 
>>> Sep 25 10:48:00 mds0l210 kernel: LustreError:
>>> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans())
>>> terra-OST-osc-MDT: cannot cleanup orphans: rc = -22
>>> Sep 25 10:48:00 mds0l210 kernel: LustreError:
>>> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) Skipped
>>> 599 
>>> previous similar messages
>>> Sep 25 10:48:30 mds0l210 kernel: LustreError:
>>> 6137:0:(fld_handler.c:256:fld_server_lookup()) srv-terra-MDT:
>>> Cannot 
>>> find sequence 0x8: rc = -2
>>> Sep 25 10:48:30 mds0l210 kernel: LustreError:
>>> 6137:0:(fld_handler.c:256:fld_server_lookup()) Skipped 16593 previous
>>> similar messages
>>> Sep 25 10:58:01 mds0l210 kernel: LustreError:
>>> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans())
>>> terra-OST-osc-MDT: cannot cleanup orphans: rc = -22
>>> Sep 25 10:58:01 mds0l210 kernel: LustreError:
>>> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) Skipped
>>> 599 
>>> previous similar messages
>>> Sep 25 10:58:57 mds0l210 kernel: LustreError:
>>> 6137:0:(fld_handler.c:256:fld_server_lookup()) srv-terra-MDT:
>>> Cannot 
>>> find sequence 0x8: rc = -2
>>> Sep 25 10:58:57 mds0l210 kernel: LustreError:
>>> 6137:0:(fld_handler.c:256:fld_server_lookup()) Skipped 40309 previous
>>> similar messages

Re: [lustre-discuss] Wrong --index set for OST

2017-09-26 Thread Dilger, Andreas
On Sep 25, 2017, at 03:21, rodger  wrote:
> 
> Dear All,
> 
> I'm still struggling with this. I am running an lfsck -A at present.

I think running lfsck is the wrong thing to do in this case.  This is trying to 
"repair" the filesystem, but the OST indices are mixed up, so it will just be 
making the problem worse.

One thing you can look at is to dump the "CONFIGS/mountdata" file to see if the 
correct OST
index is still stored in that file.  Something like:

# debugfs -c -R "dump CONFIGS/mountdata /tmp/mountdata" /dev/
# strings /tmp/mountdata

I don't think the "CONFIGS/-OST" files will contain the correct OST 
index anymore
after the tunefs.lustre was run.

> The status update is reporting:
> 
> layout_mdts_init: 0
> layout_mdts_scanning-phase1: 1
> layout_mdts_scanning-phase2: 0
> layout_mdts_completed: 0
> layout_mdts_failed: 0
> layout_mdts_stopped: 0
> layout_mdts_paused: 0
> layout_mdts_crashed: 0
> layout_mdts_partial: 0
> layout_mdts_co-failed: 0
> layout_mdts_co-stopped: 0
> layout_mdts_co-paused: 0
> layout_mdts_unknown: 0
> layout_osts_init: 0
> layout_osts_scanning-phase1: 0
> layout_osts_scanning-phase2: 12
> layout_osts_completed: 0
> layout_osts_failed: 30
> layout_osts_stopped: 0
> layout_osts_paused: 0
> layout_osts_crashed: 0
> layout_osts_partial: 0
> layout_osts_co-failed: 0
> layout_osts_co-stopped: 0
> layout_osts_co-paused: 0
> layout_osts_unknown: 0
> layout_repaired: 82358851
> namespace_mdts_init: 0
> namespace_mdts_scanning-phase1: 1
> namespace_mdts_scanning-phase2: 0
> namespace_mdts_completed: 0
> namespace_mdts_failed: 0
> namespace_mdts_stopped: 0
> namespace_mdts_paused: 0
> namespace_mdts_crashed: 0
> namespace_mdts_partial: 0
> namespace_mdts_co-failed: 0
> namespace_mdts_co-stopped: 0
> namespace_mdts_co-paused: 0
> namespace_mdts_unknown: 0
> namespace_osts_init: 0
> namespace_osts_scanning-phase1: 0
> namespace_osts_scanning-phase2: 0
> namespace_osts_completed: 0
> namespace_osts_failed: 0
> namespace_osts_stopped: 0
> namespace_osts_paused: 0
> namespace_osts_crashed: 0
> namespace_osts_partial: 0
> namespace_osts_co-failed: 0
> namespace_osts_co-stopped: 0
> namespace_osts_co-paused: 0
> namespace_osts_unknown: 0
> namespace_repaired: 68265278
> 
> with the layout_repaired and namespace_repaired values ticking up at about 
> 1 per second.
> 
> Is the layout_osts_failed value of 30 a concern?
> 
> Is there any way to know how far along it is?
> 
> I am also seeing many messages similar to the following in /var/log/messages 
> on the mdt and oss with OST:
> 
> Sep 25 10:48:00 mds0l210 kernel: LustreError: 
> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) 
> terra-OST-osc-MDT: cannot cleanup orphans: rc = -22
> Sep 25 10:48:00 mds0l210 kernel: LustreError: 
> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) Skipped 599 
> previous similar messages
> Sep 25 10:48:30 mds0l210 kernel: LustreError: 
> 6137:0:(fld_handler.c:256:fld_server_lookup()) srv-terra-MDT: Cannot find 
> sequence 0x8: rc = -2
> Sep 25 10:48:30 mds0l210 kernel: LustreError: 
> 6137:0:(fld_handler.c:256:fld_server_lookup()) Skipped 16593 previous similar 
> messages
> Sep 25 10:58:01 mds0l210 kernel: LustreError: 
> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) 
> terra-OST-osc-MDT: cannot cleanup orphans: rc = -22
> Sep 25 10:58:01 mds0l210 kernel: LustreError: 
> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) Skipped 599 
> previous similar messages
> Sep 25 10:58:57 mds0l210 kernel: LustreError: 
> 6137:0:(fld_handler.c:256:fld_server_lookup()) srv-terra-MDT: Cannot find 
> sequence 0x8: rc = -2
> Sep 25 10:58:57 mds0l210 kernel: LustreError: 
> 6137:0:(fld_handler.c:256:fld_server_lookup()) Skipped 40309 previous similar 
> messages
> 
> Do these indicate that the process is not working?
> 
> Regards,
> Rodger
> 
> 
> 
> On 23/09/2017 15:07, rodger wrote:
>> Dear All,
>> In the process of upgrading 1.8.x to 2.x I've messed up a number of the 
>> index values for OSTs by running tune2fs with the --index value set. To 
>> compound matters while trying to get the OSTs to mount I erased the 
>> last_rcvd files on the OSTs. I'm looking for a way to confirm what the index 
>> should be for each device. Part of the reason for my difficulty is that in 
>> the evolution of the filesystem some OSTs were decommissioned and so the 
>> full set no longer has a sequential set of index values. In practicing for 
>> the upgrade the trial sets that I created did have nice neat sequential 
>> indexes and the process I developed broke when I used the real data. :-(
>> The result is that although the lustre filesystem mounts and all directories 
>> appear to be listed files in directories mostly have question marks for 
>> attributes and are not available for access. I'm assuming this is because 
>> the index for the OST holding the file is wrong.
>> Any pointers to recovery would be much appreciated!
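
For reference, a minimal sketch of commands commonly used to watch LFSCK progress and to
confirm the index recorded on an OST (fsname, target names and device paths below are
placeholders, not taken from this system):

   lctl get_param mdd.<fsname>-MDT0000.lfsck_layout       # phase and repaired counters on the MDS
   lctl get_param obdfilter.<fsname>-OST*.lfsck_layout    # per-OST view on each OSS
   tunefs.lustre --dryrun /dev/<ostdev> | grep -i index   # index stored in the OST label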

Re: [lustre-discuss] Lustre poor performance

2017-08-23 Thread Dilger, Andreas
On Aug 23, 2017, at 08:39, Mohr Jr, Richard Frank (Rick Mohr)  
wrote:
> 
> 
>> On Aug 22, 2017, at 7:14 PM, Riccardo Veraldi 
>>  wrote:
>> 
>> On 8/22/17 9:22 AM, Mannthey, Keith wrote:
>>> Younot expected.
>>> 
>> yes they are automatically used on my Mellanox and the script ko2iblnd-probe 
>> seems like not working properly.
> 
> The ko2iblnd-probe script looks in /sys/class/infiniband for device names 
> starting with “hfi” or “qib”.  If it detects those, it decides that the 
> “profile” it should use is “opa” so then it basically invokes the 
> ko2iblnd-opa modprobe line.  But the script has no logic to detect other 
> types of card (i.e. - mellanox), so in those cases, no ko2iblnd options are 
> used and you end up with the default module parameters being used.
> 
> If you want to use the script, you will need to modify ko2iblnd-probe to add 
> a new case for your brand of HCA and then add an appropriate 
> ko2iblnd- line to ko2iblnd.conf.
> 
> Or just do what I did and comment out all the lines in ko2iblnd.conf and add 
> your own lines.

If there are significantly different options needed for newer Mellanox HCAs 
(e.g. as between Qlogic/OPA and MLX) it would be great to get a patch to 
ko2iblnd-probe and ko2iblnd.conf that adds those options as the default for the 
new type of card, so that Lustre works better out of the box.  That helps 
transfer the experience of veteran IB users to users that may not have the 
background to get the best LNet IB performance.
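
As a sketch of what such an addition to ko2iblnd.conf could look like (the option values
here are illustrative assumptions, not tuned recommendations, and the "mlx" profile name
would need a matching detection case added to ko2iblnd-probe):

   alias ko2iblnd-mlx ko2iblnd
   options ko2iblnd-mlx peer_credits=32 peer_credits_hiw=16 concurrent_sends=64 map_on_demand=32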

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre-client is obsoleted

2017-07-26 Thread Dilger, Andreas
This is a minor problem in the .spec file and has been fixed. 

The reason for the Obsoletes was to allow installing server RPMs on clients, 
but it should have only obsoleted older versions. 
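
For reference, a corrected pair of directives in the client .spec would look roughly like
this (the exact macro usage is illustrative):

   Obsoletes: lustre-client < %{version}
   Provides:  lustre-client = %{version}-%{release}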

Cheers, Andreas

> On Jul 26, 2017, at 10:24, Jon Tegner  wrote:
> 
> Hi,
> 
> when trying to update clients from 2.9 to 2.10.0 (on CentOS-7) I received the 
> following:
> 
> "Package lustre-client is obsoleted by lustre, trying to install 
> lustre-2.10.0-1.el7.x86_64 instead"
> 
> and then the update failed (my guess is because zfs-related packages are missing 
> on the system; at the moment I don't intend to use zfs).
> 
> I managed to get past this by forcing the installation of the client, i.e.,
> 
> "yum install lustre-client-2.10.0-1.el7.x86_64.rpm"
> 
> Just curious, is lustre-client really obsoleted?
> 
> Regards,
> 
> /jon
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] How does Lustre client side caching work?

2017-07-26 Thread Dilger, Andreas
Lustre currently only uses RAM for client side cache. This is kept coherent 
across all clients by the LDLM, but is not persistent across reboots.

We have discussed integration of fscache with the Lustre client to allow 
persistent cache on NVMe/Optane/NVRAM or other fast local storage. IMHO, there 
is little benefit to cache on slower local devices (e.g. HDD) since Lustre can 
read over the network (assuming IB and decent servers) at a large fraction of 
the PCI bandwidth. That would only be a win over WAN or other slow networks.

Cheers, Andreas

On Jul 25, 2017, at 20:10, Joakim Ziegler <joa...@terminalmx.com> wrote:

Hello, I'm pretty new to Lustre, we're looking at setting up a Lustre cluster 
for storage of media assets (something in the 0.5-1PB range to start with, 
maybe 6 OSSes (in HA pairs), running on our existing FDR IB network). It looks 
like a good match for our needs, however, there's an area I've been unable to 
find details about. Note that I'm just investigating for now, I have no running 
Lustre setup.

There are plenty of references to Lustre using client side caching, and how the 
Distributed Lock Manager makes this work. However, I can't find almost any 
information about how the client side cache actually works. When I first heard 
it mentioned, I imagined something like the ZFS L2ARC, where you can add a 
device (say, a couple of SSDs) to the client and point Lustre at it to use it 
for caching. But some references I come across just talk about the normal 
kernel page cache, which is probably smaller and less persistent than what I'd 
like for our usage.

Could anyone enlighten me? I have a large dataset, but clients typically use a 
small part of it at any given time, and uses it quite intensively, so a 
client-side cache (either a read cache or ideally a writeback cache) would 
likely reduce network traffic and server load quite a bit. We've been using NFS 
over RDMA and fscache to get a read cache that does roughly this so far on our 
existing file servers, and it's been quite effective, so I imagine we could 
also benefit from something similar as we move to Lustre.

--
Joakim Ziegler  -  Supervisor de postproducción  -  Terminal
joa...@terminalmx.com   -   044 55 2971 8514   -  
 5264 0864
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] backup zfs MDT or migrate from ZFS back to ldiskfs

2017-07-22 Thread Dilger, Andreas
Using rsync or tar to backup/restore a ZFS MDT is not supported, because this 
changes the dnode numbering, but ZFS OI Scrub is not yet implemented (there is 
a Jira ticket for this, and some work is underway there).

Options include using zfs send/recv, as you were using, or just incrementally 
replacing the disks in the pool one at a time and letting them resilver to the 
SSDs (assuming they are larger than the HDDs they are replacing).
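
A minimal sketch of the second option, assuming the MDT pool is named "mdt0" and the SSDs
are at least as large as the HDDs (device names are placeholders):

   zpool replace mdt0 old_hdd1 new_ssd1    # resilver onto the SSD, one disk at a time
   zpool status mdt0                       # wait for the resilver to finish before the next disk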

I'm not sure why send/recv is so slow and exploding the metadata size, but it 
might relate to the ashift=12 on the target and ashift=9 on the source?  This 
can be particularly bad with RAIDz compared to mirrors, since small blocks (as 
typically used on the MDT) will always need to write 16KB instead of 8 or 12KB 
(with 2 or 3 mirrors).

Cheers, Andreas

On Jul 22, 2017, at 07:48, Raj wrote:

Stu,
Is there a reason why you picked Raidz 3 rather than 4 way mirror across 4 
disks?
Raidz 3 parity calculation might take more cpu resources rather than mirroring 
across disks but also the latency may be higher in mirroring to sync across all 
the disks. Wondering if you did some testing before deciding it.

On Fri, Jul 21, 2017 at 12:27 AM Stu Midgley <sdm...@gmail.com> wrote:
we have been happily using 2.9.52+0.7.0-rc3 for a while now.

The MDT is a raidz3 across 4 disks.

On Fri, Jul 21, 2017 at 1:19 PM, Isaac Huang wrote:
On Fri, Jul 21, 2017 at 12:54:15PM +0800, Stu Midgley wrote:
> Afternoon
>
> I have an MDS running on spinning media and wish to migrate it to SSD's.
>
> Lustre 2.9.52
> ZFS 0.7.0-rc3

This may not be a stable combination - I don't think Lustre officially
supports 0.7.0-rc yet. Plus, there's a recent Lustre osd-zfs bug and
its fix hasn't been back ported to 2.9 yet (to the best of my knowledge):
https://jira.hpdd.intel.com/browse/LU-9305

> How do I do it?

Depends on how you've configured the MDT pool. If the disks are
mirrored or just plan disks without any redundancy (i.e. not RAIDz),
you can simply attach the SSDs to the hard drives to form or extend
mirrors and then detach the hard drives - see zpool attach/detach.

-Isaac



--
Dr Stuart Midgley
sdm...@gmail.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] FID used by two objects

2017-07-22 Thread Dilger, Andreas
On Jul 17, 2017, at 22:48, wanglu  wrote:
> 
> Hello, 
> 
> One OST of our system cannot be mounted in lustre mode after a severe disk 
> error and a 5-day e2fsck.  Here are the errors we got during the mount 
> operation.
> #grep FID /var/log/messages
> Jul 17 20:15:21 oss04 kernel: LustreError: 
> 13089:0:(osd_oi.c:653:osd_oi_insert()) lustre-OST0036: the FID 
> [0x20005:0x1:0x0] is used by two objects: 86/3303188178 48085/1708371613
> Jul 17 20:38:41 oss04 kernel: LustreError: 
> 13988:0:(osd_oi.c:653:osd_oi_insert()) lustre-OST0036: the FID 
> [0x20005:0x1:0x0] is used by two objects: 86/3303188178 48086/3830163079
> Jul 17 20:49:55 oss04 kernel: LustreError: 
> 14221:0:(osd_oi.c:653:osd_oi_insert()) lustre-OST0036: the FID 
> [0x20005:0x1:0x0] is used by two objects: 86/3303188178 48087/538285899
> Jul 18 11:39:25 oss04 kernel: LustreError: 
> 31071:0:(osd_oi.c:653:osd_oi_insert()) lustre-OST0036: the FID 
> [0x20005:0x1:0x0] is used by two objects: 86/3303188178 48088/2468309129
> Jul 18 11:39:56 oss04 kernel: LustreError: 
> 31170:0:(osd_oi.c:653:osd_oi_insert()) lustre-OST0036: the FID 
> [0x20005:0x1:0x0] is used by two objects: 86/3303188178 48089/2021195118
> Jul 18 12:04:31 oss04 kernel: LustreError: 
> 32127:0:(osd_oi.c:653:osd_oi_insert()) lustre-OST0036: the FID 
> [0x20005:0x1:0x0] is used by two objects: 86/3303188178 48090/956682248

The numbers printed here are ldiskfs inode numbers, 86 and 48090.  The FID 
[0x20005:0x1:0x0] is the user quota file, so these files may be in the 
quota_slave directory.

> and the mount operation is failed with error -17
> Jul 18 12:04:31 oss04 kernel: LustreError: 
> 32127:0:(osd_oi.c:653:osd_oi_insert()) lustre-OST0036: the FID 
> [0x20005:0x1:0x0] is used by two objects: 86/3303188178 48090/956682248
> Jul 18 12:04:31 oss04 kernel: LustreError: 
> 32127:0:(qsd_lib.c:418:qsd_qtype_init()) lustre-OST0036: can't open slave 
> index copy [0x20006:0x2:0x0] -17
> Jul 18 12:04:31 oss04 kernel: LustreError: 
> 32127:0:(obd_mount_server.c:1723:server_fill_super()) Unable to start 
> targets: -17
> Jul 18 12:04:31 oss04 kernel: Lustre: Failing over lustre-OST0036
> Jul 18 12:04:32 oss04 kernel: Lustre: server umount lustre-OST0036 complete
> 
> If you run e2fsck again, the command will claim that inode 480xx has two 
> references and move 480xxx to lost+found.
> # e2fsck -f /dev/sdn 
> e2fsck 1.42.12.wc1 (15-Sep-2014)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Unattached inode 48090
> Connect to /lost+found? yes
> Inode 48090 ref count is 2, should be 1.  Fix? yes
> Pass 5: Checking group summary information
> 
> lustre-OST0036: * FILE SYSTEM WAS MODIFIED *
> lustre-OST0036: 238443/549322752 files (4.4% non-contiguous), 
> 1737885841/2197287936 blocks
> 
> Is it possible to find the file corresponding to 86/3303188178 and delete it ?

You could just delete the 48090 file from lost+found (or move it out of the 
Lustre filesystem for backup) and it should solve the problem.
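
A rough sketch of that cleanup, with the OST unmounted from Lustre (device and destination
paths are placeholders; keep a copy of anything you remove):

   mount -t ldiskfs /dev/sdn /mnt/ost_ldiskfs
   ls -li /mnt/ost_ldiskfs/lost+found             # e2fsck names the entry after the inode, e.g. #48090
   mv "/mnt/ost_ldiskfs/lost+found/#48090" /root/ost0036-rescue/
   umount /mnt/ost_ldiskfs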

> P.S  1. in ldiskfs mode,  most of the disk files are OK to read, while some 
> of them are red. 
>2.  there are about 240'000 objects in the OST. 
>   [root@oss04 d0]# df -i /lustre/ostc
> FilesystemInodes  IUsed IFree IUse% Mounted on
> /dev/sdn   549322752 238443 5490843091% /lustre/ostc
>3.  Lustre Version 2.5.3,  e2fsprog version 

This is an old version of Lustre and e2fsprogs, you would be much better off to 
upgrade.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.0 ZFS version

2017-07-17 Thread Dilger, Andreas
To be clear - we do not _currently_ build the Lustre RPMs against a binary RPM 
from ZoL, but rather build our own ZFS RPM packages, then build the Lustre RPMs 
against those packages.  This was done because ZoL didn't provide binary RPM 
packages when we started using ZFS, and we are currently not able to ship the 
binary RPM packages ourselves.

We are planning to change the Lustre build process to use the ZoL pre-packaged 
binary RPMs for Lustre 2.11, so that the binary RPM packages we build can be 
used together with the ZFS RPMs installed by end users.  If that change is not 
too intrusive, we will also try to backport this to b2_10 for a 2.10.x 
maintenance release.
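
Until then, the workaround Götz describes below amounts to rebuilding the Lustre source
RPM against the ZFS packages installed locally, roughly (package and file names are
assumptions for a CentOS 7 system):

   yum install zfs kmod-zfs-devel libzfs2-devel
   rpmbuild --rebuild lustre-2.10.0-1.src.rpm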

Cheers, Andreas

On Jul 17, 2017, at 10:42, Götz Waschk  wrote:
> 
> Hi Peter,
> 
> I wasn't able to install the official binary build of
> kmod-lustre-osd-zfs, even with kmod-zfs-0.6.5.9-1.el7_3.centos from
> from zfsonlinux.org, the ksym deps do not match. For me, it is always
> rebuilding the lustre source rpm against the zfs kmod packages.
> 
> Regards, Götz Waschk
> 
> On Mon, Jul 17, 2017 at 2:39 PM, Jones, Peter A  
> wrote:
>> 0.6.5.9 according to lustre/Changelog. We have tested with pre-release 
>> versions of 0.7 during the release cycle too if that’s what you’re wondering.
>> 
>> 
>> 
>> 
>> On 7/17/17, 1:55 AM, "lustre-discuss on behalf of Götz Waschk" 
>> <goetz.was...@gmail.com> wrote:
>> 
>>> Hi everyone,
>>> 
>>> which version of kmod-zfs was the official Lustre 2.10.0 binary
>>> release for CentOS 7.3 built against?
>>> 
>>> Regards, Götz Waschk
>>> ___
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] set OSTs read only ?

2017-07-16 Thread Dilger, Andreas
When you write "MGS", you really mean "MDS". The MGS would be the place for 
this if you were changing the config to permanently deactivate the OSTs via 
"lctl conf_param". To temporarily do this, the commands should be run on the 
MDS via "lctl set_param".  In most cases the MDS and MGS are co-located, so the 
distinction is irrelevant, but good to get it right for the record.

The problem of objects not being unlinked until after the MDS is restarted has 
been fixed.

Also, with 2.9 and later it is possible to use "lctl set_param 
osp.<fsname>-OSTxxxx*.create_count=0" to stop new file allocation on that OST without 
blocking unlinks at all, which is best for emptying old OSTs, rather than 
using "deactivate".

For marking the OSTs read-only, both of these solutions will not prevent 
clients from modifying the OST filesystems, just from creating new files 
(assuming all OSTs are set this way).

You might consider to try "mount -o remount,ro" on the MDT and OST filesystems 
on the servers to see if this works (I haven't tested this myself). The problem 
might be that this prevents new clients from mounting.

It probably makes sense to add server-side read-only mounting as a feature. 
Could you please file a ticket in Jira about this?

Cheers, Andreas

On Jul 16, 2017, at 09:16, Bob Ball <b...@umich.edu> wrote:

I agree with Raj.  Also, I have noted with Lustre 2.7, that the space is not 
actually freed after re-activation of the OST, until the mgs is restarted.  I 
don't recall the reason for this, or know if this was fixed in later Lustre 
versions.

Remember, this is done on the mgs, not on the clients.  If you do it on a 
client, the behavior is as you thought.

bob

On 7/16/2017 11:10 AM, Raj wrote:

No. Deactivating an OST will not allow new objects (files) to be created. But a client 
can read AND modify existing objects (append to the file). Also, it will not 
free any space from deleted objects until the OST is activated again.

On Sun, Jul 16, 2017, 9:29 AM E.S. Rosenberg <esr+lus...@mail.hebrew.edu> wrote:
On Thu, Jul 13, 2017 at 5:49 AM, Bob Ball <b...@umich.edu> wrote:
On the mgs/mgt do something like:
lctl --device <fsname>-OST0019-osc-MDT<index> deactivate

No further files will be assigned to that OST.  Reverse with "activate".  Or 
reboot the mgs/mdt as this is not persistent.  "lctl dl" will tell you exactly 
what that device name should be for you.
Doesn't that also disable reads from the OST though?

bob


On 7/12/2017 6:04 PM, Alexander I Kulyavtsev wrote:
You may find advise from Andreas on this list (also attached below). I did not 
try setting fail_loc myself.

In 2.9 there is a setting, osp.*.max_create_count=0, described in LUDOC-305.

We used to set OST degraded as described in lustre manual.
It works most of the time but at some point I saw lustre errors in logs for 
some ops. Sorry, I do not recall details.

I am still not sure either of these approaches will work for you: setting OST 
degraded or fail_loc will make some OSTs be selected instead of others.
You may want to verify if these settings will trigger clean error on user side 
(instead of blocking) when all OSTs are degraded.

The other and also simpler approach would be to enable lustre quota and set 
quota below used space for all users (or groups).

Alex.

From: "Dilger, Andreas" 
<andreas.dil...@intel.com<mailto:andreas.dil...@intel.com>>
Subject: Re: [lustre-discuss] lustre 2.5.3 ost not draining
Date: July 28, 2015 at 11:51:38 PM CDT
Cc: "lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>" 
<lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>>

Setting it degraded means the MDS will avoid allocations on that OST
unless there aren't enough OSTs to meet the request (e.g. stripe_count =
-1), so it should work.

That is actually a very interesting workaround for this problem, and it
will work for older versions of Lustre as well.  It doesn't disable the
OST completely, which is fine if you are doing space balancing (and may
even be desirable to allow apps that need more bandwidth for a widely
striped file), but it isn't good if you are trying to empty the OST
completely to remove it.

It looks like another approach would be to mark the OST as having no free
space using OBD_FAIL_OST_ENOINO (0x229) fault injection on that OST:

   lctl set_param fail_loc=0x229 fail_val=<ost_index>

This would cause the OST to return 0 free inodes from OST_STATFS for the
specified OST index, and the MDT would skip this OST completely.  To
disable all of the OSTs on an OSS use fail_val = -1.  It isn't possible
to selectively disable a subset of OSTs using this method.  The
OBD_FAIL_OST_ENOINO fail_loc has been available since Lustre 2.2, which
covers all of the 2.4+ versions that are affected 

Re: [lustre-discuss] [HPDD-discuss] Tiered storage

2017-07-13 Thread Dilger, Andreas

> On Jul 7, 2017, at 16:06, Abe Asraoui  wrote:
> 
> Hi All,
> 
> Does someone knows of a configuration guide for Lustre tiered storage ?

Abe,
I don't think there is an existing guide for this, but it is definitely 
something we are looking into.

Currently, the best way to manage different storage tiers in Lustre is via OST 
pools.  As of Lustre 2.9 it is possible to set a default OST pool on the whole 
filesystem (via "lfs setstripe" on the root directory) that is inherited for 
new files/directories that are created in directories that do not already have 
a default directory layout.  Also, some issues with OST pools were fixed in 2.9 
related to inheriting the pool from a parent/filesystem default if other 
striping parameters are specified on the command line (e.g. set pool on parent 
dir, then use "lfs setstripe -c 3" to create a new file).  Together, these make 
it much easier to manage different classes of storage within a single 
filesystem.
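
As a minimal sketch (filesystem name "myfs", pool name "flash" and the OST indices are
examples only):

   lctl pool_new myfs.flash                      # on the MGS
   lctl pool_add myfs.flash myfs-OST[0-3]
   lfs setstripe --pool flash /mnt/myfs          # filesystem-wide default pool (2.9+)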

Secondly, "lfs migrate" (and the helper script lfs_migrate) allow migration 
(movement) of files between OSTs (relatively) transparently to the 
applications.  The "lfs migrate" functionality (added in Lustre 2.5 I think) 
keeps the same inode, while moving the data from one set of OSTs to another set 
of OSTs, using the same options as "lfs setstripe" to specify the new file 
layout.  It is possible to migrate files opened for read, but it isn't possible 
currently to migrate files that are being modified (either this will cause 
migration to fail, or alternately it is possible to block user access to the 
file while it is being migrated).
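
For example (pool and path names are illustrative):

   lfs migrate -p disk -c 1 /mnt/myfs/project/output.dat   # move one file off the flash pool
   lfs_migrate -y /mnt/myfs/project/                        # helper script for whole trees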

The File Level Redundancy (FLR) feature currently under development (target 
2.11) will improve tiered storage with Lustre, by allowing the file to be 
mirrored on multiple OSTs, rather than having to be migrated to have a copy 
exclusively on a single set of OSTs.  With FLR it would be possible to mirror 
input files into e.g. flash-based OST pool before a job starts, and drop the 
flash mirror after the job has completed, without affecting the original files 
on the disk-based OSTs.  It would also be possible to write new files onto the 
flash OST pool, and then mirror the files to the disk OST pool after they 
finish writing, and remove the flash mirror of the output files once the job is 
finished.

There is still work to be done to integrate this FLR functionality into job 
schedulers and application workflows, and/or have a policy engine that manages 
storage tiers directly, but depending on what aspects you are looking at, some 
of the functionality is already available.


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client dkms package on kernel 4.4

2017-07-12 Thread Dilger, Andreas
On Jul 7, 2017, at 07:23, Riccardo Veraldi  
wrote:
> 
> trying to install lustre-dkms on 4.4.76-1.el7.elrepo.x86_64
> 
> Loading new lustre-client-2.9.59 DKMS files...
> Building for 4.4.76-1.el7.elrepo.x86_64
> Building initial module for 4.4.76-1.el7.elrepo.x86_64
> Done.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #0.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #1.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #2.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #3.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #4.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #5.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #6.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #7.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #8.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #9.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #10.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #11.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #12.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #13.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #14.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #15.
> dkms.conf: Error! Directive 'DEST_MODULE_LOCATION' does not begin with
> '/kernel', '/updates', or '/extra' in record #16.
> Error! Bad conf file.
> 
> anyone tried this before or has any hint to give me ?

Please see if there is anything in Jira (https://jira.hpdd.intel.com) about 
this, as DKMS is an area that people are working on.  If it isn't already 
fixed, then you should file a new ticket.
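
For reference, the error means the dkms.conf shipped in the package sets
DEST_MODULE_LOCATION to paths outside the prefixes dkms accepts; a well-formed record
looks roughly like this (the module name is illustrative):

   BUILT_MODULE_NAME[0]="lnet"
   DEST_MODULE_LOCATION[0]="/extra/lnet"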

I would help you further, but I'm not very familiar with the build system.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] How to added new interface in lustre server?

2017-07-11 Thread Dilger, Andreas
With newer versions of Lustre (maybe 2.7+ ?) it is possible to dynamically add 
network interfaces with lnetctl.  See the lnetctl(8) man page for details, if 
it is installed, which also means that you have this capability.
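
A minimal sketch, reusing the interface names from the example below (they are
placeholders for your own devices):

   lnetctl lnet configure
   lnetctl net add --net tcp1 --if p5p2
   lnetctl net show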

Cheers, Andreas

On Jul 8, 2017, at 09:11, Raj  wrote:
> 
> Lu,
> I will add ib0 just so that it is clearer in the future:
> options lnet networks=o2ib(ib0),tcp0(em1),tcp1(p5p2)
> 
> Reloading lnet module or restarting luster will disconnect all the clients. 
> But don't worry, the clients will reconnect once it comes back online.
> But if you have HA setup, I would recommend you to failover any OSTs that 
> this node is hosting to its partner. Once failed over, you can stop lnet by:
> # lustre_rmmod
> And reload lnet by
> #modprobe lustre
> Check whether you have new nids available locally by:
> # lctl list_nids
> If everything looks good, you can failback the OSTs to its original OSS node.
> 
> Also, since you are adding a new lnet network (tcp1) or new NID to the 
> server, I believe you must change the lustre configuration information as 
> mentioned in the manual unless somebody here says its not necessary.
> http://wiki.old.lustre.org/manual/LustreManual20_HTML/LustreMaintenance.html#50438199_31353
> 
> Thanks
> -Raj
>  
> 
> 
> On Fri, Jul 7, 2017 at 1:28 AM Wei-Zhao Lu  wrote:
> Hi ALL,
> 
> My lustre server is 2.5.3, and there are 2 interfaces (ib0, tcp).
> module parameter is "options lnet networks=o2ib,tcp"
> 
> Now, I changed module parameter as "options lnet 
> networks=o2ib,tcp0(em1),tcp1(p5p2)"
> How to reload lnet module or restart lustre service?
> There are many lustre clients running jobs, and I want to avoid any bad effect on 
> these clients.
> 
> Thanks a lot.
> 
> Best Regards,
> Lu
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LUG 2017 Videos

2017-07-11 Thread Dilger, Andreas
On Jul 8, 2017, at 15:30, Rawlings, Kenrick D  wrote:
> 
> Videos from LUG 2017 are now online. Robert Ping has made them available on 
> YouTube and created a playlist with all of the presentation videos:
> 
> https://www.youtube.com/playlist?list=PLqi-7yMgvZy-cNZR4jTU1RPvnH_yfydrM
> 
> Links to each video have also been added to the Lustre Wiki and OpenSFS 
> website:
> 
> http://wiki.lustre.org/Lustre_User_Group_2017
> http://opensfs.org/lug-2017-agenda

Ken, thanks for your ongoing efforts in updating the LUG wiki.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OSTs per OSS with ZFS

2017-07-06 Thread Dilger, Andreas
On Jul 6, 2017, at 15:43, Nathan R.M. Crawford <nrcra...@uci.edu> wrote:
> 
> On a somewhat-related question, what are the expected trade-offs when 
> splitting the "striping" between ZFS (striping over vdevs) and Lustre 
> (striping over OSTs)? 
> 
> Specific example: if one has an OSS with 40 disks and intends to use 10-disk 
> raidz2 vdevs, how do these options compare?:
> 
> A) 4 OSTs, each on a zpool with a single raidz2 vdev,
> B) 2 OSTs, each on a zpool with two vdevs, and
> C) 1 OST, on a zpool with 4 vdevs?
> 
>   I've done some simple testing with obdfilter-survey and multiple-client 
> file operations on some actual user data, and am leaning more toward "A". 
> However, the differences weren't overwhelming, and I am probably neglecting 
> some important corner cases. Handling striping pattern at the Lustre level 
> (A) also allows tuning on a per-file basis.

As long as  your filesystem isn't going to have so many OSTs that it runs into 
scaling limits (over 2000 OSTs currently), it is typically true that having 
more independent OSTs (case A) gives somewhat better aggregate throughput when 
driven by many clients/threads, because the ZFS transaction commits are 
independent, compared to case C where all of the disks are inevitably waiting 
for the one disk that is slightly more busy/slow than the others on every 
transaction.  This is akin to jitter in parallel compute jobs.  Also, more 
independent OSTs reduces the amount of data lost in case of catastrophic 
failures of one OST.

That said, there are also drawbacks to having more OSTs as in case A.  This 
fragments your free space, and if you have a large number of clients it means 
the filesystem will need to be more conservative in space allocation as the 
filesystem fills with 4x as much free space as with case C.   On a per-OST 
basis you are also more likely to run out of space on a single OST when they 
are smaller.  Also, it isn't possible for single-stripe files to get as good 
performance with 4 separate OSTs as it is with one single large OST, assuming 
you are not already limited by the OSS network bandwidth (in which case you may 
as well just go with case C because the "extra performance" is unusable and you 
are just adding configuration/maintenance overhead).

Having at least 3 VDEVs in a single pool also slightly improves ZFS space 
efficiency and robustness, and reduces configuration management complexity and 
admin overhead, so if the performance is roughly the same I'd be inclined 
toward fewer/larger OSTs.  If the performance is dramatically different and 
your system is not already very large then the added OSTs may be worthwhile.
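
As a rough sketch of the difference between case A and case C on one OSS (fsname, pool
and device names are examples):

   # case A: one raidz2 vdev per pool, one OST per pool (repeat for ost1..ost3)
   zpool create ost0 raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj
   mkfs.lustre --ost --backfstype=zfs --fsname=myfs --index=0 --mgsnode=mgs@o2ib ost0/ost0
   mount -t lustre ost0/ost0 /mnt/lustre/ost0
   # case C would instead put all four raidz2 vdevs into a single "zpool create" and make one OST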

Cheers, Andreas

> On Mon, Jul 3, 2017 at 1:15 AM, Dilger, Andreas <andreas.dil...@intel.com> 
> wrote:
>> We have seen performance improvements with multiple zpools/OSTs per OSS. 
>> However, with only 5x NVMe devices per OSS you don't have many choices in 
>> terms of redundancy, unless you are not using any redundancy at all, just 
>> raw bandwidth?
>> 
>> The other thing to consider is what the network bandwidth is vs. the NVMe 
>> bandwidth?  With similar test systems using NVMe devices without redundancy 
>> we've seen multi GB/s, so if you aren't using OPA/IB network then that will 
>> likely be your bottleneck. Even if the TCP is fast enough, the CPU overhead 
>> and data copies will probably kill the performance.
>> 
>> In the end, you can probably test with a few of configs to see which one 
>> will give the best performance - mirror, single RAID-Z, two RAID-Z pools on 
>> half-sized partitions, five no-redundancy zpools with one VDEV each, single 
>> no-redundancy zpool with five VDEVs.
>> 
>> Cheers, Andreas
>> 
>> PS - there is initial snapshot functionality in the 2.10 release.
>> 
>> > On Jul 2, 2017, at 10:07, Brian Andrus <toomuc...@gmail.com> wrote:
>> >
>> > All,
>> >
>> > We have been having some discussion about the best practices when creating 
>> > OSTs with ZFS.
>> >
>> > The basic question is: What is the best ration of OSTs per OSS when using 
>> > ZFS?
>> > It is easy enough to do a single OST with all disks and have reliable data 
>> > protection provided by ZFS. It may be an better scenario when snapshots of 
>> > lfs become a feature as well.
>> >
>> > However, multiple OSTs can mean more stripes and faster reads/writes. I 
>> > have seen some tests that were done quite some time ago which may not be 
>> > so valid anymore with the updates to Lustre.
>> >
>> > We have a system for testing that has 5 NVMes each. We can do 1 zfs file 
>> > system with all or we can separate 

Re: [lustre-discuss] uid/gid in changelog records

2017-07-04 Thread Dilger, Andreas
On Jun 28, 2017, at 22:22, Matthew Sanderson <matthew.sander...@anu.edu.au> 
wrote:
> 
> I was envisioning either unconditionally recording UID/GID, or perhaps making 
> this configurable at compile time. I hadn't considered runtime configuration 
> (distinct from registering a changelog consumer). Is this necessary?
> 
> Our site is probably not interested in recording NIDs. Although it would make 
> sense to include them in an audit trail, I'm concerned about the possible 
> overhead. Perhaps recording of NIDs could be configurable? I would prefer to 
> consider this later, if ever.
> 
> I had originally hoped to avoid increasing the number of records generated by 
> a given MPI job, by adding the UID and GID as extensions to a changelog 
> record type that was already being generated.
> However, in my testing, some actions which should be captured in even a 
> minimal audit trail, such as a sequence of open/read/write/close system 
> calls, do not generate any changelog records.
> 
> As a result, it's my understanding that generation of additional changelog 
> record(s) is unavoidable. Is this an accurate assessment?

Sebastien has filed https://jira.hpdd.intel.com/browse/LU-9727 to track this 
feature discussion and development.

Cheers, Andreas

> On 27/06/17 21:35, Dilger, Andreas wrote:
>> On Jun 27, 2017, at 01:18, Matthew Sanderson <matthew.sander...@anu.edu.au> 
>> wrote:
>>> Hi all,
>>> 
>>> Change logs would form a more complete audit trail if they contained a user 
>>> ID (and possibly also a primary group ID, maybe even all of the user's 
>>> supplementary group IDs).
>>> 
>>> Is there a particular reason why this information isn't currently stored in 
>>> changelog records?
>>> 
>>> After some investigation with my colleague (cc'd), it looks like this would 
>>> be a comparatively easy change to make. The information is already sent 
>>> over the wire to the MDS; it's just not persisted in the changelog.
>>> 
>>> The additional fields could be added to the userspace 'struct 
>>> changelog_rec' as an additional extension, similar to the way renames and 
>>> job IDs are stored. As far as I can tell, this wouldn't break compatibility 
>>> with existing applications that consume changelogs.
>> This definitely sounds interesting.  Originally, the ChangeLog was developed 
>> for tracking resync of changes to the filesystem, and the ownership of the 
>> files can be found by looking up the inode by FID.  Definitely there has 
>> been some interest in having auditing for Lustre.
>> 
>> Your analysis of the updated ChangeLog format is correct - it was 
>> implemented to allow addition of new fields, and I'd definitely be in 
>> support of your proposal to record the process UID/GID accessing the files, 
>> if auditing was enabled.  Would the client NID also need to be recorded?
>> 
>> I was thinking that enabling auditing for Lustre would potentially be too 
>> much overhead, but if this was limited to a single record for each UID/GID 
>> opening each file it would likely be fairly reasonable since only a single 
>> record would be needed for even a large MPI job.
>> 
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Principal Architect
>> Intel Corporation
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Problem with raising osc.*.max_rpcs_in_flight

2017-07-02 Thread Dilger, Andreas
It may also be that this histogram is limited to 32 buckets?
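
To see what the client is actually using, a quick check of the tunables (the values shown
are examples only):

   lctl get_param osc.*.max_rpcs_in_flight
   lctl set_param osc.*.max_rpcs_in_flight=64
   lctl get_param osc.*.max_dirty_mb        # commonly reviewed together; larger values may be needed (assumption)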

Cheers, Andreas

> On Jun 30, 2017, at 03:03, Reinoud Bokhorst  wrote:
> 
> Hi all,
> 
> I have a problem with raising the osc.*.max_rpcs_in_flight client
> setting on our Lustre 2.7.0. I am trying the increase the setting from
> 32 to 64 but according to osc.*.rpc_stats it isn't being used. The
> statistics still stop at 31 rpcs with high write request numbers, e.g.
> 
>                          read                      write
> rpcs in flight        rpcs   % cum %   |   rpcs   % cum %
> 0:   0   0   0   |  0   0   0
> 1:7293  38  38   |   2231  16  16
> 2:3872  20  59   |   1196   8  25
> 3:1851   9  69   |935   6  31
> --SNIP--
> 28:  0   0 100   | 89   0  87
> 29:  0   0 100   | 90   0  87
> 30:  0   0 100   | 94   0  88
> 31:  0   0 100   |   1573  11 100
> 
> I have modified some ko2iblnd driver parameters in an attempt to get it
> working:
> 
> options ko2iblnd peer_credits=128 peer_credits_hiw=128 credits=2048
> concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048
> fmr_flush_trigger=512 fmr_cache=1
> 
> Specifically I raised peer_credits_hiw to 128 as I've understood that it
> must be twice the value of max_rpcs_in_flight. Checking the module
> parameters that were actually loaded, I noticed that it was set to 127.
> So apparently it must be smaller than peers_credits. After noticing this
> I tried setting max_rpcs_in_flight to 60 but that didn't help either.
> Are there any other parameters affecting the max rpcs? Do all settings
> have to be powers of 2?
> 
> Related question; documentation on the driver parameters and how it all
> hangs together is rather scarce on the internet. Does anyone have some
> good pointers?
> 
> Thanks,
> Reinoud Bokhorst
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] uid/gid in changelog records

2017-06-27 Thread Dilger, Andreas
On Jun 27, 2017, at 01:18, Matthew Sanderson  
wrote:
> 
> Hi all,
> 
> Change logs would form a more complete audit trail if they contained a user 
> ID (and possibly also a primary group ID, maybe even all of the user's 
> supplementary group IDs).
> 
> Is there a particular reason why this information isn't currently stored in 
> changelog records?
> 
> After some investigation with my colleague (cc'd), it looks like this would 
> be a comparatively easy change to make. The information is already sent over 
> the wire to the MDS; it's just not persisted in the changelog.
> 
> The additional fields could be added to the userspace 'struct changelog_rec' 
> as an additional extension, similar to the way renames and job IDs are 
> stored. As far as I can tell, this wouldn't break compatibility with existing 
> applications that consume changelogs.

This definitely sounds interesting.  Originally, the ChangeLog was developed 
for tracking resync of changes to the filesystem, and the ownership of the 
files can be found by looking up the inode by FID.  Definitely there has been 
some interest in having auditing for Lustre.

Your analysis of the updated ChangeLog format is correct - it was implemented 
to allow addition of new fields, and I'd definitely be in support of your 
proposal to record the process UID/GID accessing the files, if auditing was 
enabled.  Would the client NID also need to be recorded?

I was thinking that enabling auditing for Lustre would potentially be too much 
overhead, but if this was limited to a single record for each UID/GID opening 
each file it would likely be fairly reasonable since only a single record would 
be needed for even a large MPI job.
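
For context, this is how changelogs are consumed today, which any new UID/GID fields would
need to stay compatible with (the fsname/MDT name and consumer id are examples):

   lctl --device myfs-MDT0000 changelog_register    # on the MDS; prints a consumer id such as cl1
   lfs changelog myfs-MDT0000                       # dump pending records
   lfs changelog_clear myfs-MDT0000 cl1 0           # acknowledge all consumed records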

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

