Re: [lustre-discuss] lnet_peer_ni_add_to_recoveryq

2020-03-09 Thread Chris Horn
(Re-sending my response to the list)

Yes, I believe that there are cases when problems on a remote node can be 
interpreted as local failures.


From: "nathan.dau...@noaa.gov" 
Date: Sunday, March 8, 2020 at 3:56 AM
To: Chris Horn , "lustre-discuss@lists.lustre.org" 

Cc: "nathan.dau...@noaa.gov" 
Subject: Re: [lustre-discuss] lnet_peer_ni_add_to_recoveryq
Resent-From: 
Resent-Date: Sunday, March 8, 2020 at 4:56 AM

Chris, all,

We are also seeing similar messages primarily on our servers, but from 
lnet_handle_local_failure() instead. I haven't found any issues with the local 
o2ib interface yet, but there _may_ be a correlation with a client hang. Could 
this also be caused on a server by remote network problems or a client dropping 
out, in spite of the "local" name?

Thanks,
Nathan


On Mar 6, 2020 1:10 PM, Chris Horn  wrote:

> lneterror: 10164:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked())
> lpni  added to recovery queue.  Health = 900

The message means that the health value of a remote peer interface has been 
decremented, and as a result, the interface has been put into recovery mode. 
This mechanism is part of the LNet health feature.

Health values are decremented when a PUT or GET fails. Usually there are other 
messages in the log that can tell you more about the specific failure. 
Depending on your network type you should probably see messages from socklnd or 
o2iblnd. Network congestion could certainly lead to message timeouts, which 
would in turn result in interfaces being placed into recovery mode.
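
For reference, a minimal sketch of how to inspect these values from the shell 
(assuming a 2.12+ lnetctl; the health reporting and verbosity flags vary by 
release):

    # local NI statistics, including health values, at high verbosity
    lnetctl net show -v 3
    # per-peer-NI health values
    lnetctl peer show -v 3
    # look for the underlying PUT/GET failure from the LND
    dmesg | grep -iE 'lnet|lnd'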

Chris Horn

On 3/6/20, 8:59 AM, "lustre-discuss on behalf of Michael Di Domenico" 
 
wrote:

along with the aforementioned error i also see these at the same time

lustreerror: 9675:0:(obd_config.c:1428:class_modify_config())
<...>-clilov-<...>; failed to send uevent qos_threshold_rr=100

On Fri, Mar 6, 2020 at 9:39 AM Michael Di Domenico
 wrote:
>
> On Fri, Mar 6, 2020 at 9:36 AM Degremont, Aurelien  
wrote:
> >
> > Did you see any actual error on your system?
> >
> > Because there is a patch that just decreases the verbosity level of such 
messages, so it looks like they could be ignored.
> > https://jira.whamcloud.com/browse/LU-13071
> > https://review.whamcloud.com/#/c/37718/
>
> thanks.  it's not entirely clear just yet.  i'm trying to track down a
> "slow jobs" issue.  i see these messages everywhere, so it might be a
> non-issue or a sign of something more pressing.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lnet_peer_ni_add_to_recoveryq

2020-03-09 Thread Chris Horn
Network failures cause an interface's health value to decrement. Recovery mode 
is the mechanism that raises the health value back up. Interfaces are pinged at 
a regular interval by the "lnet_monitor_thread". Successful pings increase the 
health value of the interface (remote or local).

When LNet is selecting the local and remote interfaces to use for a PUT or GET, 
it considers the health value of each interface. Healthier interfaces are 
preferred.
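
As a hedged illustration (tunable names assume a 2.12+ LNet; defaults and 
semantics differ across releases), the health-related globals can be viewed and 
adjusted with lnetctl:

    # show global tunables such as health_sensitivity and retry_count
    lnetctl global show
    # how much each failure decrements an interface's health value;
    # 0 disables health-based selection entirely
    lnetctl set health_sensitivity 100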

Chris Horn

On 3/9/20, 4:22 AM, "Degremont, Aurelien"  wrote:

What's the impact of being in recovery mode with LNET health?


Le 06/03/2020 21:12, « lustre-discuss au nom de Chris Horn » 
 a écrit :

> lneterror: 
10164:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked())
> lpni  added to recovery queue.  Health = 900

The message means that the health value of a remote peer interface has 
been decremented, and as a result, the interface has been put into recovery 
mode. This mechanism is part of the LNet health feature.

Health values are decremented when a PUT or GET fails. Usually there 
are other messages in the log that can tell you more about the specific 
failure. Depending on your network type you should probably see messages from 
socklnd or o2iblnd. Network congestion could certainly lead to message 
timeouts, which would in turn result in interfaces being placed into recovery 
mode.
    
Chris Horn

On 3/6/20, 8:59 AM, "lustre-discuss on behalf of Michael Di Domenico" 
 
wrote:

along with the aforementioned error i also see these at the same time

lustreerror: 9675:0:(obd_config.c:1428:class_modify_config())
<...>-clilov-<...>; failed to send uevent qos_threshold_rr=100

On Fri, Mar 6, 2020 at 9:39 AM Michael Di Domenico
 wrote:
>
> On Fri, Mar 6, 2020 at 9:36 AM Degremont, Aurelien 
 wrote:
> >
> > Did you see any actual error on your system?
> >
> > Because there is a patch that just decreases the verbosity level of 
such messages, so it looks like they could be ignored.
> > 
https://jira.whamcloud.com/browse/LU-13071
> > 
https://review.whamcloud.com/#/c/37718/
>
> thanks.  it's not entirely clear just yet.  i'm trying to track 
down a
> "slow jobs" issue.  i see these messages everywhere, so it might 
be a
> non-issue or a sign of something more pressing.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lnet_peer_ni_add_to_recoveryq

2020-03-06 Thread Chris Horn
> lneterror: 10164:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked())
> lpni  added to recovery queue.  Health = 900

The message means that the health value of a remote peer interface has been 
decremented, and as a result, the interface has been put into recovery mode. 
This mechanism is part of the LNet health feature.

Health values are decremented when a PUT or GET fails. Usually there are other 
messages in the log that can tell you more about the specific failure. 
Depending on your network type you should probably see messages from socklnd or 
o2iblnd. Network congestion could certainly lead to message timeouts, which 
would in turn result in interfaces being placed into recovery mode.

Chris Horn

On 3/6/20, 8:59 AM, "lustre-discuss on behalf of Michael Di Domenico" 
 
wrote:

along with the aforementioned error i also see these at the same time

lustreerror: 9675:0:(obd_config.c:1428:class_modify_config())
<...>-clilov-<...>; failed to send uevent qos_threshold_rr=100

On Fri, Mar 6, 2020 at 9:39 AM Michael Di Domenico
 wrote:
>
> On Fri, Mar 6, 2020 at 9:36 AM Degremont, Aurelien  
wrote:
> >
> > Did you see any actual error on your system?
> >
> > Because there is a patch that just decreases the verbosity level of such 
messages, so it looks like they could be ignored.
> > 
https://jira.whamcloud.com/browse/LU-13071
> > 
https://review.whamcloud.com/#/c/37718/
>
> thanks.  it's not entirely clear just yet.  i'm trying to track down a
> "slow jobs" issue.  i see these messages everywhere, so it might be a
> non-issue or a sign of something more pressing.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre rpm install creating a file that breaks lustre

2019-10-02 Thread Chris Horn
Anything in dmesg? We need to know _why_ the network failed to start.

Chris Horn

From: Kurt Strosahl 
Date: Wednesday, October 2, 2019 at 1:55 PM
To: Chris Horn , "lustre-discuss@lists.lustre.org" 

Subject: Re: [lustre-discuss] Lustre rpm install creating a file that breaks 
lustre

the lnet modules load, but when I start the lnet service it says that the 
network is down.  I backed everything out, removed the file, and then started 
the lnet service again and it worked properly.

____
From: Chris Horn 
Sent: Wednesday, October 2, 2019 2:48 PM
To: Kurt Strosahl ; lustre-discuss@lists.lustre.org 

Subject: [EXTERNAL] Re: [lustre-discuss] Lustre rpm install creating a file 
that breaks lustre


Might be best to open a ticket for this. What was the nature of the failure?



Chris Horn



From: lustre-discuss  on behalf of 
Kurt Strosahl 
Date: Wednesday, October 2, 2019 at 1:30 PM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] Lustre rpm install creating a file that breaks lustre



Good Afternoon,



While getting lustre 2.10.8 running on a RHEL 7.7 system I found that the 
RPM install was putting a file in /etc/modprobe.d that was preventing lnet from 
starting properly.



the file is ko2iblnd.conf, which contains the following...



alias ko2iblnd-opa ko2iblnd

options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 
concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4



install ko2iblnd /usr/sbin/ko2iblnd-probe



Our system is running InfiniBand, not Omni-Path, so I'm not sure why this file 
is being put in place.  Removing the file allows lnet to start properly.
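
A sketch of that workaround as shell steps (the file path is from the message 
above; the systemd unit name is an assumption):

    # move the OPA-tuned modprobe config aside so ko2iblnd loads normally
    mv /etc/modprobe.d/ko2iblnd.conf /root/ko2iblnd.conf.bak
    depmod -a
    modprobe ko2iblnd
    systemctl start lnet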



w/r,

Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre rpm install creating a file that breaks lustre

2019-10-02 Thread Chris Horn
Might be best to open a ticket for this. What was the nature of the failure?

Chris Horn

From: lustre-discuss  on behalf of 
Kurt Strosahl 
Date: Wednesday, October 2, 2019 at 1:30 PM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] Lustre rpm install creating a file that breaks lustre

Good Afternoon,

While getting lustre 2.10.8 running on a RHEL 7.7 system I found that the 
RPM install was putting a file in /etc/modprobe.d that was preventing lnet from 
starting properly.

the file is ko2iblnd.conf, which contains the following...

alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 
concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

install ko2iblnd /usr/sbin/ko2iblnd-probe

Our system is running InfiniBand, not Omni-Path, so I'm not sure why this file 
is being put in place.  Removing the file allows lnet to start properly.

w/r,

Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.10.5 compiler versions

2018-08-30 Thread Chris Horn
> Because checking for various kernel features requires starting a small 
> kernel module compile each time, which is slow.  If you have time to 
> investigate and optimize, that would be much appreciated.

To follow up on this point - look at the range of kernel versions supported!

   * Server known to build on patched kernels:
 2.6.32-431.29.2.el6 (RHEL6.5)
 2.6.32-504.30.3.el6 (RHEL6.6)
 2.6.32-573.26.1.el6 (RHEL6.7)
 2.6.32-642.15.1.el6 (RHEL6.8)
 2.6.32-696.18.7.el6 (RHEL6.9)
 3.10.0-862.9.1.el7  (RHEL7.5)
 3.0.101-0.47.71 (SLES11 SP3)
 3.0.101-107 (SLES11 SP4)
 3.12.74-60.64.40(SLES12 SP1)
 4.4.120-92.70   (SLES12 SP2)
 4.4.132-94.33   (SLES12 SP3)
 3.13.0-101  (Ubuntu 14.04, ZFS only)
 4.4.0-85.108(Ubuntu 14.04.5 LTS)
 4.4.0-131   (Ubuntu 16.04)
 vanilla linux 4.6.7 (ZFS only)
   * Client known to build on unpatched kernels:
 2.6.32-431.29.2.el6 (RHEL6.5)
 2.6.32-504.30.3.el6 (RHEL6.6)
 2.6.32-573.26.1.el6 (RHEL6.7)
 2.6.32-642.15.1.el6 (RHEL6.8)
 2.6.32-696.18.7.el6 (RHEL6.9)
 3.10.0-862.9.1.el7  (RHEL7.5)
 3.0.101-0.47.71 (SLES11 SP3)
 3.0.101-107 (SLES11 SP4)
 3.12.74-60.64.40(SLES12 SP1)
 4.4.120-92.70   (SLES12 SP2)
 4.4.133-94.33   (SLES12 SP3)
 3.13.0-101  (Ubuntu 14.04)
 4.4.0-85.108(Ubuntu 14.04.5 LTS)
 4.4.0-131   (Ubuntu 16.04)
 4.15.0-32   (Ubuntu 18.04)

Chris Horn

On 8/30/18, 2:34 PM, "lustre-discuss on behalf of Andreas Dilger" 
 
wrote:

On Aug 30, 2018, at 13:28, E.S. Rosenberg  
wrote:
> 
> HI everyone,
> 
> We just successfully built 2.10.5 on our Debian clients but to do so I had 
to revert to gcc-7 (from 8), is this a known issue? In general what compilers 
is building/testing done with?

I believe that there is a patch in Gerrit for fixing the GCC 8 compiler 
issues.  Testing and review of the patch is welcome.

> Also I was wondering how is it that the configure script takes longer to 
run than compiling everything?

Because checking for various kernel features requires starting a small 
kernel module compile each time, which is slow.  If you have time to 
investigate and optimize, that would be much appreciated.

Cheers, Andreas
---
Andreas Dilger
CTO Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Routing Question

2018-05-23 Thread Chris Horn
Hello,

I agree as others have stated that we would not expect the loss of a router to 
significantly affect the I/O destined for filesystems served by other routers, 
nor would we expect the I/O destined for non-routed filesystems to be affected. 
However, I can say that we have seen bugs in this area in the past where the 
loss of a remote filesystem (the servers, not the routers serving that 
filesystem) did affect access to other filesystems. If I recall correctly the 
issue was that resources were being consumed on the routers in trying to 
communicate with the lost filesystem. That resource consumption caused I/O 
destined for other filesystems to get backed up. I’m not aware of any 
outstanding issues like this, and I’ll stress that that sort of behavior would 
certainly be considered a bug. So please let us know if you see any issues.

Regarding check_routers_before_use, this parameter affects how the LNet router 
checker behaves upon startup. The router checker on an LNet peer works by 
periodically sending an LNet ping to each known router. If a peer receives a 
response from the router within a timeout period then the router is considered 
alive, otherwise it is considered dead and routes hosted by that router are 
removed from the routing table (until it starts responding to the pings). By 
default, all routers are initially considered to be up (alive), and all routes 
are immediately eligible for sends. When check_routers_before_use is enabled 
(set to “1”) all routers are instead initially considered down (dead), and all 
routes must first respond to an LNet level ping before the route becomes 
eligible for sends.
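
A sketch of how this is typically enabled (a module parameter, so it must be 
set before the lnet module loads; the conf file path is an assumption):

    # /etc/modprobe.d/lustre.conf
    options lnet check_routers_before_use=1

After startup, router aliveness can be checked with either of:

    lctl show_route
    lnetctl route show -v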

The use of this parameter should not affect the scenarios you describe. Traffic 
destined for local networks is not affected by the up or down (alive or dead) 
states of routers.

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Makia Minich <ma...@systemfabricworks.com>
Date: Wednesday, May 9, 2018 at 8:51 AM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] LNET Routing Question

Hello all,

I have an LNET routing question. I’ve attached a quick diagram of the current 
setup; but basically I have two core networks (one infiniband and one ethernet) 
with a set of LNET routers in between. There is storage and clients on both 
sides of these routers and all clients need to see all/most storage. All 
connections, configurations, etc are all working.

The question is, if an LNET router goes down (which does cause some amount of 
reconnect or remapping for any clients attempting to use those routes) would 
this cause any issues or delays for a client’s connection to non-routed 
storage? Put slightly differently: if a job on the ethernet clients is actively 
using ethernet storage and the lnet routers go down, will the job be affected? What 
about a new job just launching when that lnet router is down?

In addition, what does “check_routers_before_use” actually do and does it 
change the scenarios I mentioned? (e.g. If an ethernet client has 
“check_routers_before_use” would every file request start with a ping to the 
routers even if it’s not leaving its core network?)

Thanks!

—

Makia Minich
Principal Architect
System Fabric Works
"Fabric Computing that Works”

"Oh, I don't know. I think everything is just as it should be, y'know?”
- Frank Fairfield

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET routes

2018-01-26 Thread Chris Horn
The route selection algorithm looks at:

1. Priority
2. Hop count
3. Number of bytes in transit (or queued)
4. Number of credits available
5. Which was used last

So, everything else being equal, step 5 will ensure round-robin. But there are 
several other factors that are considered first.

At least, this was true before multi-rail. I’m not sure if that has changed 
things w.r.t. route selection.
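
As a hedged example of factors 1 and 2 (the gateway NIDs are hypothetical), two 
routes to the same remote network can be ranked with lnetctl, where a lower 
priority value is preferred:

    lnetctl route add --net o2ib --gateway 10.10.0.1@o2ib1 --priority 0   # preferred
    lnetctl route add --net o2ib --gateway 10.10.0.2@o2ib1 --priority 1   # fallback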

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Preeti Malakar <malakar.pre...@gmail.com>
Date: Friday, January 26, 2018 at 10:28 AM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] LNET routes

Hi,

I was wondering if someone can help me with the following question about LNETs:

When three writes are issued from a compute node to an OST, is it right that 
the order in which the LNETs (corresponding to that OST) are used to route the 
data from the compute node to the OST is the sequence in which LNETs are 
assigned to the OST, i.e. in round robin order? For e.g. if the LNETs for an 
OST were 14,410,1022,1246,2341,2438,3441,3594, then for three writes from a 
compute node, the LNETs used will be 14,410,1022. If there are writes from 
other nodes to the same OST at the same time, then this order (of LNETs) 
depends on the writes issued from other nodes as well, is that right?

Thanks,
Preeti

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.10.1 client fails to mount on 2.9 backend

2017-11-17 Thread Chris Horn
Is the MGS actually on tcp or is it on o2ib? Can you “lctl ping” the MGS LNet 
nid from the client where you’re trying to mount?
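
For example, a quick check using the NID from the mount attempt below:

    lctl ping 172.30.69.90@tcp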

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Christopher Johnston <chjoh...@gmail.com>
Date: Friday, November 17, 2017 at 3:17 PM
To: lustre-discuss <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] 2.10.1 client fails to mount on 2.9 backend

Just tested the new 2.10.1 client against one of my fileservers where I have 
2.9 running.   Works with 2.10.0 but not 2.10.1, is this expected?

# mount -t lustre 172.30.69.90:/qstore /mnt
mount.lustre: mount 172.30.69.90:/qstore at /mnt failed: No such file or 
directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

# dmesg -e | tail -4
[Nov17 16:15] LustreError: 22344:0:(ldlm_lib.c:483:client_obd_setup()) can't 
add initial connection
[  +0.009423] LustreError: 22344:0:(obd_config.c:608:class_setup()) setup 
MGC172.30.69.90@tcp failed (-2)
[  +0.009755] LustreError: 22344:0:(obd_mount.c:203:lustre_start_simple()) 
MGC172.30.69.90@tcp setup error -2
[  +0.010193] LustreError: 22344:0:(obd_mount.c:1505:lustre_fill_super()) 
Unable to mount  (-2)

-Chris
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Fwd: Re: Lustre compilation error

2017-10-21 Thread Chris Horn
I would need more information to help you. Maybe provide the complete terminal 
output of your build. Everything from getting the source to running ‘make rpms’.

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
parag_k <para...@citilindia.com>
Date: Friday, October 20, 2017 at 11:17 PM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Fwd: Re: Lustre compilation error


Hi,

Any solution on below issue ?

Regards,
Parag

 Original message 
From: parag_k <para...@citilindia.com>
Date: 19/10/2017 8:15 am (GMT+05:30)
To: "Dilger, Andreas" <andreas.dil...@intel.com>
Cc: Chris Horn <ho...@cray.com>, Lustre User Discussion Mailing List 
<lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre compilation error

Hi Dilger,

I extracted the src rpm of lustre 2.10.0 using 7zip and got the tarball of 
lustre 2.10.0.

Also, if you open the link mentioned below in a browser, you will find a 
snapshot option and can download Lustre.

But once you get the tarball, the compilation procedure will be the same as 
what I mentioned in my last mail, I guess.

Regards,
Parag
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre compilation error

2017-10-17 Thread Chris Horn
It would be helpful if you provided more context. How did you acquire the 
source? What was your configure line? Is there a set of build instructions that 
you are following?

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Parag Khuraswar <para...@citilindia.com>
Date: Tuesday, October 17, 2017 at 11:52 PM
To: 'Lustre User Discussion Mailing List' <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre compilation error

Hi,

Does any one have any idea on below issue?

Regards,
Parag


From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Parag Khuraswar
Sent: Tuesday, October , 2017 6:11 PM
To: 'Lustre User Discussion Mailing List'
Subject: [lustre-discuss] Lustre compilation error

Hi,

I am trying to make rpms from lustre 2.10.0 source. I get below error when I 
run “make”

==
make[4]: *** No rule to make target `fld.ko', needed by `all-am'.  Stop.
make[3]: *** [all-recursive] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all] Error 2
error: Bad exit status from 
/tmp/rpmbuild-lustre-root-Ssi5N0Xv/TMP/rpm-tmp.bKMjSO (%build)


RPM build errors:
Bad exit status from /tmp/rpmbuild-lustre-root-Ssi5N0Xv/TMP/rpm-tmp.bKMjSO 
(%build)
make: *** [rpms] Error 1
==

Regards,
Parag


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Routers and shortest path

2017-10-16 Thread Chris Horn
>In my case I imagined that each islet could have its own Lustre network. 
> This is simpler than each node having a separate network.
Right, that is what I would suggest doing.

>The main problem I saw is the HA part. If the islet-local router fails, 
> the islet nodes will not be able to join another router because it is in 
> another Lustre network
The routers would belong to all LNets (or at least one other one). 
Primary/secondary paths would be defined via either priority or hop count.

>Anyway, administrators refused to have more than one Lustre network for 
> the compute nodes. So I'm looking for another solution.
AFAIK, this is your only option short of developing your own code for this 
situation (which would be cool!).
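
A hedged sketch of that layout for one islet (all NIDs hypothetical; islet 
clients on their own LNet o2ib1, servers reachable on o2ib0, routers on both):

    # on an islet-1 client: prefer the islet-local router,
    # fail over to a router in another islet
    lnetctl route add --net o2ib0 --gateway 10.1.0.1@o2ib1 --priority 0
    lnetctl route add --net o2ib0 --gateway 10.2.0.1@o2ib1 --priority 1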

Chris Horn

On 10/16/17, 7:06 AM, "LOPEZ, ALEXANDRE" <alexandre.lo...@atos.net> wrote:

Chris,

In my case I imagined that each islet could have its own Lustre network. 
This is simpler than each node having a separate network. 

The main problem I saw is the HA part. If the islet-local router fails, the 
islet nodes will not be able to join another router because it is in another 
Lustre network. Of course, this can be solved adding more routers, but it if 
was already an administrative headache, it will now be an administrative 
headache^2. In addition, I don't know (although I must admit I didn't try) if 
it is possible to chain several routers.

Anyway, administrators refused to have more than one Lustre network for the 
compute nodes. So I'm looking for another solution.

Thanks.
Alex.

-Original Message-
From: Chris Horn [mailto:ho...@cray.com] 
Sent: Friday, October 13, 2017 9:55 PM
To: LOPEZ, ALEXANDRE; Sebastien Buisson
Cc: Lustre Discuss (lustre-discuss@lists.lustre.org)
Subject: Re: [lustre-discuss] Routers and shortest path

I think the only way to do this today is to assign the clients in each 
“islet” a unique LNet. What problems did that cause for you (besides the 
administrative headache?)

Chris Horn

On 10/13/17, 9:51 AM, "lustre-discuss on behalf of LOPEZ, ALEXANDRE" 
<lustre-discuss-boun...@lists.lustre.org on behalf of alexandre.lo...@atos.net> 
wrote:

Hi Sebastien.

It is in fact an asymmetric routing problem. But the way routes are 
declared today in Lustre makes it quite difficult to avoid in this particular 
context.

I was considering the possibility to add a flag, a special route, 
whatever, to force LNet to return the response to the same router the request 
arrived from. Nevertheless, since I started to look at Lustre's code today for 
the very first time, it will take quite some time before I get something 
useful. I don't even know if this is actually possible. If that ever happens, 
I'll be glad to contribute it.

Cheers,
Alejandro

-Original Message-
From: Sebastien Buisson [mailto:sbuis...@ddn.com] 
Sent: Friday, October 13, 2017 3:42 PM
To: LOPEZ, ALEXANDRE
Cc: Lustre Discuss (lustre-discuss@lists.lustre.org)
Subject: Re: [lustre-discuss] Routers and shortest path

Hi Alejandro!

This makes me think of an asymmetric routing problem. It could be 
addressed by implementing something like reverse path filtering 
(http://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.rpf.html) in LNet: nodes 
would not accept requests from peers through router B when they are configured 
to talk to those peers through router A only.

If there is no other ready for use solution and you are willing to 
contribute code :)

Cheers,
Sebastien.

> Le 13 oct. 2017 à 15:20, LOPEZ, ALEXANDRE <alexandre.lo...@atos.net> 
a écrit :
> 
> Hi everyone,
>  
> I’d like to have your opinion on a problem I’m facing. Sorry for the 
long mail but I failed to make it shorter without removing some important 
information.
>  
> Each islet on my cluster has a dedicated Lustre router connected to 
the interconnect and to a dedicated network where Lustre servers are reachable. 
Lustre servers are NOT on the main interconnect, thus the need for routers. Any 
router is reachable thru the interconnect from any node but, when the node and 
the router aren’t on the same islet, several switches (hops) need to be 
crossed. The idea is to use the shortest path to the servers thru the 
islet-local router.
>  
> I created the appropriate routes on each compute node to contact the 
islet-local Lustre router. There is also a lower-priority route to fail over a 
router on another islet in case the local Lustre router fails. (This could have 
also been done with the route’s hops, but my understanding is that the final 
result is the same.)

Re: [lustre-discuss] Routers and shortest path

2017-10-13 Thread Chris Horn
I think the only way to do this today is to assign the clients in each “islet” 
a unique LNet. What problems did that cause for you (besides the administrative 
headache?)

Chris Horn

On 10/13/17, 9:51 AM, "lustre-discuss on behalf of LOPEZ, ALEXANDRE" 
<lustre-discuss-boun...@lists.lustre.org on behalf of alexandre.lo...@atos.net> 
wrote:

Hi Sebastien.

It is in fact an asymmetric routing problem. But the way routes are 
declared today in Lustre makes it quite difficult to avoid in this particular 
context.

I was considering the possibility to add a flag, a special route, whatever, 
to force LNet to return the response to the same router the request arrived 
from. Nevertheless, since I started to look at Lustre's code today for the very 
first time, it will take quite some time before I get something useful. I don't 
even know if this is actually possible. If that ever happens, I'll be glad to 
contribute it.

Cheers,
Alejandro

-Original Message-
From: Sebastien Buisson [mailto:sbuis...@ddn.com] 
Sent: Friday, October 13, 2017 3:42 PM
To: LOPEZ, ALEXANDRE
Cc: Lustre Discuss (lustre-discuss@lists.lustre.org)
Subject: Re: [lustre-discuss] Routers and shortest path

Hi Alejandro!

This makes me think of an asymmetric routing problem. It could be addressed 
by implementing something like reverse path filtering 
(http://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.rpf.html) in LNet: nodes 
would not accept requests from peers through router B when they are configured 
to talk to those peers through router A only.

If there is no other ready for use solution and you are willing to 
contribute code :)

Cheers,
Sebastien.

> Le 13 oct. 2017 à 15:20, LOPEZ, ALEXANDRE <alexandre.lo...@atos.net> a 
écrit :
> 
> Hi everyone,
>  
> I’d like to have your opinion on a problem I’m facing. Sorry for the long 
mail but I failed to make it shorter without removing some important 
information.
>  
> Each islet on my cluster has a dedicated Lustre router connected to the 
interconnect and to a dedicated network where Lustre servers are reachable. 
Lustre servers are NOT on the main interconnect, thus the need for routers. Any 
router is reachable thru the interconnect from any node but, when the node and 
the router aren’t on the same islet, several switches (hops) need to be 
crossed. The idea is to use the shortest path to the servers thru the 
islet-local router.
>  
> I created the appropriate routes on each compute node to contact the 
islet-local Lustre router. There is also a lower-priority route to fail over a 
router on another islet in case the local Lustre router fails. (This could have 
also been done with the route’s hops, but my understanding is that the final 
result is the same.) I also created the routes on the Lustre servers for the 
responses to reach the clients thru the routes.
>  
> This seems to work as expected, but this is actually false.
>  
> Although the filesystem is mounted on the clients and works, there is a 
problem when there is no failure (all routers are up). The problem roots in the 
routes used to deliver the responses from the servers. If I assign priorities 
to the routes on the servers, the higher priority route will always be used to 
send the responses. So, if a compute node sent a request thru its islet’s 
router (the shortest path), the response will not return thru the same router 
but thru the one designated by the higher priority route, making the return 
path longer. Using hops is the same thing: the route with the lower hop value 
is chosen, but the same set of routes apply to all the nodes on all the islets 
and a valid value for an islet is not valid for all the others. If I assign 
neither priority nor hops, round-robin will be used and the next route on the 
list is selected.
>  
> The ideal solution would be for the response to follow the reverse path 
followed by the request (thru the same router) but I found no way to do it.
>  
> Is there any way to make the responses go the reverse (shortest) path?
>  
> Any other way to solve this?
>  
> I considered assigning a separate Lustre network to each islet but, 
although this solves this problem, it adds new ones; so I ended up discarding 
it.
>  
> I’m currently using Lustre 2.7 but I found nothing suggesting that 2.10 
will solve the problem.
>  
> Thanks for your time and answers.
>  
> Alexandre Lopez
> Big Data & Security – Data Management
> Bull SAS – Atos Technologies
>  
>  
>  
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi

Re: [lustre-discuss] Lnet not starting

2017-10-13 Thread Chris Horn
I’m not sure what you mean by “shows as partial”. I can’t find a (systemd) 
lustre.service file that is packaged with the community Lustre. Did you create 
your own? I would say it is good practice to load the modules even though it 
shouldn’t be strictly necessary. Performing a “mount -t lustre…” should pull in 
any necessary modules. If that isn’t happening maybe you just need to run 
depmod.
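
A minimal sketch of that sequence (filesystem name and MGS NID hypothetical):

    # rebuild module dependency information so mount can auto-load the modules
    depmod -a
    # this should pull in the lustre/lnet modules automatically
    mount -t lustre 10.0.0.1@o2ib:/testfs /mnt/testfs
    # verify the modules were loaded
    lsmod | grep -E 'lustre|lnet'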

Chris Horn

From: Ravi Konila <ravibh...@gmail.com>
Reply-To: Ravi Konila <ravibh...@gmail.com>
Date: Friday, October 13, 2017 at 9:29 AM
To: Chris Horn <ho...@cray.com>, 'Lustre User Discussion Mailing List' 
<lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet not starting

Hi Chris

In continuation with my trailing email,
why does "service lustre status" show as partial?

Thanks for the info. Finally, I was able to install lustre servers (MGS and 
MDS) as of now. I used native IB drivers which came with RHEL 6.7
My question is: do I need to run modprobe lustre and modprobe lnet every time 
the lustre server reboots?
What I observed is that "service lustre start" does not come up without 
modprobe lustre.

Any suggestions?

Regards

Ravi Konila
Sr. Technical Consultant
From: Chris Horn
Sent: Friday, October 13, 2017 12:18 AM
To: Ravi Konila ; Parag Khuraswar ; 'Lustre User Discussion Mailing List'
Subject: Re: [lustre-discuss] Lnet not starting

The pre-built rpms are most likely compiled against the in-kernel IB drivers. 
If you’re using the MOFED drivers you’ll need to recompile Lustre. The 
instructions here may help you out http://wiki.lustre.org/Compiling_Lustre
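
The key step there is pointing configure at the MOFED sources instead of the 
in-kernel stack; a sketch (the ofa_kernel path is the usual MOFED install 
location, adjust to your system):

    cd lustre-release
    ./configure --with-linux=/usr/src/kernels/$(uname -r) \
                --with-o2ib=/usr/src/ofa_kernel/default
    make rpms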

Chris Horn

From: Ravi Konila <ravibh...@gmail.com>
Reply-To: Ravi Konila <ravibh...@gmail.com>
Date: Thursday, October 12, 2017 at 1:33 PM
To: Chris Horn <ho...@cray.com>, Parag Khuraswar <para...@citilindia.com>, 
'Lustre User Discussion Mailing List' <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet not starting

Hi

I am using pre-built rpms.

Regards

Ravi Konila
From: Chris Horn
Sent: Thursday, October 12, 2017 10:51 PM
To: Ravi Konila ; Parag Khuraswar ; 'Lustre User Discussion Mailing List'
Subject: Re: [lustre-discuss] Lnet not starting

Are you compiling Lustre yourself or using pre-built rpms?

Chris Horn

From: Ravi Konila <ravibh...@gmail.com>
Reply-To: Ravi Konila <ravibh...@gmail.com>
Date: Thursday, October 12, 2017 at 11:40 AM
To: Chris Horn <ho...@cray.com>, Parag Khuraswar <para...@citilindia.com>, 
'Lustre User Discussion Mailing List' <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet not starting

Hi Chris

I installed RHEL 6.7, MLNX_OFED_LINUX-3.4-1.0.0.0-rhel6.7-x86_64 and then 
Lustre 2.8 in my Lustre MDS/MGT/OSS servers.

My ib0 is working fine and I can ping other nodes.
my lustre.conf file has
options lnet networks=o2ib(ib0)

With this, if I run “service lnet start” it fails with the error
LNET configure error 22: Invalid argument

dmesg gives me the output below (I just captured the last line, but there are 
many lines with symbol errors and the like):

LNetError: 16770:0:(api-ni.c:1276:lnet_startup_lndni()) Can't load LND o2ib, 
module ko2iblnd, rc=256

If I specify tcp in lustre.conf, it works fine.

I have reinstalled Lustre and then the Mellanox OFED driver, but the problem is 
still the same: I am not able to bring InfiniBand up with Lustre LNet.

Regards

Ravi Konila
Sr. Technical Consultant



From: Chris Horn
Sent: Thursday, October 12, 2017 9:02 PM
To: Ravi Konila ; Parag Khuraswar ; 'Lustre User Discussion Mailing List'
Subject: Re: [lustre-discuss] Lnet not starting

dmesg output should provide more information about the “Invalid argument” error 
that you are seeing, but my guess would be that Lustre was compiled against a 
different IB stack than what you have installed.

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Ravi Konila <ravibh...@gmail.com>
Reply-To: Ravi Konila <ravibh...@gmail.com>
Date: Thursday, October 12, 2017 at 7:58 AM
To: Parag Khuraswar <para...@citilindia.com>, 'Lustre User Discussion Mailing 
List' <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet not starting

Hi Parag

Even I am facing the same issue with RHEL 6.7 and Lustre 2.8. Now I am also 
trying with RHEL 7.3 and Lustre 2.10.0.
I am planning to install Mellanox OFED driver with 3.4 stack. Looks like there 
is some problem with OFED 4.x stack with Lustre 2.10.0.
Let me try the same and update.
When I start “service lnet start” it gives LNET configure error 22: Invalid 
argument
but it works fine with tcp.

Regards

Ravi Konila
Sr. Technical Consultant
Maruti Suzuki India Ltd


From: Parag Khuraswar
Sent: Thursday, October 12, 2017 6:11 PM
To: 'Lustre User Discussion Mailing List'
Subject: [lustre-discuss] Lnet not starting

Hi,

I am installing Lustre 2.10.0 on RHEL 7.3.
IB is working fine but lnet is not coming up. Lustre service is running.

Re: [lustre-discuss] client eviction from oss on 2.8.0

2017-10-13 Thread Chris Horn
In my experience lock callback timer expirations are often symptoms of network 
problems. Clients are either unable to deliver the expected lock cancellation 
or are unable to perform I/O under the lock (which will extend the timer). Are 
there any error messages that indicate communication failures between the 
evicted client and the server hosting demo-OST0002?
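
A simple first pass for that check (a sketch; the grep patterns are 
illustrative, not exhaustive):

    # run on both the evicted client and the OSS, around the eviction time
    dmesg | grep -iE 'lustre|lnet' | grep -iE 'timeout|error|evict|dropped'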

Chris Horn

On 10/13/17, 1:44 PM, "lustre-discuss on behalf of John Casu" 
<lustre-discuss-boun...@lists.lustre.org on behalf of j...@chiraldynamics.com> 
wrote:

client, server = 2.8.0, connected via 40GbE
running IOR & trying to write large files (40TB/file)

I get the follow in /var/log/messages on my client
Oct 13 12:07:19 c3 kernel: Lustre: Evicted from demo-OST0002_UUID (at 
10.55.100.20@tcp) after server handle changed from 0x3e6cc8dc71d19edb to 
0x3e6cc8dc71d2130a
Oct 13 12:07:19 c3 kernel: LustreError: 167-0: 
demo-OST0002-osc-887f229fa800: This client was evicted by demo-OST0002; in 
progress operations using this service will fail.

and the following on the oss:
Oct 13 08:54:40 oss0 kernel: LustreError: 
0:0:(ldlm_lockd.c:342:waiting_locks_callback()) ### lock callback timer expired 
after 101s: evicting client at 10.55.100.31@tcp  ns: filter-demo-OST0002_UUID 
lock: 88044e6cfe00/0x3e6cc8dc71d21112 lrc: 3/0,0 mode: PW/PW res: 
[0x275a:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 
0->4194303) flags: 0x6400010020 nid: 10.55.100.31@tcp 
remote: 0xb696c0d49b95953c expref: 16214 pid: 109581 timeout: 8078643085 
lvb_type: 0

wondering why the lock callback timer might expire.
Only have 3 clients & pair of mds & pair of oss.

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] exec start error for lustre-2.10.1_13_g2ee62fb

2017-10-13 Thread Chris Horn
https://jira.hpdd.intel.com/browse/LU-10119

I’ll push a patch

Chris Horn


On 10/13/17, 10:18 AM, "Dilger, Andreas" <andreas.dil...@intel.com> wrote:

Could you please file a Jira ticket (and possibly a patch) to fix this, so 
it isn't forgotten. 

Cheers, Andreas

> On Oct 13, 2017, at 06:50, David Rackley <rack...@jlab.org> wrote:
> 
> That was it! Thanks for the help.
> 
> - Original Message -
> From: "Chris Horn" <ho...@cray.com>
> To: "David Rackley" <rack...@jlab.org>, lustre-discuss@lists.lustre.org
> Sent: Thursday, October 12, 2017 5:02:47 PM
> Subject: Re: [lustre-discuss] exec start error for 
lustre-2.10.1_13_g2ee62fb
> 
> Google suggests that this error message has been associated with a 
missing “hashpling” in some cases. The lustre_routes_config script has “# 
!/bin/bash”, and I wonder if that space before the “!” isn’t the culprit?
> 
> 
> 
> Just a guess. You might try to remove that space from the 
lustre_routes_config script and try to restart lnet with systemctl.
> 
> 
> 
> Chris Horn
> 
> 
> 
> On 10/12/17, 3:39 PM, "lustre-discuss on behalf of David Rackley" 
<lustre-discuss-boun...@lists.lustre.org on behalf of rack...@jlab.org> wrote:
> 
> 
> 
>Greetings,
> 
> 
> 
>I have built lustre-2.10.1_13_g2ee62fb on 3.10.0-693.2.2.el7.x86_64 
RHEL Workstation release 7.4 (Maipo).
> 
> 
> 
>After installation of 
kmod-lustre-client-2.10.1_13_g2ee62fb-1.el7.x86_64.rpm and 
lustre-client-2.10.1_13_g2ee62fb-1.el7.x86_64.rpm the lnet startup fails. 
> 
> 
> 
>The error reported is:
> 
> 
> 
>-- Unit lnet.service has begun starting up.
> 
>Oct 12 13:21:53  kernel: libcfs: loading out-of-tree module taints 
kernel.
> 
>Oct 12 13:21:53  kernel: libcfs: module verification failed: signature 
and/or required key missing - tainting kernel
> 
>Oct 12 13:21:53  kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 20, 
npartitions: 1
> 
>Oct 12 13:21:53  kernel: alg: No test for adler32 (adler32-zlib)
> 
>Oct 12 13:21:53  kernel: alg: No test for crc32 (crc32-table)
> 
>Oct 12 13:21:54  kernel: LNet: Using FMR for registration
> 
>Oct 12 13:21:54 lctl[135556]: LNET configured
> 
>Oct 12 13:21:54  kernel: LNet: Added LNI 172.17.1.92@o2ib [8/256/0/180]
> 
>Oct 12 13:21:54  systemd[135576]: Failed at step EXEC spawning 
/usr/sbin/lustre_routes_config: Exec format error
> 
>-- Subject: Process /usr/sbin/lustre_routes_config could not be 
executed
> 
>-- Defined-By: systemd
> 
>-- Support: 
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
> 
>-- 
> 
>-- The process /usr/sbin/lustre_routes_config could not be executed 
and failed.
> 
>-- 
> 
>-- The error number returned by this process is 8.
> 
>Oct 12 13:21:54  systemd[1]: lnet.service: main process exited, 
code=exited, status=203/EXEC
> 
>Oct 12 13:21:54  systemd[1]: Failed to start lnet management.
> 
>-- Subject: Unit lnet.service has failed
> 
>-- Defined-By: systemd
> 
>-- Support: 
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
> 
>-- 
> 
>-- Unit lnet.service has failed.
> 
>-- 
> 
>-- The result is failed.
> 
> 
> 
>Any ideas?
> 
>___
> 
>lustre-discuss mailing list
> 
>lustre-discuss@lists.lustre.org
> 
>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] exec start error for lustre-2.10.1_13_g2ee62fb

2017-10-12 Thread Chris Horn
Google suggests that this error message has been associated with a missing 
“hashpling” in some cases. The lustre_routes_config script has “# !/bin/bash”, 
and I wonder if that space before the “!” isn’t the culprit?

Just a guess. You might try to remove that space from the lustre_routes_config 
script and try to restart lnet with systemctl.
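
A sketch of that check and fix (path from the log above; back the script up 
first):

    head -1 /usr/sbin/lustre_routes_config      # shows "# !/bin/bash" if affected
    sed -i '1s|^# !/bin/bash|#!/bin/bash|' /usr/sbin/lustre_routes_config
    systemctl restart lnet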

Chris Horn

On 10/12/17, 3:39 PM, "lustre-discuss on behalf of David Rackley" 
<lustre-discuss-boun...@lists.lustre.org on behalf of rack...@jlab.org> wrote:

Greetings,

I have built lustre-2.10.1_13_g2ee62fb on 3.10.0-693.2.2.el7.x86_64 RHEL 
Workstation release 7.4 (Maipo).

After installation of 
kmod-lustre-client-2.10.1_13_g2ee62fb-1.el7.x86_64.rpm and 
lustre-client-2.10.1_13_g2ee62fb-1.el7.x86_64.rpm the lnet startup fails. 

The error reported is:

-- Unit lnet.service has begun starting up.
Oct 12 13:21:53  kernel: libcfs: loading out-of-tree module taints kernel.
Oct 12 13:21:53  kernel: libcfs: module verification failed: signature 
and/or required key missing - tainting kernel
Oct 12 13:21:53  kernel: LNet: HW NUMA nodes: 1, HW CPU cores: 20, 
npartitions: 1
Oct 12 13:21:53  kernel: alg: No test for adler32 (adler32-zlib)
Oct 12 13:21:53  kernel: alg: No test for crc32 (crc32-table)
Oct 12 13:21:54  kernel: LNet: Using FMR for registration
Oct 12 13:21:54 lctl[135556]: LNET configured
Oct 12 13:21:54  kernel: LNet: Added LNI 172.17.1.92@o2ib [8/256/0/180]
Oct 12 13:21:54  systemd[135576]: Failed at step EXEC spawning 
/usr/sbin/lustre_routes_config: Exec format error
-- Subject: Process /usr/sbin/lustre_routes_config could not be executed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- The process /usr/sbin/lustre_routes_config could not be executed and 
failed.
-- 
-- The error number returned by this process is 8.
Oct 12 13:21:54  systemd[1]: lnet.service: main process exited, 
code=exited, status=203/EXEC
Oct 12 13:21:54  systemd[1]: Failed to start lnet management.
-- Subject: Unit lnet.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit lnet.service has failed.
-- 
-- The result is failed.

Any ideas?
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lnet not starting

2017-10-12 Thread Chris Horn
The pre-built rpms are most likely compiled against the in-kernel IB drivers. 
If you’re using the MOFED drivers you’ll need to recompile Lustre. The 
instructions here may help you out http://wiki.lustre.org/Compiling_Lustre

Chris Horn

From: Ravi Konila <ravibh...@gmail.com>
Reply-To: Ravi Konila <ravibh...@gmail.com>
Date: Thursday, October 12, 2017 at 1:33 PM
To: Chris Horn <ho...@cray.com>, Parag Khuraswar <para...@citilindia.com>, 
'Lustre User Discussion Mailing List' <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet not starting

Hi

I am using pre-built rpms.

Regards

Ravi Konila
From: Chris Horn
Sent: Thursday, October 12, 2017 10:51 PM
To: Ravi Konila ; Parag Khuraswar ; 'Lustre User Discussion Mailing List'
Subject: Re: [lustre-discuss] Lnet not starting

Are you compiling Lustre yourself or using pre-built rpms?

Chris Horn

From: Ravi Konila <ravibh...@gmail.com>
Reply-To: Ravi Konila <ravibh...@gmail.com>
Date: Thursday, October 12, 2017 at 11:40 AM
To: Chris Horn <ho...@cray.com>, Parag Khuraswar <para...@citilindia.com>, 
'Lustre User Discussion Mailing List' <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet not starting

Hi Chris

I installed RHEL 6.7, MLNX_OFED_LINUX-3.4-1.0.0.0-rhel6.7-x86_64 and then 
Lustre 2.8 in my Lustre MDS/MGT/OSS servers.

My ib0 is working fine and I can ping other nodes.
my lustre.conf file has
options lnet networks=o2ib(ib0)

With this, if I run “service lnet start” it fails with the error
LNET configure error 22: Invalid argument

dmesg gives me the output below (I just captured the last line, but there are 
many lines with symbol errors and the like):

LNetError: 16770:0:(api-ni.c:1276:lnet_startup_lndni()) Can't load LND o2ib, 
module ko2iblnd, rc=256

If I specify tcp in lustre.conf, it works fine.

I have reinstalled Lustre and then the Mellanox OFED driver, but the problem is 
still the same: I am not able to bring InfiniBand up with Lustre LNet.

Regards

Ravi Konila
Sr. Technical Consultant



From: Chris Horn
Sent: Thursday, October 12, 2017 9:02 PM
To: Ravi Konila ; Parag Khuraswar ; 'Lustre User Discussion Mailing List'
Subject: Re: [lustre-discuss] Lnet not starting

dmesg output should provide more information about the “Invalid argument” error 
that you are seeing, but my guess would be that Lustre was compiled against a 
different IB stack than what you have installed.

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Ravi Konila <ravibh...@gmail.com>
Reply-To: Ravi Konila <ravibh...@gmail.com>
Date: Thursday, October 12, 2017 at 7:58 AM
To: Parag Khuraswar <para...@citilindia.com>, 'Lustre User Discussion Mailing 
List' <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet not starting

Hi Parag

Even I am facing the same issue with RHEL 6.7 and Lustre 2.8. Now I am also 
trying with RHEL 7.3 and Lustre 2.10.0.
I am planning to install Mellanox OFED driver with 3.4 stack. Looks like there 
is some problem with OFED 4.x stack with Lustre 2.10.0.
Let me try the same and update.
When I start “service lnet start” it gives LNET configure error 22: Invalid 
argument
but it works fine with tcp.

Regards

Ravi Konila
Sr. Technical Consultant
Maruti Suzuki India Ltd


From: Parag Khuraswar
Sent: Thursday, October 12, 2017 6:11 PM
To: 'Lustre User Discussion Mailing List'
Subject: [lustre-discuss] Lnet not starting

Hi,

I am installing Lustre 2.10.0 on RHEL 7.3.
IB is working fine but lnet is not coming up. Lustre service is running.
Ibstat also show link up and active.
Lustre and lnet modules are also loaded.

Regards,
Parag



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lnet not starting

2017-10-12 Thread Chris Horn
Are you compiling Lustre yourself or using pre-built rpms?

Chris Horn

From: Ravi Konila <ravibh...@gmail.com>
Reply-To: Ravi Konila <ravibh...@gmail.com>
Date: Thursday, October 12, 2017 at 11:40 AM
To: Chris Horn <ho...@cray.com>, Parag Khuraswar <para...@citilindia.com>, 
'Lustre User Discussion Mailing List' <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet not starting

Hi Chris

I installed RHEL 6.7, MLNX_OFED_LINUX-3.4-1.0.0.0-rhel6.7-x86_64 and then 
Lustre 2.8 in my Lustre MDS/MGT/OSS servers.

My ib0 is working fine and I can ping other nodes.
my lustre.conf file has
options lnet networks=o2ib(ib0)

With this, if I run “service lnet start” it fails with the error
LNET configure error 22: Invalid argument

dmesg gives me the output below (I just captured the last line, but there are 
many lines with symbol errors and the like):

LNetError: 16770:0:(api-ni.c:1276:lnet_startup_lndni()) Can't load LND o2ib, 
module ko2iblnd, rc=256

If I specify tcp in lustre.conf, it works fine.

I have reinstalled Lustre and then the Mellanox OFED driver, but the problem is 
still the same: I am not able to bring InfiniBand up with Lustre LNet.

Regards

Ravi Konila
Sr. Technical Consultant



From: Chris Horn
Sent: Thursday, October 12, 2017 9:02 PM
To: Ravi Konila ; Parag Khuraswar ; 'Lustre User Discussion Mailing List'
Subject: Re: [lustre-discuss] Lnet not starting

dmesg output should provide more information about the “Invalid argument” error 
that you are seeing, but my guess would be that Lustre was compiled against a 
different IB stack than what you have installed.

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Ravi Konila <ravibh...@gmail.com>
Reply-To: Ravi Konila <ravibh...@gmail.com>
Date: Thursday, October 12, 2017 at 7:58 AM
To: Parag Khuraswar <para...@citilindia.com>, 'Lustre User Discussion Mailing 
List' <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet not starting

Hi Parag

Even I am facing the same issue with RHEL 6.7 and Lustre 2.8. Now I am also 
trying with RHEL 7.3 and Lustre 2.10.0.
I am planning to install Mellanox OFED driver with 3.4 stack. Looks like there 
is some problem with OFED 4.x stack with Lustre 2.10.0.
Let me try the same and update.
When I start “service lnet start” it gives LNET configure error 22: Invalid 
argument
but it works fine with tcp.

Regards

Ravi Konila
Sr. Technical Consultant
Maruti Suzuki India Ltd


From: Parag Khuraswar
Sent: Thursday, October 12, 2017 6:11 PM
To: 'Lustre User Discussion Mailing List'
Subject: [lustre-discuss] Lnet not starting

Hi,

I am installing Lustre 2.10.0 on RHEL 7.3.
IB is working fine but lnet is not coming up. Lustre service is running.
Ibstat also show link up and active.
Lustre and lnet modules are also loaded.

Regards,
Parag



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lnet not starting

2017-10-12 Thread Chris Horn
dmesg output should provide more information about the “Invalid argument” error 
that you are seeing, but my guess would be that Lustre was compiled against a 
different IB stack than what you have installed.

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Ravi Konila <ravibh...@gmail.com>
Reply-To: Ravi Konila <ravibh...@gmail.com>
Date: Thursday, October 12, 2017 at 7:58 AM
To: Parag Khuraswar <para...@citilindia.com>, 'Lustre User Discussion Mailing 
List' <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet not starting

Hi Parag

Even I am facing the same issue with RHEL 6.7 and Lustre 2.8. Now I am also 
trying with RHEL 7.3 and Lustre 2.10.0.
I am planning to install Mellanox OFED driver with 3.4 stack. Looks like there 
is some problem with OFED 4.x stack with Lustre 2.10.0.
Let me try the same and update.
When I start “service lnet start” it gives LNET configure error 22: Invalid 
argument
but it works fine with tcp.

Regards

Ravi Konila
Sr. Technical Consultant
Maruti Suzuki India Ltd


From: Parag Khuraswar
Sent: Thursday, October 12, 2017 6:11 PM
To: 'Lustre User Discussion Mailing List'
Subject: [lustre-discuss] Lnet not starting

Hi,

I am installing Lustre 2.10.0 on RHEL 7.3.
IB is working fine but lnet is not coming up. Lustre service is running.
Ibstat also show link up and active.
Lustre and lnet modules are also loaded.

Regards,
Parag



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lnet not starting

2017-10-12 Thread Chris Horn
Running MOFED 4.0 with Lustre is not recommended; only 4.1 or higher.

Chris Horn

On 10/12/17, 8:57 AM, "lustre-discuss on behalf of Peter Kjellström" 
<lustre-discuss-boun...@lists.lustre.org on behalf of c...@nsc.liu.se> wrote:

On Thu, 12 Oct 2017 18:27:34 +0530
"Ravi Konila" <ravibh...@gmail.com> wrote:

> Hi Parag
> 
> Even I am facing the same issue with RHEL 6.7 and Lustre 2.8. Now I
> am also trying with RHEL 7.3 and Lustre 2.10.0. I am planning to
> install Mellanox OFED driver with 3.4 stack. Looks like there is some
> problem with OFED 4.x stack with Lustre 2.10.0.

Note that lustre 2.10.1 from about a week ago added support for MOFED-4.

/Peter K


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.0 multi rail configuration

2017-08-28 Thread Chris Horn
Dynamic LNet configuration (DLC) must be used to configure multi-rail. Lustre 
2.10 contains an “lnet.conf” file that has a sample multi-rail configuration. 
I’ve copied it below for your convenience.

> # lnet.conf - configuration file for lnet routes to be imported by lnetctl
> #
> # This configuration file is formatted as YAML and can be imported
> # by lnetctl.
> #
> # net:
> #     - net type: o2ib1
> #       local NI(s):
> #         - nid: 172.16.1.4@o2ib1
> #           interfaces:
> #               0: ib0
> #           tunables:
> #               peer_timeout: 180
> #               peer_credits: 128
> #               peer_buffer_credits: 0
> #               credits: 1024
> #           lnd tunables:
> #               peercredits_hiw: 64
> #               map_on_demand: 32
> #               concurrent_sends: 256
> #               fmr_pool_size: 2048
> #               fmr_flush_trigger: 512
> #               fmr_cache: 1
> #           CPT: "[0,1]"
> #         - nid: 172.16.2.4@o2ib1
> #           interfaces:
> #               0: ib1
> #           tunables:
> #               peer_timeout: 180
> #               peer_credits: 128
> #               peer_buffer_credits: 0
> #               credits: 1024
> #           lnd tunables:
> #               peercredits_hiw: 64
> #               map_on_demand: 32
> #               concurrent_sends: 256
> #               fmr_pool_size: 2048
> #               fmr_flush_trigger: 512
> #               fmr_cache: 1
> #           CPT: "[0,1]"
> # route:
> #     - net: o2ib
> #       gateway: 172.16.1.1@o2ib1
> #       hop: -1
> #       priority: 0
> # peer:
> #     - primary nid: 192.168.1.2@o2ib
> #       Multi-Rail: True
> #       peer ni:
> #         - nid: 192.168.1.2@o2ib
> #         - nid: 192.168.2.2@o2ib
> #     - primary nid: 172.16.1.1@o2ib1
> #       Multi-Rail: True
> #       peer ni:
> #         - nid: 172.16.1.1@o2ib1
> #         - nid: 172.16.2.1@o2ib1
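
If it helps, here is a minimal sketch of loading such a file with lnetctl; the 
/etc/lnet.conf path and the NIDs inside it are assumptions, not required 
values:

# Apply a YAML LNet configuration with DLC (sketch; path is an assumption):
modprobe lnet
lnetctl lnet configure
lnetctl import < /etc/lnet.conf

# Verify that the local NIs and the multi-rail peers were created:
lnetctl net show
lnetctl peer show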

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>
Date: Monday, August 28, 2017 at 5:49 PM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Lustre 2.10.0 multi rail configuration

Hello,
I am trying to deploy a multi-rail configuration on Lustre 2.10.0 on RHEL 7.3.
My goal is to use both IB interfaces on the OSSes and the client.
I have one client, two OSSes, and one MDS.
My LNet networks are labelled o2ib5 and tcp5, just for my own convenience. What 
I did was modify lustre.conf:

options lnet networks=o2ib5(ib0,ib1),tcp5(enp1s0f0)

lctl list_nids on both the OSSes and the client shows both local IB 
interfaces:

172.21.52.86@o2ib5
172.21.52.118@o2ib5
172.21.42.211@tcp5

Anyway, I can't run an LNet selftest using the new NIDs; it fails, and they 
seem to be unused.
Any hint on the multi-rail configuration needed?
What I'd like to do is use both InfiniBand cards (ib0, ib1) on my two OSSes and 
on my client to get more bandwidth, since with only one InfiniBand card I 
cannot saturate the disk performance.
thank you


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre poor performance

2017-08-21 Thread Chris Horn
The ko2iblnd-opa settings are tuned specifically for Intel OmniPath. Take a 
look at the /usr/sbin/ko2iblnd-probe script to see how OPA hardware is detected 
and the “ko2iblnd-opa” settings get used.
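
If you want to see how those alias/install lines resolve on a given node, 
something like this works (a sketch):

# Dump the effective modprobe configuration for the o2iblnd module (sketch):
modprobe -c | grep -i ko2iblnd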

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>
Date: Saturday, August 19, 2017 at 5:00 PM
To: Arman Khalatyan <arm2...@gmail.com>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre poor performance

I ran my LNet self test again, and this time, adding --concurrency=16, I can 
use all of the IB bandwidth (3.5GB/sec).
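
For reference, here is roughly where that option goes (a sketch; the batch and 
group names are placeholders):

# Concurrency is a per-test option supplied when the test is added to a
# batch (sketch; batch and group names are placeholders):
lst add_test --batch bulk_rw --concurrency 16 \
    --from readers --to servers brw read size=1M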

The only thing I do not understand is why ko2iblnd.conf was not loaded 
properly, and why I had to remove the alias in the config file to get the 
proper peer_credits settings loaded.

thanks to everyone for helping

Riccardo

On 8/19/17 8:54 AM, Riccardo Veraldi wrote:

I found out that ko2iblnd is not getting its settings from 
/etc/modprobe.d/ko2iblnd.conf:
alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 
concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

install ko2iblnd /usr/sbin/ko2iblnd-probe

but if I modify ko2iblnd.conf like this, then settings are loaded:

options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 
concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

install ko2iblnd /usr/sbin/ko2iblnd-probe
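
One quick way to confirm which values the running module actually picked up is 
to read them back from sysfs (a sketch; the available parameters depend on the 
ko2iblnd build):

# Read back the live module parameters after loading (sketch):
for p in peer_credits peer_credits_hiw credits concurrent_sends map_on_demand; do
    echo -n "$p = "; cat /sys/module/ko2iblnd/parameters/$p
done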

LNet tests show better behaviour, but I would still expect more than this.
Is it possible to tune parameters in /etc/modprobe.d/ko2iblnd.conf so that the 
Mellanox ConnectX-3 will work more efficiently?

[LNet Rates of servers]
[R] Avg: 2286 RPC/s Min: 0RPC/s Max: 4572 RPC/s
[W] Avg: 3322 RPC/s Min: 0RPC/s Max: 6643 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 625.23   MiB/s Min: 0.00 MiB/s Max: 1250.46  MiB/s
[W] Avg: 1035.85  MiB/s Min: 0.00 MiB/s Max: 2071.69  MiB/s
[LNet Rates of servers]
[R] Avg: 2286 RPC/s Min: 1RPC/s Max: 4571 RPC/s
[W] Avg: 3321 RPC/s Min: 1RPC/s Max: 6641 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 625.55   MiB/s Min: 0.00 MiB/s Max: 1251.11  MiB/s
[W] Avg: 1035.05  MiB/s Min: 0.00 MiB/s Max: 2070.11  MiB/s
[LNet Rates of servers]
[R] Avg: 2291 RPC/s Min: 0RPC/s Max: 4581 RPC/s
[W] Avg: 3329 RPC/s Min: 0RPC/s Max: 6657 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 626.55   MiB/s Min: 0.00 MiB/s Max: 1253.11  MiB/s
[W] Avg: 1038.05  MiB/s Min: 0.00 MiB/s Max: 2076.11  MiB/s
session is ended
./lnet_test.sh: line 17: 23394 Terminated  lst stat servers




On 8/19/17 4:20 AM, Arman Khalatyan wrote:
Just a minor comment: you should push up the performance of your nodes; they 
are not running at their maximum CPU frequencies, so all tests might be 
inconsistent. To get the most out of IB, run the following:
tuned-adm profile latency-performance
For more options use:
tuned-adm list

It will be interesting to see the difference.

On 19.08.2017 at 3:57 AM, "Riccardo Veraldi" 
<riccardo.vera...@cnaf.infn.it> wrote:
Hello Keith and Dennis, these are the tests I ran.

  *   obdfilter-survey shows that I can saturate disk performance; the 
NVMe/ZFS backend is performing very well, and it is faster than my InfiniBand 
network

pool  alloc   free   read  write   read  write
  -  -  -  -  -  -
drpffb-ost01  3.31T  3.19T  3  35.7K  16.0K  7.03G
  raidz1  3.31T  3.19T  3  35.7K  16.0K  7.03G
nvme0n1   -  -  1  5.95K  7.99K  1.17G
nvme1n1   -  -  0  6.01K  0  1.18G
nvme2n1   -  -  0  5.93K  0  1.17G
nvme3n1   -  -  0  5.88K  0  1.16G
nvme4n1   -  -  1  5.95K  7.99K  1.17G
nvme5n1   -  -  0  5.96K  0  1.17G
  -  -  -  -  -  -
this are the tests results

Fri Aug 18 16:54:48 PDT 2017 Obdfilter-survey for case=disk from drp-tst-ffb01
ost  1 sz 10485760K rsz 1024K obj1 thr1 write 7633.08 SHORT 
rewrite 7558.78 SHORT read 3205.24 [3213.70, 3226.78]
ost  1 sz 10485760K rsz 1024K obj1 thr2 write 7996.89 SHORT 
rewrite 7903.42 SHORT read 5264.70 SHORT
ost  1 sz 10485760K rsz 1024K obj2 thr2 write 7718.94 SHORT 
rewrite 7977.84 SHORT read 5802.17 SHORT

  *   LNet self test, and here I see the problems. For reference, 
172.21.52.[83,84] are the two OSSes and 172.21.52.86 is the reader/writer. 
Here is the script that I ran:

#!/bin/bash
export LST_SESSION=$$
lst new_session read_write
lst add_group servers 172.21.52.[83,84]@o2ib5
lst add_group readers 172

Re: [lustre-discuss] Lustre Manual Bug Report: 25.2.2. Enabling and Tuning Root Squash

2017-06-06 Thread Chris Horn
The LUDOC project in Intel’s Jira is the correct place to report these issues: 
https://wiki.hpdd.intel.com/display/PUB/How+to+file+a+LUDOC+bug

These are the instructions for submitting fixes to the Lustre Manual if you’re 
so inclined: 
https://wiki.hpdd.intel.com/display/PUB/Making+changes+to+the+Lustre+Manual+source

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
"Gibbins, Faye" <faye.gibb...@cirrus.com>
Date: Tuesday, June 6, 2017 at 6:41 AM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Lustre Manual Bug Report: 25.2.2. Enabling and Tuning 
Root Squash

Hi,

I think I’ve found a bug in the manual: 
https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.50438221_48757
 I’ve included it below in case this is the correct place to report it. If not 
please let me know and I’ll submit it elsewhere.

In section 25.2.2. Enabling and Tuning Root Squash, the bit that says:

Snip---
Root squash parameters can also be changed with the lctl conf_param command. 
For example:

mgs# lctl conf_param testfs.mdt.root_squash="1000:101"
mgs# lctl conf_param testfs.mdt.nosquash_nids="*@tcp"
Snip---

Should be IMHO:

Snip---
Root squash parameters can also be changed with the lctl conf_param command. 
For example:

mgs# lctl conf_param mdt.testfs-MDT.root_squash="1000:101"
mgs# lctl conf_param mdt.testfs-MDT.nosquash_nids="*@tcp"
Snip---

Here is an example of me using my version compared to the documentation version 
using Lustre 2.8 on RHEL 7:

Snip---
0 edi-vf-1-5:audiodb-MDT# lctl get_param mdt.audiodb-MDT.root_squash
mdt.audiodb-MDT.root_squash=0:0
0 edi-vf-1-5:audiodb-MDT# lctl set_param 
mdt.audiodb-MDT.root_squash="0:1"
mdt.audiodb-MDT.root_squash=0:1
0 edi-vf-1-5:audiodb-MDT# lctl get_param mdt.audiodb-MDT.root_squash
mdt.audiodb-MDT.root_squash=0:1
0 edi-vf-1-5:audiodb-MDT# lctl set_param 
mdt.audiodb-MDT.root_squash="0:0"
mdt.audiodb-MDT.root_squash=0:0
# Using the documented version.
0 edi-vf-1-5:audiodb-MDT# lctl set_param audiodb.mdt.root_squash="0:1"
error: set_param: param_path 'audiodb/mdt/root_squash': No such file or 
directory
0 edi-vf-1-5:audiodb-MDT#
Snip---

Yours
Faye Gibbins
Snr SysAdmin, Unix Lead Architect
Software Systems and Cloud Services
Cirrus Logic | cirrus.com  | +44 (0) 131 272 7398

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] rpm names?

2017-04-21 Thread Chris Horn
That work was done in LU-5614 and some related tickets. I think the one for 
removing the kernel version from the package name was LU-7643.

Chris Horn

On 4/21/17, 12:58 PM, "lustre-discuss on behalf of Michael Di Domenico" 
<lustre-discuss-boun...@lists.lustre.org on behalf of mdidomeni...@gmail.com> 
wrote:

i'm confused by the naming change between 2.8 and 2.9; it seems the
kernel version has been dropped from the rpm filename.

but the files inside the rpm are still compiled for a specific kernel,
so there's a good chance that kmod-lustre-client will get installed on
a machine whose kernel does not match the modules directory inside
the rpm

this also means i can't rpmbuild --rebuild the lustre-client src rpm
for several different kernels and keep all the files in the same
directory.

kmod-lustre-client-2.9.0-1.el7.x86_64.rpm

lustre-client-modules-2.8.0-3.10.0_327.3.1.el7.x86_64.x86_64.rpm
lustre-client-2.8.0-3.10.0_327.3.1.el7.x86_64.x86_64.rpm

i looked back through the mailing list, but i don't see any mention of
this change. (i'm not on the lustre developers list)


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Multi-cluster (multi-rail) setup

2015-06-12 Thread Chris Horn
Hello and welcome to Lustre :)

 3.- configure /etc/modprobe.d/lustre.conf on each node of each cluster
 like this:
 
 Nodes in Cluster A:  options lnet networks=o2ib0(ib0)
 
 Nodes in Cluster B:  options lnet networks=o2ib1(ib1)
 
 Nodes in Cluster C:  options lnet networks=o2ib2(ib2)
 
 Nodes in Cluster D:  options lnet networks=o2ib3(ib3)”

The “(ibX)” portion of that string should correspond to the local IB interface 
that the clients in those clusters are actually using, i.e., whichever port on 
the clients is active, not the port that is used by servers on that LNet. My 
guess is that the clients have a single IB HCA with a cable plugged into port 
0, so what you probably want is:

Nodes in Cluster A:  options lnet networks=o2ib0(ib0)

Nodes in Cluster B:  options lnet networks=o2ib1(ib0)

Nodes in Cluster C:  options lnet networks=o2ib2(ib0)

Nodes in Cluster D:  options lnet networks=o2ib3(ib0)

Again, that’s just a guess at how these things are typically configured. You’ll 
want to check whether that is actually the case for your clusters.
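
One quick way to check which port is actually active on a client (a sketch; 
the output format varies a bit across OFED releases):

# Show the state of each local IB port (sketch):
ibstat | egrep "CA '|Port |State|Rate"

# And confirm which IPoIB interfaces actually carry addresses:
ip -o addr show | grep ' ib'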

Chris Horn

 On Jun 12, 2015, at 2:37 AM, Thrash Er mingorrubi...@gmail.com wrote:
 
 New to Lustre O:)
 
 I have to install and configure a Lustre storage for 4 small clusters
 (4 different departments). Each cluster has its own IB QDR
 interconnect for MPI (and now Lustre) and its own 1 GigE management
 network. IB networks would be something like:
 Cluster A  192.168.1.0  o2ib0(ib0)
 Cluster B  192.168.2.0  o2ib1(ib1)
 Cluster C  192.168.3.0  o2ib2(ib2)
 Cluster D  192.168.4.0  o2ib3(ib3)
 
 I've gone through the Lustre Operations Manual 2.x and, from what I
 understood, I would have to:
 
 1.- add 4 IB ports to each OSS and MDS/MGT and cable them like this:
 IB Port 0 - cluster A
 IB Port 1 - cluster B
 IB Port 2 - cluster C
 IB Port 3 - cluster D
 
 2.- configure /etc/modprobe.d/lustre.conf on the OSS and MDS like this:
 
 options lnet networks=o2ib0(ib0),o2ib1(ib1),o2ib2(ib2),o2ib3(ib3)
 
 3.- configure /etc/modprobe.d/lustre.conf on each node of each cluster
 like this:
 
 Nodes in Cluster A:  options lnet networks=o2ib0(ib0)
 
 Nodes in Cluster B:  options lnet networks=o2ib1(ib1)
 
 Nodes in Cluster C:  options lnet networks=o2ib2(ib2)
 
 Nodes in Cluster D:  options lnet networks=o2ib3(ib3)
 
 
 So, questions:
   1.- Are my assumptions correct?
   2.- No need for LNET routers, right?
   3.- Am I missing something?
 
 Thanks !!

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [Lustre-discuss] LustreError codes -114 and -16 (ldlm_lib.c:1919:target_send_reply_msg())

2012-02-15 Thread Chris Horn
errno 16 is EBUSY (device or resource busy) and errno 114 is EALREADY 
(Operation already in progress).
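
For reference, the numbers can be mapped to names straight from the kernel 
headers (a sketch; the header paths can vary by distro):

# EBUSY (16) lives in errno-base.h and EALREADY (114) in errno.h (sketch):
grep -w 16  /usr/include/asm-generic/errno-base.h
grep -w 114 /usr/include/asm-generic/errno.h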

Chris Horn

On Feb 15, 2012, at 10:52 AM, Marina Cacciagrano wrote:

Hello,
On all the nodes of a Lustre 1.8.2 filesystem, I often see messages similar to 
the following in /var/log/syslog:
LustreError: 8862:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing 
error (-114)  req@8103f97dc850 x1393780295030087/t0 
o250-bfd79683-ce51-1e18-7f40-632c3a616b01@NET_0x2ac1054d2_UUID:0/0 lens 
368/264 e 0 to 0 dl 1329235090 ref 1 fl Interpret:/0/0 rc -114/0
and
LustreError: 8963:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing 
error (-16)  req@81024b636400 x1393871458142225/t0 
o38-e3e2f978-3cc1-c6f9-17cb-9ac846be7fae@NET_0x2ac1138c2_UUID:0/0 lens 
368/264 e 0 to 0 dl 132937 ref 1 fl Interpret:/0/0 rc -16/0

I cannot find the meaning of error codes -114 and -16.
Can anybody advise on what generates those errors?

A quick description of the configuration:
the lustre version is 1.8.2.
the system is made up by one MDS host and seven OSS hosts.
lnet is over 10Ge.


Regards,
marina


Framestore
9 Noel Street London W1F 8GH
[T] +44 (0)20 7208 2600  [F] +44 (0)20 7208 2626

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] LustreError codes -114 and -16 (ldlm_lib.c:1919:target_send_reply_msg())

2012-02-15 Thread Chris Horn
Hard to say what's going on without additional context. The first message 
relates to an MGS_CONNECT rpc (o250), and the second relates to an 
MDS_CONNECT rpc (o38). I would suspect network issues.
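
If network issues are suspected, a basic LNet reachability check from the node 
logging the errors is a reasonable first step (a sketch; the NID is a 
placeholder):

# Ping the MGS/MDS NID over LNet rather than over IP (sketch; NID is a
# placeholder):
lctl ping 192.168.0.10@tcp

# List local Lustre devices and their setup state:
lctl dl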

Chris Horn

On Feb 15, 2012, at 12:46 PM, Marina Cacciagrano wrote:

Thanks!
Maybe  that means that the drives are a bit too slow to respond to the 
requests...
Can that be related to a problem with lnet as well?

marina


Framestore
9 Noel Street London W1F 8GH
[T] +44 (0)20 7208 2600  [F] +44 (0)20 7208 2626

- Original Message -
From: Chris Horn <ho...@cray.com>
To: Marina Cacciagrano <marina.cacciagr...@framestore.com>
Cc: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Sent: Wednesday, 15 February, 2012 5:21:33 PM
Subject: Re: [Lustre-discuss] LustreError codes -114 and -16 
(ldlm_lib.c:1919:target_send_reply_msg())

errno 16 is EBUSY (device or resource busy) and errno 114 is EALREADY 
(Operation already in progress).

Chris Horn

On Feb 15, 2012, at 10:52 AM, Marina Cacciagrano wrote:

Hello,
On all the nodes of a Lustre 1.8.2 filesystem, I often see messages similar to 
the following in /var/log/syslog:
LustreError: 8862:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing 
error (-114)  req@8103f97dc850 x1393780295030087/t0 
o250-bfd79683-ce51-1e18-7f40-632c3a616b01@NET_0x2ac1054d2_UUID:0/0 lens 
368/264 e 0 to 0 dl 1329235090 ref 1 fl Interpret:/0/0 rc -114/0
and
LustreError: 8963:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing 
error (-16)  req@81024b636400 x1393871458142225/t0 
o38-e3e2f978-3cc1-c6f9-17cb-9ac846be7fae@NET_0x2ac1138c2_UUID:0/0 lens 
368/264 e 0 to 0 dl 132937 ref 1 fl Interpret:/0/0 rc -16/0

I cannot find the meaning of error codes -114 and -16.
Can anybody advise on what generates those errors?

A quick description of the configuration:
the lustre version is 1.8.2.
the system is made up by one MDS host and seven OSS hosts.
lnet is over 10Ge.


Regards,
marina


Framestore
9 Noel Street London W1F 8GH
[T] +44 (0)20 7208 2600  [F] +44 (0)20 7208 2626



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Fwd: Lustre performance issue (obdfilter_survey

2011-07-06 Thread Chris Horn
FYI, there is some work being done to clean up obdfilter-survey; see 
https://bugzilla.lustre.org/show_bug.cgi?id=24490
If this is a script issue, you might try the patch from that bug and see 
whether you can still reproduce it.
Chris Horn

On Jul 6, 2011, at 3:37 PM, Cliff White wrote:

The case=network part of obdfilter_survey has really been replaced by 
lnet_selftest; I don't think it has been maintained in a while.

It would be best to repeat the network-only test with lnet_selftest, as this is 
likely an issue with the script.
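
For anyone who hasn't used it, a minimal lnet_selftest session looks roughly 
like this (a sketch; the NIDs are placeholders, and the lnet_selftest module 
must be loaded on every node involved):

#!/bin/bash
# Minimal lnet_selftest bulk-read session (sketch; NIDs are placeholders).
export LST_SESSION=$$
lst new_session net_test
lst add_group servers 192.168.1.10@o2ib
lst add_group clients 192.168.1.20@o2ib
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw read size=1M
lst run bulk
lst stat servers &          # sample RPC rates and bandwidth while it runs
STAT_PID=$!
sleep 30
kill $STAT_PID
lst end_session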
cliffw

On Wed, Jul 6, 2011 at 1:04 PM, lior amar 
<lioror...@gmail.com> wrote:
Hi,

I am installing a Lustre system and I wanted to measure the OSS
performance.
I used the obdfilter_survey and got very low performance for low
thread numbers when using the case=network option


System Configuration:
* Lustre 1.8.6-wc (compiled from the whamcloud git)
* Centos 5.6
* Infiniband (mellanox cards) open ib from centos 5.6
* OSS - 2 quad core  E5620 CPUS
* OSS - memory 48GB
* LSI 2965 raid card with 18 disks in raid 6 (16 data + 2). Raw
performance is good both when testing the block device and when testing over a 
file system with Bonnie++

* OSS uses ext4 and mkfs parameters were set to reflect the stripe
size .. -E stride =...

The performance tests I did:


1) obdfilter_survey case=disk -
    OSS performance is OK (similar to raw disk performance) -
    with 1 thread and one object I get 966MB/sec (see the invocation sketch 
after item 5 below)

2) obdfilter_survey case=network -
OSS performance is bad for low thread numbers and gets better as
the number of threads increases.
With 1 thread and one object I get 88MB/sec

3) obdfilter_survey case=netdisk -- Same as network case

4) When running ost_survey I am also getting low performance:
   Read = 156 MB/sec Write = ~350MB/sec

5) Running lnet_selftest I get much higher numbers
 Numbers obtained with concurrency = 1

 [LNet Rates of servers]
 [R] Avg: 3556 RPC/s Min: 3556 RPC/s Max: 3556 RPC/s
 [W] Avg: 4742 RPC/s Min: 4742 RPC/s Max: 4742 RPC/s
 [LNet Bandwidth of servers]
 [R] Avg: 1185.72  MB/s  Min: 1185.72  MB/s  Max: 1185.72  MB/s
 [W] Avg: 1185.72  MB/s  Min: 1185.72  MB/s  Max: 1185.72  MB/s
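
For reference, runs like these are driven by environment variables passed to 
the script (a sketch; the target name, size, and script path are assumptions 
for illustration):

# Sketch of a case=disk obdfilter-survey run (target, size, and path are
# placeholders; adjust for your system):
targets="lustre-OST0000" size=16384 \
    nobjlo=1 nobjhi=16 thrlo=1 thrhi=16 \
    case=disk sh /usr/lib64/lustre/tests/obdfilter-survey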




Any ideas why a single thread over the network obtains 88MB/sec while the same 
test run locally obtained 966MB/sec??

What else should I test/read/try ??

10x

Below are the actual numbers:

= obdfilter_survey case = disk ==
Wed Jul  6 13:24:57 IDT 2011 Obdfilter-survey for case=disk from oss1
ost  1 sz 16777216K rsz 1024K obj1 thr1 write  966.90
[ 644.40,1030.02] rewrite 1286.23 [1300.78,1315.77] read
8474.33 SHORT
ost  1 sz 16777216K rsz 1024K obj1 thr2 write 1577.95
[1533.57,1681.43] rewrite 1548.29 [1244.83,1718.42] read
11003.26 SHORT
ost  1 sz 16777216K rsz 1024K obj1 thr4 write 1465.68
[1354.73,1600.50] rewrite 1484.98 [1271.54,1584.52] read
16464.13 SHORT
ost  1 sz 16777216K rsz 1024K obj1 thr8 write 1267.39
[ 797.25,1476.48] rewrite 1350.28 [1283.80,1387.70] read
15353.69 SHORT
ost  1 sz 16777216K rsz 1024K obj1 thr   16 write 1295.35
[1266.82,1408.70] rewrite 1332.59 [1315.61,1429.66] read
15001.67 SHORT
ost  1 sz 16777216K rsz 1024K obj2 thr2 write 1467.80
[1472.62,1691.42] rewrite 1218.88 [ 821.23,1338.74] read
13538.41 SHORT
ost  1 sz 16777216K rsz 1024K obj2 thr4 write 1561.09
[1521.57,1682.75] rewrite 1183.31 [ 959.10,1372.52] read
15955.31 SHORT
ost  1 sz 16777216K rsz 1024K obj2 thr8 write 1498.74
[1543.58,1704.41] rewrite 1116.19 [1001.06,1163.91] read
15523.22 SHORT
ost  1 sz 16777216K rsz 1024K obj2 thr   16 write 1462.54
[ 985.08,1615.48] rewrite 1244.29 [1100.97,1444.80] read
15174.56 SHORT
ost  1 sz 16777216K rsz 1024K obj4 thr4 write 1483.42
[1497.88,1648.45] rewrite 1042.92 [ 801.25,1192.69] read
15997.30 SHORT
ost  1 sz 16777216K rsz 1024K obj4 thr8 write 1494.63
[1458.85,1624.13] rewrite 1041.81 [ 806.25,1183.89] read
15450.18 SHORT
ost  1 sz 16777216K rsz 1024K obj4 thr   16 write 1469.96
[1450.65,1647.45] rewrite 1027.06 [ 645.50,1215.86] read
15543.46 SHORT
ost  1 sz 16777216K rsz 1024K obj8 thr8 write 1417.93
[1250.85,1520.58] rewrite 1007.45 [ 905.15,1130.82] read
15789.66 SHORT
ost  1 sz 16777216K rsz 1024K obj8 thr   16 write 1324.28
[ 951.87,1518.26] rewrite  986.48 [ 855.21,1079.99] read
15510.70 SHORT
ost  1 sz 16777216K rsz 1024K obj   16 thr   16 write 1237.22
[ 989.07,1345.17] rewrite  915.56 [ 749.08,1033.03] read
15415.75 SHORT

==

== obdfilter_survey case = network 
Wed Jul  6 16:29:38 IDT 2011 Obdfilter-survey for case=network from
oss6
ost  1 sz 16777216K rsz 1024K obj1 thr1 write   87.99
[  86.92,  88.92