Re: [lustre-discuss] lnet_peer_ni_add_to_recoveryq

2020-03-09 Thread Chris Horn
(Re-sending my response to the list) Yes, I believe that there are cases when problems on a remote node can be interpreted as local failures. From: "nathan.dau...@noaa.gov" Date: Sunday, March 8, 2020 at 3:56 AM To: Chris Horn, "lustre-discuss@lists.lustre.org" Cc: "

Re: [lustre-discuss] lnet_peer_ni_add_to_recoveryq

2020-03-09 Thread Chris Horn
l). When LNet is selecting the local and remote interfaces to use for a PUT or GET, it considers the health value of each interface. Healthier interfaces are preferred. Chris Horn On 3/9/20, 4:22 AM, "Degremont, Aurelien" wrote: What's the impact of being in recovery m
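The preference for healthier interfaces described in this message can be illustrated with a toy model. This is only a sketch and not LNet's actual selection code: the `Interface` class, its `health` field, the numeric scale, and `pick_healthiest` are all hypothetical names introduced for illustration, and real selection likely weighs more than health alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    """Hypothetical stand-in for a local or peer network interface."""
    nid: str     # illustrative NID string, e.g. "192.168.1.2@o2ib"
    health: int  # assumed convention: a larger value means healthier


def pick_healthiest(interfaces):
    """Return the interface with the highest health value.

    Ties go to the interface listed first, because max() keeps the
    first maximal element it encounters.
    """
    if not interfaces:
        raise ValueError("no interfaces to choose from")
    return max(interfaces, key=lambda ni: ni.health)
```

An interface whose health value has dropped (for example after message timeouts) would simply lose out to a healthier peer in this model until its value recovers.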

Re: [lustre-discuss] lnet_peer_ni_add_to_recoveryq

2020-03-06 Thread Chris Horn
could certainly lead to message timeouts, which would in turn result in interfaces being placed into recovery mode. Chris Horn On 3/6/20, 8:59 AM, "lustre-discuss on behalf of Michael Di Domenico" wrote: along the aforementioned error i also see these at the same time lus

Re: [lustre-discuss] Lustre rpm install creating a file that breaks lustre

2019-10-02 Thread Chris Horn
Anything in dmesg? We need to know _why_ the network failed to start. Chris Horn From: Kurt Strosahl Date: Wednesday, October 2, 2019 at 1:55 PM To: Chris Horn, "lustre-discuss@lists.lustre.org" Subject: Re: [lustre-discuss] Lustre rpm install creating a file that breaks lustre

Re: [lustre-discuss] Lustre rpm install creating a file that breaks lustre

2019-10-02 Thread Chris Horn
Might be best to open a ticket for this. What was the nature of the failure? Chris Horn From: lustre-discuss on behalf of Kurt Strosahl Date: Wednesday, October 2, 2019 at 1:30 PM To: "lustre-discuss@lists.lustre.org" Subject: [lustre-discuss] Lustre rpm install creating a file t

Re: [lustre-discuss] 2.10.5 compiler versions

2018-08-30 Thread Chris Horn
.13.0-101 (Ubuntu 14.04) 4.4.0-85.108(Ubuntu 14.04.5 LTS) 4.4.0-131 (Ubuntu 16.04) 4.15.0-32 (Ubuntu 18.04) Chris Horn On 8/30/18, 2:34 PM, "lustre-discuss on behalf of Andreas Dilger" wrote: On Aug 30, 2018, at 13:28, E.S

Re: [lustre-discuss] LNET Routing Question

2018-05-23 Thread Chris Horn
or down (alive or dead) states of routers. Chris Horn From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Makia Minich <ma...@systemfabricworks.com> Date: Wednesday, May 9, 2018 at 8:51 AM To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lus

Re: [lustre-discuss] LNET routes

2018-01-26 Thread Chris Horn
. At least, this was true before multi-rail. I’m not sure if that has changed things w.r.t. route selection. Chris Horn From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Preeti Malakar <malakar.pre...@gmail.com> Date: Friday, January 26, 2018 at 10:28 AM To: "

Re: [lustre-discuss] 2.10.1 client fails to mount on 2.9 backend

2017-11-17 Thread Chris Horn
Is the MGS actually on tcp or is it on o2ib? Can you “lctl ping” the MGS LNet nid from the client where you’re trying to mount? Chris Horn From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Christopher Johnston <chjoh...@gmail.com> Date: Friday, November 1

Re: [lustre-discuss] Fwd: Re: Lustre compilation error

2017-10-21 Thread Chris Horn
I would need more information to help you. Maybe provide the complete terminal output of your build. Everything from getting the source to running ‘make rpms’. Chris Horn From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of parag_k <para...@citilindia.com>

Re: [lustre-discuss] Lustre compilation error

2017-10-17 Thread Chris Horn
It would be helpful if you provided more context. How did you acquire the source? What was your configure line? Is there a set of build instructions that you are following? Chris Horn From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Parag Khuraswar

Re: [lustre-discuss] Routers and shortest path

2017-10-16 Thread Chris Horn
for > the compute nodes. So I'm looking for another solution. AFAIK, this is your only option short of developing your own code for this situation (which would be cool!). Chris Horn On 10/16/17, 7:06 AM, "LOPEZ, ALEXANDRE" <alexandre.lo...@atos.net> wrote: Chris,

Re: [lustre-discuss] Routers and shortest path

2017-10-13 Thread Chris Horn
I think the only way to do this today is to assign the clients in each “islet” a unique LNet. What problems did that cause for you (besides the administrative headache?) Chris Horn On 10/13/17, 9:51 AM, "lustre-discuss on behalf of LOPEZ, ALEXANDRE" <lustre-discuss-boun...@lis

Re: [lustre-discuss] Lnet not starting

2017-10-13 Thread Chris Horn
…” should pull in any necessary modules. If that isn’t happening maybe you just need to run depmod. Chris Horn From: Ravi Konila <ravibh...@gmail.com> Reply-To: Ravi Konila <ravibh...@gmail.com> Date: Friday, October 13, 2017 at 9:29 AM To: Chris Horn <ho...@cray.com>, 'Lust

Re: [lustre-discuss] client eviction from oss on 2.8.0

2017-10-13 Thread Chris Horn
failures between the evicted client and the server hosting demo-OST0002? Chris Horn On 10/13/17, 1:44 PM, "lustre-discuss on behalf of John Casu" <lustre-discuss-boun...@lists.lustre.org on behalf of j...@chiraldynamics.com> wrote: client, server = 2.8.0, connected via 40GbE

Re: [lustre-discuss] exec start error for lustre-2.10.1_13_g2ee62fb

2017-10-13 Thread Chris Horn
https://jira.hpdd.intel.com/browse/LU-10119 I’ll push a patch Chris Horn On 10/13/17, 10:18 AM, "Dilger, Andreas" <andreas.dil...@intel.com> wrote: Could you please file a Jira ticket (and possibly a patch) to fix this, so it isn't forgotten. Cheers, Andreas

Re: [lustre-discuss] exec start error for lustre-2.10.1_13_g2ee62fb

2017-10-12 Thread Chris Horn
script and try to restart lnet with systemctl. Chris Horn On 10/12/17, 3:39 PM, "lustre-discuss on behalf of David Rackley" <lustre-discuss-boun...@lists.lustre.org on behalf of rack...@jlab.org> wrote: Greetings, I have built lustre-2.10.1_13_g2ee62fb on 3.10.0-69

Re: [lustre-discuss] Lnet not starting

2017-10-12 Thread Chris Horn
The pre-built rpms are most likely compiled against the in-kernel IB drivers. If you’re using the MOFED drivers you’ll need to recompile Lustre. The instructions here may help you out http://wiki.lustre.org/Compiling_Lustre Chris Horn From: Ravi Konila <ravibh...@gmail.com> Reply-To

Re: [lustre-discuss] Lnet not starting

2017-10-12 Thread Chris Horn
Are you compiling Lustre yourself or using pre-built rpms? Chris Horn From: Ravi Konila <ravibh...@gmail.com> Reply-To: Ravi Konila <ravibh...@gmail.com> Date: Thursday, October 12, 2017 at 11:40 AM To: Chris Horn <ho...@cray.com>, Parag Khuraswar <para...@citilindi

Re: [lustre-discuss] Lnet not starting

2017-10-12 Thread Chris Horn
dmesg output should provide more information about the “Invalid argument” error that you are seeing, but my guess would be that Lustre was compiled against a different IB stack than what you have installed. Chris Horn From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on

Re: [lustre-discuss] Lnet not starting

2017-10-12 Thread Chris Horn
It is not recommended to run MOFED-4.0 with Lustre. Only 4.1 or higher. Chris Horn On 10/12/17, 8:57 AM, "lustre-discuss on behalf of Peter Kjellström" <lustre-discuss-boun...@lists.lustre.org on behalf of c...@nsc.liu.se> wrote: On Thu, 12 Oct 2017 18:27:34 +0530 "

Re: [lustre-discuss] Lustre 2.10.0 multi rail configuration

2017-08-28 Thread Chris Horn
nid: 192.168.1.2@o2ib > # Multi-Rail: True > # peer ni: > # - nid: 192.168.1.2@o2ib > # - nid: 192.168.2.2@o2ib > # - primary nid: 172.16.1.1@o2ib1 > # Multi-Rail: True > # peer ni: > # - nid: 172.16.1.1@o2ib1 > #

Re: [lustre-discuss] Lustre poor performance

2017-08-21 Thread Chris Horn
The ko2iblnd-opa settings are tuned specifically for Intel OmniPath. Take a look at the /usr/sbin/ko2iblnd-probe script to see how OPA hardware is detected and the “ko2iblnd-opa” settings get used. Chris Horn From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Ri

Re: [lustre-discuss] Lustre Manual Bug Report: 25.2.2. Enabling and Tuning Root Squash

2017-06-06 Thread Chris Horn
+to+the+Lustre+Manual+source Chris Horn From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of "Gibbins, Faye" <faye.gibb...@cirrus.com> Date: Tuesday, June 6, 2017 at 6:41 AM To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.or

Re: [lustre-discuss] rpm names?

2017-04-21 Thread Chris Horn
That work was done in LU-5614 and some related tickets. I think the one for removing the kernel version from the package name was LU-7643. Chris Horn On 4/21/17, 12:58 PM, "lustre-discuss on behalf of Michael Di Domenico" <lustre-discuss-boun...@lists.lustre.org on behalf

Re: [lustre-discuss] Multi-cluster (multi-rail) setup

2015-06-12 Thread Chris Horn
lnet networks=o2ib2(ib0) Nodes on Cluster D: options lnet networks=o2ib3(ib0)” Again, that’s just a guess on how these things are typically configured. You’ll want to check if that is actually the case for your clusters. Chris Horn On Jun 12, 2015, at 2:37 AM, Thrash Er mingorrubi
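To compare the `networks=` values across clusters, a small parser can split each value into network names and interfaces. The sketch below handles only the simple `net(iface,...)` form quoted in this message. `parse_lnet_networks` is a hypothetical helper, not part of Lustre, and anything beyond that simple form is rejected rather than guessed at.

```python
import re

# One entry: a network name, optionally followed by "(iface[,iface...])".
_ENTRY = re.compile(r"\s*(\w+)(?:\(([^()]*)\))?\s*")


def parse_lnet_networks(value):
    """Parse e.g. 'o2ib2(ib0)' into [('o2ib2', ['ib0'])].

    Raises ValueError for syntax this sketch does not understand.
    """
    # Split on commas that are not inside a parenthesised interface list.
    entries = re.split(r",(?![^()]*\))", value)
    result = []
    for entry in entries:
        match = _ENTRY.fullmatch(entry)
        if match is None:
            raise ValueError(f"unsupported networks entry: {entry!r}")
        net, ifaces = match.groups()
        names = [name.strip() for name in ifaces.split(",")] if ifaces else []
        result.append((net, names))
    return result
```

For instance, `parse_lnet_networks("o2ib2(ib0)")` yields one pair: the network `o2ib2` bound to interface `ib0`.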

Re: [Lustre-discuss] LustreError codes -114 and -16 (ldlm_lib.c:1919:target_send_reply_msg())

2012-02-15 Thread Chris Horn
errno 16 is EBUSY (device or resource busy) and errno 114 is EALREADY (Operation already in progress). Chris Horn On Feb 15, 2012, at 10:52 AM, Marina Cacciagrano wrote: Hello, On all the nodes of a lustre 1.8.2 , I often see messages similar to the following in /var/log/syslog: LustreError
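Lustre logs these codes as negative errno values, so a lookup like the one in this message can be automated. The sketch below uses Python's standard `errno` and `os` modules. `describe_lustre_rc` is a hypothetical helper name, and the numeric values are platform-specific: 16 and 114 are the Linux numbers for EBUSY and EALREADY.

```python
import errno
import os


def describe_lustre_rc(rc):
    """Map a return code such as -16 to (symbolic name, description).

    Log messages carry negative errno values, so the sign is dropped
    before the lookup. Unknown codes get the name "UNKNOWN".
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # On Linux these are expected to resolve to EBUSY and EALREADY.
    for rc in (-16, -114):
        print(rc, *describe_lustre_rc(rc))
```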

Re: [Lustre-discuss] LustreError codes -114 and -16 (ldlm_lib.c:1919:target_send_reply_msg())

2012-02-15 Thread Chris Horn
Hard to say what's going on without additional context. The first message relates to an MGS_CONNECT rpc (o250), the second message relates to an MDS_CONNECT rpc (o38). I would suspect network issues. Chris Horn On Feb 15, 2012, at 12:46 PM, Marina Cacciagrano wrote: Thanks! Maybe that means

Re: [Lustre-discuss] Fwd: Lustre performance issue (obdfilter_survey

2011-07-06 Thread Chris Horn
FYI, there is some work being done to clean up obdfilter-survey. See https://bugzilla.lustre.org/show_bug.cgi?id=24490 If there was a script issue you might try the patch from that bug to see if you can reproduce. Chris Horn On Jul 6, 2011, at 3