Right, the round-robin-only approach may be a deal breaker. You might be able
to come up with a “poor-man’s” solution, though I don’t think there is an
obvious path. You might be able to force the health detection as you suggest,
or manually bring up the 2nd interface when you know the first has failed.
Chris or others might have a more specific approach or advice. While 2.14.0
is coming out “soon” (2.14.0-RC2 is tagged), I don’t think that UDSP (the
user defined selection policy feature) made it:
https://jira.whamcloud.com/browse/LU-9121 isn’t listed on the
https://wiki.lustre.org/Release_2.14.0 project page.

-Cory

On 2/11/21, 5:56 PM, "Nathan Crawford" <nrcra...@uci.edu> wrote:

Hi Chris and Cory,

  I remember looking at configuring multi-rail when 2.12 came out for this very 
reason, but stopped when it looked like round-robin only. Is there a way to 
trick the LNet Health system into seeing one interface as "sick but not dead"?

  Also, when is 2.14 coming out :)

  For what it's worth, the client errors I'm trying to diagnose (only one 
client has them) are similar to:
[Thu Feb 11 15:51:24 2021] LustreError: 11-0: 
DFS-L-OST0003-osc-ffff9cd07c339000: operation ost_set_info to node 
10.201.32.48@o2ib1 failed: rc = -107
[Thu Feb 11 15:51:24 2021] Lustre: DFS-L-OST0003-osc-ffff9cd07c339000: 
Connection to DFS-L-OST0003 (at 10.201.32.48@o2ib1) was lost; in progress 
operations using this service will wait for recovery to complete
[Thu Feb 11 15:51:24 2021] LustreError: 167-0: 
DFS-L-OST0003-osc-ffff9cd07c339000: This client was evicted by DFS-L-OST0003; 
in progress operations using this service will fail.
[Thu Feb 11 15:51:24 2021] Lustre: DFS-L-OST0003-osc-ffff9cd07c339000: 
Connection restored to 10.201.32.48@o2ib1 (at 10.201.32.48@o2ib1)
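
One way to see how often each target goes through this lost/evicted/restored
cycle is to tally the messages per OST. The following is a minimal Python
sketch, not a Lustre tool: the event labels, the function names (join_wrapped,
tally), and the regular expressions are assumptions based only on the four
messages quoted above, so any other message format is simply skipped.

```python
import re
from collections import Counter

# A record starts with a kernel timestamp such as "[Thu Feb 11 15:51:24 2021]".
RECORD_START = re.compile(
    r"^\[[A-Z][a-z]{2} [A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \d{4}\]"
)

# Target name as it appears before "-osc-", e.g. "DFS-L-OST0003".
TARGET = re.compile(r"(\S+-OST[0-9a-fA-F]{4})-osc-")

# Phrases copied from the quoted messages, mapped to labels chosen here.
EVENTS = {
    "failed: rc =": "op_failed",
    "was lost": "lost",
    "was evicted by": "evicted",
    "Connection restored to": "restored",
}


def join_wrapped(lines):
    """Rejoin records that were wrapped across several lines (e.g. by email)."""
    record = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) and record:
            yield " ".join(record)
            record = []
        record.append(line)
    if record:
        yield " ".join(record)


def tally(lines):
    """Count (target, event) pairs found in the given log lines."""
    counts = Counter()
    for record in join_wrapped(lines):
        target = TARGET.search(record)
        if target is None:
            continue  # not one of the OSC messages this sketch knows about
        for phrase, label in EVENTS.items():
            if phrase in record:
                counts[(target.group(1), label)] += 1
    return counts
```

For example, with the client's kernel log saved to a file (the name dmesg.txt
is hypothetical), tally(open("dmesg.txt")) returns a Counter keyed by
(target, event), which makes it easier to check whether the evictions cluster
on one OST or are spread across all of them.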

Thanks,
Nate

On Thu, Feb 11, 2021 at 1:25 PM Horn, Chris <chris.h...@hpe.com> wrote:
FYI, multi-rail in 2.12 will round robin traffic between both @tcp and @o2ib 
networks. If @o2ib flakes out then traffic should shift entirely to @tcp, but 
there isn’t a way to specify that traffic go to @tcp only when there’s a 
problem with @o2ib. You need the user defined selection policy feature for 
that, and that feature is not slated to arrive until after 2.14 (afaik).

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org>
 on behalf of "Spitz, Cory James" <cory.sp...@hpe.com>
Date: Thursday, February 11, 2021 at 3:17 PM
To: "nathan.crawf...@uci.edu" <nathan.crawf...@uci.edu>, Lustre User
Discussion Mailing List <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] LNET IB intermittent connection
Resent-From: <ho...@cray.com>
Resent-Date: Thursday, February 11, 2021 at 3:17 PM

Hi, Nate.

You asked, “can LNET be easily configured to go over the @tcp connection when 
the @o2ib flakes out?”

Yes, you can use LNet Multi-Rail for it and that _is_ covered in the “fine 
manual”, chapter 16 ☺
https://doc.lustre.org/lustre_manual.xhtml#lnetmr

-Cory

On 2/10/21, 4:54 PM, "lustre-discuss"
<lustre-discuss-boun...@lists.lustre.org> wrote:

Hi All,

  I've recently been having a bunch of LNET over Infiniband 
connection-lost/-restored errors and am trying to find the cause and/or tune 
the system to better cope. There is a lot of stuff on the wiki (
https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency),
but that's from 2016, and I don't know what parts are superseded. I'm
currently running Lustre 2.12.5 on CentOS 7.8, with a mix of Q-Logic/Intel QDR 
and Mellanox EDR HCAs and switches (using CentOS in-box RDMA/opensm).

  Is there a better place to look (e.g. the fine manual, section X) for 
guidance? I've done a few searches on the Jira, but the most similar errors 
should have already been fixed in earlier releases.

  Assuming that there is actually some impending hardware issue, can LNET be 
easily configured to go over the @tcp connection when the @o2ib flakes out?

Thanks,
Nate

--

Dr. Nathan Crawford              nathan.crawf...@uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall                 Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA


_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
