(Re-sending my response to the list)

Yes, I believe that there are cases when problems on a remote node can be 
interpreted as local failures.
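
In rough terms (a simplified, self-contained sketch with stand-in names and
assumed default values, not the actual Lustre source): a send that times out
because the peer stopped responding still completes with an error on the
local NI, so the same path that handles genuine local interface faults,
lnet_handle_local_failure(), decrements the local health value.

#include <stdio.h>

#define LNET_MAX_HEALTH_VALUE 1000  /* assumed default best health */
#define HEALTH_SENSITIVITY     100  /* assumed default decrement per failure */

struct local_ni {
        const char *name;
        int health;
};

/* Any failed transmit completion on this interface lands here,
 * whether the root cause was local hardware or a dead peer. */
static void handle_local_failure(struct local_ni *ni)
{
        ni->health -= HEALTH_SENSITIVITY;
        if (ni->health < 0)
                ni->health = 0;
        printf("local NI %s: health lowered to %d\n", ni->name, ni->health);
}

int main(void)
{
        struct local_ni ni = { "o2ib0", LNET_MAX_HEALTH_VALUE };

        /* A client dropping out mid-transfer surfaces as a timeout,
         * i.e. a local completion error on the server's interface. */
        handle_local_failure(&ni);
        return 0;
}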


From: "nathan.dau...@noaa.gov" <nathan.dau...@noaa.gov>
Date: Sunday, March 8, 2020 at 3:56 AM
To: Chris Horn <ho...@cray.com>, "lustre-discuss@lists.lustre.org" 
<lustre-discuss@lists.lustre.org>
Cc: "nathan.dau...@noaa.gov" <nathan.dau...@noaa.gov>
Subject: Re: [lustre-discuss] lnet_peer_ni_add_to_recoveryq
Resent-From: <ho...@cray.com>
Resent-Date: Sunday, March 8, 2020 at 4:56 AM

Chris, all,

We are also seeing similar messages, primarily on our servers, but from 
lnet_handle_local_failure() instead. I haven't found any issues with the local 
o2ib interface yet, but there _may_ be a correlation with a client hang. Could 
this also be caused on a server by remote network problems or a client dropping 
out, in spite of the "local" name?

Thanks,
Nathan


On Mar 6, 2020 1:10 PM, Chris Horn <ho...@cray.com> wrote:

> LNetError: 10164:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked())
> lpni <address> added to recovery queue.  Health = 900

The message means that the health value of a remote peer interface has been 
decremented, and as a result, the interface has been put into recovery mode. 
This mechanism is part of the LNet health feature.

Health values are decremented when a PUT or GET fails. Usually there are other 
messages in the log that can tell you more about the specific failure; 
depending on your network type, you will probably see messages from socklnd or 
o2iblnd. Network congestion could certainly lead to message timeouts, which 
would in turn result in interfaces being placed into recovery mode.
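
To make the arithmetic concrete, here is a minimal stand-alone sketch
(simplified stand-in names and assumed default values, not the actual peer.c
code) of how a failed PUT or GET decrements a peer NI's health and queues it
for recovery pings. Assuming a best health of 1000 and a sensitivity of 100,
a single failure reproduces the "Health = 900" in your log.

#include <stdio.h>
#include <stdbool.h>

#define LNET_MAX_HEALTH_VALUE 1000  /* assumed default best health */
#define HEALTH_SENSITIVITY     100  /* assumed default decrement per failure */

struct peer_ni {
        const char *nid;
        int health;
        bool on_recovery_q;
};

/* Called when a PUT or GET to this remote peer interface fails. */
static void handle_remote_failure(struct peer_ni *lpni)
{
        lpni->health -= HEALTH_SENSITIVITY;
        if (lpni->health < 0)
                lpni->health = 0;

        /* Anything below best health is pinged until it recovers. */
        if (lpni->health < LNET_MAX_HEALTH_VALUE && !lpni->on_recovery_q) {
                lpni->on_recovery_q = true;
                printf("lpni %s added to recovery queue. Health = %d\n",
                       lpni->nid, lpni->health);
        }
}

int main(void)
{
        struct peer_ni lpni = { "10.0.0.1@o2ib", LNET_MAX_HEALTH_VALUE, false };

        /* One message timeout: 1000 - 100 = 900, matching the log line. */
        handle_remote_failure(&lpni);
        return 0;
}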

Chris Horn

On 3/6/20, 8:59 AM, "lustre-discuss on behalf of Michael Di Domenico" 
<lustre-discuss-boun...@lists.lustre.org on behalf of mdidomeni...@gmail.com> 
wrote:

    along with the aforementioned error, i also see these at the same time

    LustreError: 9675:0:(obd_config.c:1428:class_modify_config())
    <...>-clilov-<...>: failed to send uevent qos_threshold_rr=100

    On Fri, Mar 6, 2020 at 9:39 AM Michael Di Domenico
    <mdidomeni...@gmail.com> wrote:
    >
    > On Fri, Mar 6, 2020 at 9:36 AM Degremont, Aurelien <degre...@amazon.com> wrote:
    > >
    > > Did you see any actual error on your system?
    > >
    > > Because there is a patch that just decreases the verbosity level of such messages, which suggests they can be ignored.
    > > https://jira.whamcloud.com/browse/LU-13071
    > > https://review.whamcloud.com/#/c/37718/
    >
    > thanks.  it's not entirely clear just yet.  i'm trying to track down a
    > "slow jobs" issue.  i see these messages everywhere, so it might be a
    > non-issue or a sign of something more pressing.

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
