Network failures cause an interface's health value to decrement. Recovery mode is the mechanism that raises the health value back up. Interfaces in recovery are pinged on a regular interval by the "lnet_monitor_thread", and each successful ping increases the health value of the interface (remote or local).
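As an aside, the health values described above can be inspected from userspace with lnetctl. A minimal sketch, assuming a node with LNet configured and a recent lnetctl; exact verbosity levels and field names vary by Lustre version:

```shell
# Local network interfaces: higher verbosity adds health stats such as
# the current health value and failure counters (version-dependent).
lnetctl net show -v 3

# Remote peer interfaces, including any currently on the recovery queue.
lnetctl peer show -v 3
```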
When LNet is selecting the local and remote interfaces to use for a PUT or GET, it considers the health value of each interface. Healthier interfaces are preferred.

Chris Horn

On 3/9/20, 4:22 AM, "Degremont, Aurelien" <degre...@amazon.com> wrote:

    What's the impact of being in recovery mode with LNet health?

    On 06/03/2020 21:12, "lustre-discuss on behalf of Chris Horn" <lustre-discuss-boun...@lists.lustre.org on behalf of ho...@cray.com> wrote:

        > LNetError: 10164:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked())
        > lpni <address> added to recovery queue. Health = 900

        The message means that the health value of a remote peer interface has been decremented and, as a result, the interface has been put into recovery mode. This mechanism is part of the LNet health feature.

        Health values are decremented when a PUT or GET fails. Usually there are other messages in the log that can tell you more about the specific failure; depending on your network type, you should see messages from socklnd or o2iblnd. Network congestion could certainly lead to message timeouts, which would in turn result in interfaces being placed into recovery mode.

        Chris Horn

        On 3/6/20, 8:59 AM, "lustre-discuss on behalf of Michael Di Domenico" <lustre-discuss-boun...@lists.lustre.org on behalf of mdidomeni...@gmail.com> wrote:

            Along with the aforementioned error, I also see these at the same time:

            LustreError: 9675:0:(obd_config.c:1428:class_modify_config()) <...>-clilov-<...>: failed to send uevent qos_threshold_rr=100

            On Fri, Mar 6, 2020 at 9:39 AM Michael Di Domenico <mdidomeni...@gmail.com> wrote:
            >
            > On Fri, Mar 6, 2020 at 9:36 AM Degremont, Aurelien <degre...@amazon.com> wrote:
            > >
            > > Did you see any actual error on your system?
            > >
            > > Because there is a patch that just decreases the verbosity level of such messages, which suggests they could be ignored.
            > > https://jira.whamcloud.com/browse/LU-13071
            > > https://review.whamcloud.com/#/c/37718/
            >
            > thanks. it's not entirely clear just yet. i'm trying to track down a
            > "slow jobs" issue. i see these messages everywhere, so it might be a
            > non-issue or a sign of something more pressing.

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
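The recovery and health behavior discussed in this thread is tunable. A hedged sketch, assuming Lustre 2.12+ with the LNet health feature; the parameter names below follow the lnet module parameters and may differ across versions:

```shell
# Show global LNet settings, including health-related tunables.
lnetctl global show

# How strongly a failed PUT/GET decrements an interface's health value;
# setting it to 0 disables the health feature.
lnetctl set health_sensitivity 100

# Interval, in seconds, at which the monitor thread pings interfaces
# that are in recovery mode.
lnetctl set recovery_interval 1
```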