If deleting and re-adding it restores the status to up, then this sounds like a 
bug to me.

Can you enable debug tracing, reproduce the issue, and add this information to 
a ticket?

To enable/gather debug:

# lctl set_param debug=+net
<reproduce issue>
# lctl dk > /tmp/dk.log
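
Once you've captured the log, you can revert the debug setting if you like; 
something along these lines should do it (adjust to taste):

# lctl set_param debug=-net
# lctl clear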

You can create a ticket at https://jira.whamcloud.com/

Please provide the dk.log with the ticket.

Thanks,
Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 腐朽银 
via lustre-discuss <lustre-discuss@lists.lustre.org>
Date: Friday, February 17, 2023 at 2:53 AM
To: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] LNet nid down after some thing changed the NICs
Hi,

I encountered a problem when using the Lustre Client on k8s with kubenet. I 
would be very happy if you could help me.

My LNet configuration is:

net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.224.0.5@tcp
          status: up
          interfaces:
              0: eth0
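
For reference, I'm reading this state with something like the following 
(assuming the lnetctl from my 2.15.1 build):

# lnetctl net show -v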

It works, but after I deploy or delete a pod on the node, the nid goes down:

        - nid: 10.224.0.5@tcp
          status: down
          interfaces:
              0: eth0

k8s uses veth pairs, so it adds or deletes network interfaces when pods are 
deployed or deleted, but it doesn't touch the eth0 NIC. I can fix the status by 
deleting the tcp net with `lnetctl net del` and re-adding it with 
`lnetctl net add`, but I have to do this every time a pod is scheduled on this 
node.
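
For reference, the workaround I run is roughly this (interface and net names as 
above; exact flags may differ slightly between lnetctl versions):

# lnetctl net del --net tcp
# lnetctl net add --net tcp --if eth0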

My node OS is Ubuntu 18.04 with kernel 5.4.0-1101-azure. I built the Lustre 
Client myself from 2.15.1. Is this expected LNet behavior, or did I get 
something wrong? I rebuilt and tested it several times and got the same problem.

Regards,
Chuanjun
