** Description changed:

  [ Impact ]
  
  * netplan-sriov-apply.service can sometimes fail to configure sriov
  interfaces.
  
  * The issue happens when netplan performs per-interface configuration
    while udev rules are renaming PF interfaces. When that happens, netplan
    fails to read some PF-related data because the expected
    /sys/class/net/<ifname>/ directory no longer exists.
  
  * Depending on the timing between netplan-sriov-apply.service and udev
    rule execution, one or more PF interfaces might be left unconfigured.
  
  * This issue might be a root cause of the following netplan bugs:
    - https://bugs.launchpad.net/netplan/+bug/1988018
    - https://bugs.launchpad.net/netplan/+bug/2020409
  
  * A proposed solution is to make sure that udev rules are triggered and
    have finished before netplan-sriov-apply.service starts executing.
  
  * The issue was most likely introduced by
    https://bugs.launchpad.net/netplan/+bug/1988018
    - this change introduced netplan-sriov-apply.service
    - jammy 0.107.1-3ubuntu0.22.04.2 is still in -proposed
    - noble/questing/resolute released it as part of v1.0
  
  * The issue is reproduced when the user sets the set-name config value to
    a name different from the one systemd networkd generated.
    - During the boot process, the interface is first named ethX, then
      networkd applies its PCI-address-based naming, and only then does udev
      process the rules created from the set-name config value.
    - If set-name is not used, or the name specified in set-name is the same
      as the one networkd generated, the issue does not reproduce.
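
  The failure mode described above can be illustrated with a small Python
  sketch. It is not netplan's code: read_pf_attribute() is a hypothetical
  helper, the fake sysfs tree lives in a temporary directory, and the
  vendor value is only a placeholder. It shows why a name captured before
  the udev rename stops resolving afterwards.

```python
import os
import tempfile
from pathlib import Path


def read_pf_attribute(ifname, attribute, sysfs_root="/sys/class/net"):
    """Return a PF device attribute, or None if the interface directory is gone."""
    path = Path(sysfs_root) / ifname / "device" / attribute
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        # The interface was renamed (or removed) after its name was captured.
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Fake sysfs tree: the PF starts out with its PCI-based name.
        device = Path(root) / "ens1f1np1" / "device"
        device.mkdir(parents=True)
        (device / "vendor").write_text("0x15b3\n")  # placeholder value

        print(read_pf_attribute("ens1f1np1", "vendor", root))

        # Simulate the udev set-name rule renaming the interface mid-run.
        os.rename(Path(root) / "ens1f1np1", Path(root) / "ens1f1")

        # The stale name no longer resolves; the new name does.
        print(read_pf_attribute("ens1f1np1", "vendor", root))
        print(read_pf_attribute("ens1f1", "vendor", root))
```

  Returning None instead of raising only makes the race visible; it does
  not remove it, which is why the proposed fix orders the service after
  udev has settled.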
  
  [ Test Plan ]
  
   * Create a netplan config which modifies the interface name and sets the
     sriov config, for instance:
    50-if.yaml:
   network:
    ethernets:
      ens1f0:
        match:
          macaddress: b8:3f:d2:09:38:94
        mtu: 1500
        optional: true
        set-name: ens1f0
      ens1f1:
        match:
          macaddress: b8:3f:d2:09:38:94
        mtu: 1500
        optional: true
        set-name: ens1f1
  
   99-sriov.yaml:
   network:
    version: 2
    ethernets:
      ens1f0:
        virtual-function-count: 32
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true
      ens1f1:
        virtual-function-count: 32
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true
  
  NOTE: the names generated for these interfaces by networkd are ens1f0np0
  and ens1f1np1
  
   * Reboot the host with the above config
  
   * After reboot, verify that the sriov configuration was properly applied
     on the interfaces.
   Expected result:
  Config was properly applied by netplan-sriov-apply.service
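
  One way to check the expected result is to compare the VF count exposed
  in sysfs with the configured value. The sketch below is an assumption
  about how such a check could look: vf_count() and check_sriov() are
  hypothetical helpers, and only virtual-function-count is verified, not
  the embedded switch mode.

```python
from pathlib import Path


def vf_count(ifname, sysfs_root="/sys/class/net"):
    """Return the number of VFs enabled on a PF, or None if it cannot be read."""
    path = Path(sysfs_root) / ifname / "device" / "sriov_numvfs"
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def check_sriov(expected, sysfs_root="/sys/class/net"):
    """Map each interface name to True if its VF count matches the expected one."""
    return {
        name: vf_count(name, sysfs_root) == count
        for name, count in expected.items()
    }


if __name__ == "__main__":
    # Interface names and counts taken from the example config above.
    for name, ok in check_sriov({"ens1f0": 32, "ens1f1": 32}).items():
        print(f"{name}: {'OK' if ok else 'NOT CONFIGURED'}")
```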
  
   Actual results:
  Feb 02 12:15:49 doopliss netplan[1163]: ERROR:root:could not determine vendor and device ID of ens1f1np1: [Errno 2] No such file or directory: '/sys/class/net/ens1f1np1/device/vendor'
  Feb 02 12:15:49 doopliss systemd[1]: netplan-sriov-apply.service: Main process exited, code=exited, status=1/FAILURE
  Feb 02 12:15:49 doopliss systemd[1]: netplan-sriov-apply.service: Failed with result 'exit-code'.
  
  In this example, netplan-sriov-apply.service started around Feb 02
  12:15:27 and properly configured the first interface using the old name
  ens1f0np0. Then the second interface, ens1f1np1, was renamed:
  Feb 02 12:15:37 doopliss kernel: mlx5_core 0000:4b:00.1 ens1f1: renamed from ens1f1np1
  Netplan, still using the name ens1f1np1, failed to read
  /sys/class/net/ens1f1np1/device/vendor, as the correct path is now
  /sys/class/net/ens1f1/device/vendor
  
  This is just an example. When an interface name changes while
  netplan-sriov-apply.service is running, netplan can fail in different
  parts of the code, each producing a similar
  "[Errno 2] No such file or directory" error, such as the one mentioned in
  LP1988018:
  Apr 16 15:44:44 romano netplan[1171]: failed parsing sriov_totalvfs for ens7f1np1: [Errno 2] No such file or directory: '/sys/class/net/ens7f1np1/device/sriov_totalvfs'
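
  Because the error text varies with the code path, it can help during
  triage to match on the common "[Errno 2]" part and extract the stale
  interface name. The following is a minimal sketch under that assumption;
  stale_interfaces() and the regular expression are illustrative and not
  part of netplan.

```python
import re

# Matches the common tail of the errors quoted above, whatever the prefix is.
MISSING_SYSFS = re.compile(
    r"\[Errno 2\] No such file or directory: "
    r"'(?P<path>/sys/class/net/(?P<ifname>[^/']+)/[^']*)'"
)


def stale_interfaces(log_lines):
    """Return {ifname: [missing sysfs paths]} found in the given log lines."""
    found = {}
    for line in log_lines:
        match = MISSING_SYSFS.search(line)
        if match:
            found.setdefault(match["ifname"], []).append(match["path"])
    return found


if __name__ == "__main__":
    sample = [
        "netplan[1163]: ERROR:root:could not determine vendor and device ID "
        "of ens1f1np1: [Errno 2] No such file or directory: "
        "'/sys/class/net/ens1f1np1/device/vendor'",
        "netplan[1171]: failed parsing sriov_totalvfs for ens7f1np1: "
        "[Errno 2] No such file or directory: "
        "'/sys/class/net/ens7f1np1/device/sriov_totalvfs'",
    ]
    for ifname, paths in stale_interfaces(sample).items():
        print(ifname, paths)
```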
  
  [ Where problems could occur ]
  
   * The proposed change makes sure that udev rules are triggered and have
     finished before netplan-sriov-apply.service starts.
     Inspecting the current logic shows that this is already done for the
     `netplan apply` command but is missing from `netplan apply --sriov-only`,
     which is what netplan-sriov-apply.service calls.
  
   * If any other processes are modifying interface names, the issue can
     still be reproduced.
  
   * With the new change, the following commands will be executed:
     - udevadm control --reload
     - udevadm trigger --action=add --subsystem-match=net
     - udevadm settle
     If any of these commands hangs, the service might not start properly
     and might leave interfaces unconfigured.
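
  The command sequence above, together with a guard against the hang
  scenario, could be sketched as follows. This is not the upstream patch:
  settle_udev() is a hypothetical helper, the 30 second timeout is an
  arbitrary assumption, and the injectable run parameter exists only so
  the logic can be exercised without root privileges.

```python
import subprocess

# The three commands listed above, in the order they are executed.
UDEV_COMMANDS = [
    ["udevadm", "control", "--reload"],
    ["udevadm", "trigger", "--action=add", "--subsystem-match=net"],
    ["udevadm", "settle"],
]


def settle_udev(timeout=30.0, run=subprocess.run):
    """Run the udev steps in order.

    Returns None on success, or the first command that failed, timed out,
    or could not be found.
    """
    for command in UDEV_COMMANDS:
        try:
            run(command, check=True, timeout=timeout)
        except (
            subprocess.CalledProcessError,
            subprocess.TimeoutExpired,
            FileNotFoundError,
        ):
            return command
    return None


if __name__ == "__main__":
    failed = settle_udev()
    if failed is not None:
        print("udev step did not complete:", " ".join(failed))
```

  Bounding each step with a timeout turns a hang into a reported failure,
  at the cost of possibly proceeding to sriov configuration before udev
  has finished on a slow host.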
  
  [ Other Info ]
  
   * Issue can be reproduced quite reliably on jammy-proposed
  
   * I was not able to reproduce the issue on Noble when applying the same
     configuration. By the time netplan-sriov-apply.service starts, the
     interfaces already have their final names. This might point to
     differences in systemd.
     This does not mean that the issue can't be reproduced there: the
     service requires interface names to be already set, and the current
     settings do not guarantee that.
  
  * Fix was verified on the PS6 environment which reported the issues in LP2020409
+ 
+ * Upstream PR: https://github.com/canonical/netplan/pull/569

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2139598

Title:
  Netplan can crash when applying sriov config

