I'm seeing a very similar issue on the same hardware. Do you think it
might be the same problem? If it is, then it was not actually fixed in
jammy (I'm running a kernel that supposedly already has the fix).

- same hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
- ubuntu 22.04: `5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux`
- using a bond over the two ports of the same card, at 25Gbps, to two
different switches
- bond is using LACP with layer3+4 hashing and fast timeout
- machine installed by MAAS. No issues during installation, but at that
time the bond is not formed yet
- later, when the installed Linux is booted, the bond is up and working
without issues
- it works fine for about 2 to 3 hours, then the issue starts (may or may
not be related to network load, but it seems to be triggered by some tests
that I run after OpenStack finishes installing)
- one of the legs of the bond freezes and everything that would go to that
leg is discarded, in and out; pings to random external hosts start losing
every second packet
- after some time the kernel log shows messages about "NETDEV WATCHDOG:
enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace
- the switch does log that the bond is flapping

[ 6337.489648] ------------[ cut here ]------------
[ 6337.489653] NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
[ 6337.489663] WARNING: CPU: 12 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x277/0x280
[ 6337.489669] Modules linked in: nf_conntrack_netlink geneve ip6_udp_tunnel udp_tunnel xt_CT dm_crypt scsi_transport_iscsi veth nfnetlink_cttimeout openvswitch nsh nf_conncount unix_diag nft_masq zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge sunrpc nvme_fabrics 8021q garp mrp stp llc bonding tls intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd ipmi_ssif binfmt_misc kvm_amd kvm dell_wmi ledtrig_audio sparse_keymap video nls_iso8859_1 rapl irdma dell_smbios dcdbas i40e wmi_bmof dell_wmi_descriptor ib_uverbs ib_core ccp ptdma k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops reed_solomon
[ 6337.489754]  pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cdc_ether usbnet mii mgag200 i2c_algo_bit drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect crc32_pclmul sysimgblt ghash_clmulni_intel bcache fb_sys_fops crc64 aesni_intel crypto_simd cec rc_core nvme ahci cryptd xhci_pci ice tg3 libahci drm megaraid_sas i2c_piix4 nvme_core xhci_pci_renesas wmi
[ 6337.489809] CPU: 12 PID: 0 Comm: swapper/12 Tainted: P           O      5.15.0-83-generic #92-Ubuntu
[ 6337.489812] Hardware name: Dell Inc. PowerEdge R7525/03WYW4, BIOS 2.12.4 07/26/2023
[ 6337.489814] RIP: 0010:dev_watchdog+0x277/0x280
[ 6337.489817] Code: eb 97 48 8b 5d d0 c6 05 2a e2 67 01 01 48 89 df e8 2e 5f f9 ff 44 89 e1 48 89 de 48 c7 c7 b8 ec 0d 8c 48 89 c2 e8 65 d6 19 00 <0f> 0b eb 80 e9 af 68 23 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56
[ 6337.489818] RSP: 0018:ffffa4e6d986ce70 EFLAGS: 00010282
[ 6337.489820] RAX: 0000000000000000 RBX: ffff950b843af000 RCX: 0000000000000027
[ 6337.489821] RDX: ffff9489bd520588 RSI: 0000000000000001 RDI: ffff9489bd520580
[ 6337.489822] RBP: ffffa4e6d986cea8 R08: 0000000000000003 R09: ffffffffffe3d588
[ 6337.489823] R10: 756575712074696d R11: 736e617274203a29 R12: 00000000000000a6
[ 6337.489824] R13: ffff950b9a613b80 R14: 00000000000000fd R15: ffff950b843af4c0
[ 6337.489825] FS:  0000000000000000(0000) GS:ffff9489bd500000(0000) knlGS:0000000000000000
[ 6337.489826] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6337.489827] CR2: 00007f89d69a0008 CR3: 00000101037be003 CR4: 0000000000770ee0
[ 6337.489828] PKRU: 55555554
[ 6337.489829] Call Trace:
[ 6337.489830]  <IRQ>
[ 6337.489834]  ? show_trace_log_lvl+0x1d6/0x2ea
[ 6337.489839]  ? show_trace_log_lvl+0x1d6/0x2ea

(cut)

[ 6337.489963] ice 0000:a1:00.0 enp161s0f0: tx_timeout: VSI_num: 6, Q 166, NTC: 0x1d, HW_HEAD: 0x52, NTU: 0x53, INT: 0x0
[ 6337.489967] ice 0000:a1:00.0 enp161s0f0: tx_timeout recovery level 1, txqueue 166
[ 6339.354957] bond0: (slave enp161s0f0): link status definitely down, disabling slave
[ 6339.386095] ice 0000:a1:00.0: Removed PTP clock
[ 6339.541782] ice 0000:a1:00.0: Clearing default VSI, re-enable after reset completes
[ 6340.184069] ice 0000:a1:00.0: PTP init successful
[ 6350.054268] ice 0000:a1:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
[ 6350.092303] ice 0000:a1:00.0: VSI rebuilt. VSI index 383, type ICE_VSI_CTRL
[ 6350.162346] bond0: (slave enp161s0f0): link status definitely up, 25000 Mbps full duplex
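
For anyone comparing their own logs against this one, here is a rough Python sketch that pulls the watchdog timeouts and the bond leg outages out of kernel log text. It assumes every kernel message sits on a single, unwrapped line (as `dmesg` prints it). The function names and the SAMPLE text are made up for illustration; the three sample lines are copied from the log above.

```python
import re

# Matches e.g. "[ 6337.489653] NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: (?P<iface>\S+) "
    r"\((?P<driver>\w+)\): transmit queue (?P<queue>\d+) timed out"
)

# Matches e.g. "[ 6339.354957] bond0: (slave enp161s0f0): link status definitely down, ..."
LINK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<bond>\S+): \(slave (?P<iface>\S+)\): "
    r"link status definitely (?P<state>up|down)"
)


def watchdog_timeouts(log_text):
    """Return (timestamp, interface, driver, queue) for each TX watchdog hit."""
    return [
        (float(m["ts"]), m["iface"], m["driver"], int(m["queue"]))
        for m in WATCHDOG_RE.finditer(log_text)
    ]


def bond_outages(log_text):
    """Return (interface, seconds_down) for each down -> up pair of a bond leg."""
    down_since = {}
    outages = []
    for m in LINK_RE.finditer(log_text):
        key = (m["bond"], m["iface"])
        ts = float(m["ts"])
        if m["state"] == "down":
            # Remember only the first "down" until the leg comes back up.
            down_since.setdefault(key, ts)
        elif key in down_since:
            outages.append((m["iface"], round(ts - down_since.pop(key), 6)))
    return outages


# Illustrative input: three lines taken from the log in this message.
SAMPLE = """\
[ 6337.489653] NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
[ 6339.354957] bond0: (slave enp161s0f0): link status definitely down, disabling slave
[ 6350.162346] bond0: (slave enp161s0f0): link status definitely up, 25000 Mbps full duplex
"""

if __name__ == "__main__":
    print(watchdog_timeouts(SAMPLE))
    print(bond_outages(SAMPLE))
```

On the excerpt above this reports one timeout on queue 166 and a single outage of roughly 10.8 seconds for enp161s0f0, which is the window between the "definitely down" and "definitely up" messages.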

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2004262

Title:
  Intel E810 NICs driver in causing hangs when booting and bonds
  configured

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Kinetic:
  Fix Released
Status in linux source package in Lunar:
  Confirmed

Bug description:
  [Impact]
    * Intel E810-family NICs cause system hangs when booting with bonding
      enabled
    * This happens due to the driver unplugging auxiliary devices
    * The unplug event happens under RTNL lock context, which causes a deadlock
      where the RDMA driver waits for the RTNL lock to complete removal

  [Test Plan]
    * Users have reported that after setting up bonding on switch and server
      side, the system will hang when starting network services

  [Fix]
    * The upstream patch defers unplugging/re-plugging of the auxiliary device,
      so that it's not performed under the RTNL lock context.
    * Fix was introduced by commit:
        248401cb2c46 ice: avoid bonding causing auxiliary plug/unplug under RTNL lock
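
The deferral idea in [Fix] can be illustrated with a small user-space analogy. This is only a sketch of the general pattern (do not do lock-taking work from a context that already holds the lock; queue it and run it later), not the driver code from the commit. The lock, the queue, and all function names below are hypothetical stand-ins.

```python
import queue
import threading

# Hypothetical stand-in for the kernel's RTNL lock (non-reentrant).
rtnl_lock = threading.Lock()
# Hypothetical stand-in for work deferred to a later, lock-free context.
deferred_work = queue.Queue()
events = []


def unplug_aux_dev():
    """Removal path that itself needs the lock (like the RDMA driver's removal)."""
    with rtnl_lock:
        events.append("aux device unplugged")


def bonding_event_handler():
    """Runs with rtnl_lock already held by the caller.

    Calling unplug_aux_dev() directly here would block forever, because
    the lock is not reentrant. Instead, the work is only queued.
    """
    deferred_work.put(unplug_aux_dev)
    events.append("unplug deferred")


def service_task():
    """Runs later, without the lock held, and drains the deferred work."""
    while not deferred_work.empty():
        deferred_work.get()()


def run_demo():
    events.clear()
    with rtnl_lock:              # the bonding event arrives under the lock
        bonding_event_handler()
    service_task()               # deferred unplug happens after release
    return list(events)


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is the ordering: the handler returns while the lock is still held, and the unplug only takes the lock after the original holder has released it.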

  [Regression Potential]
    * Regressions would manifest in devices that support RDMA functionality and
      have been added to a bond
    * We should look out for auxiliary devices that haven't been properly
      unplugged, or that cause further issues with
      ice_plug_aux_dev()/ice_unplug_aux_dev()

  
  [Original Description]
  jammy 22.04.1
  linux-image-generic 5.15.0-58-generic
  Intel E810-XXV Dual Port NICs in Dell PowerEdge 650

  - 5.15 in jammy -> reproducible
  - 5.19 in hwe-edge -> reproducible
  - 6.2.rc6 in the mainline build -> works
  - Intel's ice driver 1.10.1.2.2 -> works

  After bonding is enabled on switch and server side, the system will
  hang while initializing Ubuntu.  The kernel loads, but around the time
  Network Services start the system can hang, sometimes for 5 minutes
  and in other cases indefinitely.

  The message:

  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" systemd-resolve
  blocked for more than 120 seconds

  appears, and eventually the network services just attempt to start
  and never do.  This is with or without DHCP enabled.

  Tried this same setup with the hwe-22.04, hwe-20.04, hwe-22.04-edge and
  linux-oem kernels and all exhibit the same failure.

  As a workaround, installing version 1.10.1.2.2 of the Intel 'ice'
  driver works.  The system doesn't even remotely hang at startup
  and all networking functions remain working (ping, DNS, general
  accessibility).

  The driver can be found at
  https://downloadmirror.intel.com/763930/ice-1.10.1.2.2.tar.gz
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Jan 31 13:08 seq
   crw-rw---- 1 root audio 116, 33 Jan 31 13:08 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.3
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5json:
   {
     "result": "skip"
   }
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-01-27 (3 days ago)
  InstallationMedia: Ubuntu-Server 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R650
  Package: linux (not installed)
  PciMultimedia:

  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-58-generic root=UUID=668aab7c-abe9-434b-a810-acc6eab76cbc ro fsck.mode=skip
  ProcVersionSignature: Ubuntu 5.15.0-58.64-generic 5.15.74
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-58-generic N/A
   linux-backports-modules-5.15.0-58-generic  N/A
   linux-firmware                             20220329.git681281e4-0ubuntu3.9
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 5.15.0-58-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 09/14/2022
  dmi.bios.release: 1.8
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 1.8.2
  dmi.board.name: 0PJ7YJ
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr1.8.2:bd09/14/2022:br1.8:svnDellInc.:pnPowerEdgeR650:pvr:rvnDellInc.:rn0PJ7YJ:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=0912;ModelName=PowerEdgeR650:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R650
  dmi.product.sku: SKU=0912;ModelName=PowerEdge R650
  dmi.sys.vendor: Dell Inc.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2004262/+subscriptions
