[Kernel-packages] [Bug 1898057] Re: Infiniband transmit queue timeouts after upgrading to linux-hwe-5.4

2020-10-01 Thread Mikko Tanner
apport information

** Package changed: linux-hwe-5.4 (Ubuntu) => linux (Ubuntu)

** Tags added: apport-collected bionic

** Description changed:

  After upgrading 3 servers from linux-image-5.3.0-40-generic to
  linux-image-5.4.0-48-generic, I have started seeing the following queue
  timeouts from IP-over-Infiniband (ipoib) devices. The devices in
  question (with the newest available firmware, 2.42.5000) are:
  
  # lspci -nnk -s 83:00.0
  83:00.0 Network controller [0280]: Mellanox Technologies MT27500 Family [ConnectX-3] [15b3:1003]
  Subsystem: Mellanox Technologies MT27500 Family [ConnectX-3] [15b3:0027]
  Kernel driver in use: mlx4_core
  
  Below is the WARN from one machine's syslog. The others are practically
  identical. When the WARN happens on any of the machines, the other 2 will
  _also_ exhibit queue timeouts. Additionally, other (unrelated) machines
  connected to the same infiniband fabric will exhibit a 12-second
  transmission delay. This could conceivably be caused by these 3 servers
  also being Subnet Managers (opensm package).
  
  The infiniband fabric is partitioned, with the affected partition (8011)
  seeing most of the traffic.
  
  
  
  kernel: [52642.480066] [ cut here ]
  kernel: [52642.480092] NETDEV WATCHDOG: ib0.8011 (): transmit queue 0 timed out
  kernel: [52642.480120] WARNING: CPU: 13 PID: 0 at /build/linux-hwe-5.4-8m2I8l/linux-hwe-5.4-5.4.0/net/sched/sch_generic.c:448 dev_watchdog+0x264/0x270
  kernel: [52642.480121] Modules linked in: aufs overlay ip6table_raw ip6table_mangle ip6table_nat iptable_raw iptable_mangle iptable_nat nf_tables nfnetlink cfg80211 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter mst_pciconf(OE) mst_pci(OE) 8021q garp mrp stp llc nls_iso8859_1 intel_rapl_msr lz4 lz4_compress intel_rapl_common ib_iser rdma_cm sb_edac iw_cm iscsi_tcp libiscsi_tcp libiscsi x86_pkg_temp_thermal scsi_transport_iscsi intel_powerclamp zram veth vhost_net tap coretemp vhost kvm_intel crct10dif_pclmul crc32_pclmul ghash_clmulni_intel kvm openvswitch nsh nf_conncount nf_nat nf_conntrack rapl nf_defrag_ipv6 nf_defrag_ipv4 joydev input_leds intel_cstate ib_ipoib mei_me mei ib_cm ioatdma ib_umad lpc_ich acpi_pad acpi_power_meter mac_hid ipmi_si ipmi_ssif ipmi_devintf ipmi_msghandler kyber_iosched sch_fq_codel tcp_highspeed ip_tables x_tables autofs4 zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) zlua(POE) btrfs zstd_compress raid10 raid456
  kernel: [52642.480182]  async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear dm_mirror dm_region_hash dm_log mlx4_ib ib_uverbs ib_core hid_generic raid1 ses enclosure usbhid hid ast drm_vram_helper i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt aesni_intel fb_sys_fops ixgbe glue_helper mpt3sas nvme xfrm_algo crypto_simd ahci raid_class dca cryptd mlx4_core drm megaraid_sas nvme_core libahci scsi_transport_sas mdio wmi
  kernel: [52642.480221] CPU: 13 PID: 0 Comm: swapper/13 Tainted: P   OE 5.4.0-48-generic #52~18.04.1-Ubuntu
  kernel: [52642.480223] Hardware name: Supermicro Super Server/X10DRW-iT, BIOS 2.0b 04/13/2017
  kernel: [52642.480226] RIP: 0010:dev_watchdog+0x264/0x270
  kernel: [52642.480229] Code: 48 85 c0 75 e6 eb a0 4c 89 ef c6 05 42 c1 e7 00 01 e8 30 b8 fa ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 50 05 63 ae e8 4c 31 71 ff <0f> 0b eb 82 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
  kernel: [52642.480230] RSP: 0018:b4998c970e48 EFLAGS: 00010282
  kernel: [52642.480233] RAX:  RBX:  RCX: 083f
  kernel: [52642.480234] RDX:  RSI: 00f6 RDI: 083f
  kernel: [52642.480235] RBP: b4998c970e78 R08: 08fd R09: 0003
  kernel: [52642.480237] R10: b4998c970ee8 R11: 0001 R12: 0001
  kernel: [52642.480238] R13: 93d302465000 R14: 93d302465480 R15: 93f2d293c880
  kernel: [52642.480240] FS:  () GS:93f33f64() knlGS:
  kernel: [52642.480242] CS:  0010 DS:  ES:  CR0: 80050033
  kernel: [52642.480243] CR2: f90140127000 CR3: 002a26e0a004 CR4: 001626e0
  kernel: [52642.480245] Call Trace:
  kernel: [52642.480247]  
  kernel: [52642.480252]  ? pfifo_fast_reset+0x110/0x110
  kernel: [52642.480255]  call_timer_fn+0x32/0x130
  kernel: [52642.480258]  run_timer_softirq+0x443/0x480
  kernel: [52642.480262]  ? ktime_get+0x43/0xa0
  kernel: [52642.480268]  ? lapic_next_deadline+0x26/0x30
  kernel: [52642.480273]  __do_softirq+0xe4/0x2da
  kernel: [52642.480278]  irq_exit+0xae/0xb0
  kernel: [52642.480282]  smp_apic_timer_interrupt+0x79/0x130
  kernel: [52642.480285]  apic_timer_interrupt+0xf/0x20
  kernel: [52642.480286]  
  kernel: [52642.480292] RIP: 0010:cpuidle_enter_state+0xbc/0x440
  kernel
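
Since the report says the timeouts show up on all 3 servers around the
same time, it may help to pull just the watchdog lines out of each
machine's syslog and compare them. The following is only a minimal
sketch, assuming the lines have the "kernel: [uptime] NETDEV WATCHDOG:
..." shape quoted in this report. The function name
find_watchdog_timeouts and the sample list are hypothetical and not part
of any existing tool.

```python
import re

# Matches the watchdog line shape quoted in this report, for example:
# "kernel: [52642.480092] NETDEV WATCHDOG: ib0.8011 (): transmit queue 0 timed out"
# The parenthesised driver name is empty in the report, so it is optional here.
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+NETDEV WATCHDOG:\s+(?P<dev>\S+)\s+"
    r"\((?P<driver>[^)]*)\):\s+transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(lines):
    """Return (uptime_seconds, device, queue) for each watchdog timeout line."""
    events = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            events.append(
                (
                    float(match.group("uptime")),
                    match.group("dev"),
                    int(match.group("queue")),
                )
            )
    return events


# Hypothetical input: two lines copied from the trace in this report.
sample = [
    "kernel: [52642.480092] NETDEV WATCHDOG: ib0.8011 (): transmit queue 0 timed out",
    "kernel: [52642.480245] Call Trace:",
]

for uptime, dev, queue in find_watchdog_timeouts(sample):
    print(f"{dev}: transmit queue {queue} timed out at uptime {uptime}s")
```

Note that the bracketed value is seconds since boot on that machine, so
comparing events across the 3 servers would need the wall-clock syslog
timestamp as well, which the excerpt in this report does not include.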

[Kernel-packages] [Bug 1898057] Re: Infiniband transmit queue timeouts after upgrading to linux-hwe-5.4

2020-10-01 Thread Mikko Tanner
** Changed in: linux (Ubuntu)
   Status: New => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1898057

Title:
  Infiniband transmit queue timeouts after upgrading to linux-hwe-5.4

Status in linux package in Ubuntu:
  Confirmed


[Kernel-packages] [Bug 1898057] Re: Infiniband transmit queue timeouts after upgrading to linux-hwe-5.4

2020-12-11 Thread Mikko Tanner
After a reboot of the whole fabric (switches and machines), this problem
has not resurfaced. Conceivably this was a transient error state, so I
will close this as "Invalid".

** Changed in: linux (Ubuntu)
   Status: Confirmed => Invalid
