** Changed in: ubuntu-power-systems Status: Triaged => Fix Committed
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1776389

Title:
[Ubuntu 1804][boston][ixgbe] EEH causes kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352 (i2S)

Status in The Ubuntu-power-systems project:
Fix Committed
Status in linux package in Ubuntu:
Fix Committed
Status in linux source package in Bionic:
Fix Committed

Bug description:
== Comment: #0 - ABDUL HALEEM <> - 2018-02-16 08:26:15 ==
Problem:
------------
Injecting error multiple times causes kernel crash.

echo 0x0:1:4:0x6000008000000:0xfff80000 > /sys/kernel/debug/powerpc/PCI0000/err_injct

EEH: PHB#0 failure detected, location: N/A
EEH: PHB#0-PE#0 has failed 6 times in the last hour and has been permanently disabled.
EEH: Unable to recover from failure from PHB#0-PE#0.
Please try reseating or replacing it
ixgbe 0000:01:00.1: Adapter removed
kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache joydev input_leds mac_hid idt_89hpesx ofpart ipmi_powernv cmdlinepart ipmi_devintf ipmi_msghandler at24 powernv_flash mtd opal_prd ibmpowernv uio_pdrv_genirq vmx_crypto uio sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas qla2xxx ast hid_generic ttm drm_kms_helper ixgbe syscopyarea usbhid igb sysfillrect sysimgblt nvme_fc fb_sys_fops hid nvme_fabrics crct10dif_vpmsum crc32c_vpmsum drm i40e scsi_transport_fc aacraid i2c_algo_bit mdio
CPU: 28 PID: 972 Comm: eehd Not tainted 4.15.0-10-generic #11-Ubuntu
NIP: c00000000077f080 LR: c00000000077f070 CTR: c0000000000aac30
REGS: c000000ff1deb5a0 TRAP: 0700 Not tainted (4.15.0-10-generic)
MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002822 XER: 20040000
CFAR: c00000000018bddc SOFTE: 1
GPR00: c00000000077f070 c000000ff1deb820 c0000000016ea600 c000000fbb5fac00
GPR04: 00000000000002c5 0000000000000000 0000000000000000 0000000000000000
GPR08: c000000fbb5fac00 0000000000000001 c000000fec617a00 c000000fdfd86488
GPR12: 0000000000000040 c000000007a33400 c000000000138be8 c000000ff90ec1c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000f48d10
GPR24: c000000000f48ce8 c000200e4fcf4000 c000000fc6900b18 c000200e4fcf4000
GPR28: c000200e4fcf4288 c008000010624480 0000000000000000 c000000fbb633ea0
NIP [c00000000077f080] free_msi_irqs+0xa0/0x260
LR [c00000000077f070] free_msi_irqs+0x90/0x260
Call Trace:
[c000000ff1deb820] [c00000000077f070] free_msi_irqs+0x90/0x260 (unreliable)
[c000000ff1deb880] [c00000000077fa68] pci_disable_msix+0x128/0x170
[c000000ff1deb8c0] [c00800001060b5c8] ixgbe_reset_interrupt_capability+0x90/0xd0 [ixgbe]
[c000000ff1deb8f0] [c0080000105d52f4] ixgbe_remove+0xec/0x240 [ixgbe]
[c000000ff1deb990] [c0000000007670ec] pci_device_remove+0x6c/0x110
[c000000ff1deb9d0] [c00000000085d194] device_release_driver_internal+0x224/0x310
[c000000ff1deba20] [c00000000075b398] pci_stop_bus_device+0x98/0xe0
[c000000ff1deba60] [c00000000075b588] pci_stop_and_remove_bus_device+0x28/0x40
[c000000ff1deba90] [c00000000005e1d0] pci_hp_remove_devices+0x90/0x130
[c000000ff1debb20] [c00000000005e184] pci_hp_remove_devices+0x44/0x130
[c000000ff1debbb0] [c00000000003ec04] eeh_handle_normal_event+0x134/0x580
[c000000ff1debc60] [c00000000003f160] eeh_handle_event+0x30/0x338
[c000000ff1debd10] [c00000000003f830] eeh_event_handler+0x140/0x200
[c000000ff1debdc0] [c000000000138d88] kthread+0x1a8/0x1b0
[c000000ff1debe30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
Instruction dump:
419effe0 3bc00000 4800000c 60420000 807f0010 7c7e1a14 78630020 4ba0cd3d
60000000 e9430158 312affff 7d295110 <0b090000> 813f0014 395e0001 7d5e07b4
---[ end trace 23c446a470e60864 ]---

ixgbe 0000:01:00.0: Adapter removed
Sending IPI to other CPUs
OPAL: Switch to big-endian OS
OPAL: Switch to little-endian OS
PHB#0000[0:0]: eeh_freeze_clear on fenced PHB

---uname output---
Linux ltciofvtr-bostonlc1 4.15.0-10-generic #11-Ubuntu SMP Tue Feb 13 18:21:52 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = Boston-LC

0000:00:00.0 PCI bridge [0604]: IBM Device [1014:04c1]
0000:01:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
0000:01:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)

# ethtool -i enp1s0f0
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x800006da
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Userspace tool common name: EEH

== Comment: #6 - Mauro Rodrigues <> - 2018-03-19 11:54:03 ==
It probably will not be accepted as is, but I'll send a solution upstream.

Long story short: we add ixgbe_free_irq right before ixgbe_clear_interrupt_scheme in ixgbe_remove.

That has a side effect: the crash happens on the hotplug remove path, and with the patch applied the usual removal path (for instance, unbind in sysfs) frees the interrupt twice. To avoid that, I'll send a patch that integrates the free_irq call into the clear interrupt scheme code path.

== Comment: #8 - Mauro Rodrigues <> - 2018-04-18 12:23:34 ==
Waiting for upstream feedback at:
http://patchwork.ozlabs.org/patch/900279/
which reads "ixgbe: Fix free irq process when removing device due to PCI Errors"

== Comment: #9 - Mauro Rodrigues <> - 2018-05-03 11:56:49 ==
The v3 of the patch is going through Intel's queue for further testing:
http://patchwork.ozlabs.org/patch/907695/
which reads: "ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"

== Comment: #11 - Mauro Rodrigues <> - 2018-06-11 10:06:35 ==
This got merged into Torvalds' tree last week and I didn't notice before:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/intel/ixgbe?id=b212d815e77c72be921979119c715166cc8987b1
which reads: "ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"

I'll submit it to the Canonical ML today.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1776389/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
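
For readers following comment #6, here is a rough userspace model of the idea described there: release the interrupt handlers before MSI-X is torn down, and guard the release so that a second removal path does not free them twice. The trace suggests the MSI core refused to free vectors while a handler was still attached; that reading is an assumption, not something the report states. Every name below (mock_adapter, mock_free_irqs, and so on) is hypothetical and only mimics the shape of the driver code. This is not the ixgbe source and not the merged patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_VECTORS 4

/* Hypothetical stand-in for the adapter state; not struct ixgbe_adapter. */
struct mock_adapter {
    bool msix_enabled;
    bool irq_requested;            /* guard: handlers are released only once */
    bool has_action[NUM_VECTORS];  /* true while a handler is still attached */
    int free_calls;                /* times the handlers were really released */
};

/* Models bringing the device up: enable MSI-X and attach a handler per vector. */
static void mock_request_irqs(struct mock_adapter *a)
{
    a->msix_enabled = true;
    for (size_t i = 0; i < NUM_VECTORS; i++)
        a->has_action[i] = true;
    a->irq_requested = true;
}

/* Models releasing the handlers; a repeated call is a harmless no-op. */
static void mock_free_irqs(struct mock_adapter *a)
{
    if (!a->irq_requested)
        return;
    for (size_t i = 0; i < NUM_VECTORS; i++)
        a->has_action[i] = false;
    a->irq_requested = false;
    a->free_calls++;
}

/*
 * Models the MSI-X teardown. Returns false where this model assumes the
 * kernel would hit the BUG: a vector that still has a handler attached.
 */
static bool mock_disable_msix(struct mock_adapter *a)
{
    for (size_t i = 0; i < NUM_VECTORS; i++)
        if (a->has_action[i])
            return false;
    a->msix_enabled = false;
    return true;
}

/* Integrated path from comment #6: release handlers, then tear down MSI-X. */
static bool mock_clear_interrupt_scheme(struct mock_adapter *a)
{
    mock_free_irqs(a);
    return mock_disable_msix(a);
}
```

In this model, calling mock_disable_msix directly after mock_request_irqs fails, which corresponds to the crash above, while mock_clear_interrupt_scheme succeeds. A later mock_free_irqs call from another removal path does nothing, so the handlers are released exactly once.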