** Changed in: ubuntu-power-systems
       Status: Triaged => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1776389

Title:
  [Ubuntu 1804][boston][ixgbe] EEH causes kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352 (i2S)

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  == Comment: #0 - ABDUL HALEEM <> - 2018-02-16 08:26:15 ==
  Problem:
  ------------
  Injecting error multiple times causes kernel crash.

  echo 0x0:1:4:0x6000008000000:0xfff80000 > /sys/kernel/debug/powerpc/PCI0000/err_injct

  EEH: PHB#0 failure detected, location: N/A
  EEH: PHB#0-PE#0 has failed 6 times in the
  last hour and has been permanently disabled.
  EEH: Unable to recover from failure from PHB#0-PE#0.
  Please try reseating or replacing it
  ixgbe 0000:01:00.1: Adapter removed
  kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache joydev input_leds mac_hid idt_89hpesx ofpart ipmi_powernv cmdlinepart ipmi_devintf ipmi_msghandler at24 powernv_flash mtd opal_prd ibmpowernv uio_pdrv_genirq vmx_crypto uio sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas qla2xxx ast hid_generic ttm drm_kms_helper ixgbe syscopyarea usbhid igb sysfillrect sysimgblt nvme_fc fb_sys_fops hid nvme_fabrics crct10dif_vpmsum crc32c_vpmsum drm i40e scsi_transport_fc aacraid i2c_algo_bit mdio
  CPU: 28 PID: 972 Comm: eehd Not tainted 4.15.0-10-generic #11-Ubuntu
  NIP:  c00000000077f080 LR: c00000000077f070 CTR: c0000000000aac30
  REGS: c000000ff1deb5a0 TRAP: 0700   Not tainted  (4.15.0-10-generic)
  MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002822  XER: 20040000
  CFAR: c00000000018bddc SOFTE: 1
  GPR00: c00000000077f070 c000000ff1deb820 c0000000016ea600 c000000fbb5fac00
  GPR04: 00000000000002c5 0000000000000000 0000000000000000 0000000000000000
  GPR08: c000000fbb5fac00 0000000000000001 c000000fec617a00 c000000fdfd86488
  GPR12: 0000000000000040 c000000007a33400 c000000000138be8 c000000ff90ec1c0
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000f48d10
  GPR24: c000000000f48ce8 c000200e4fcf4000 c000000fc6900b18 c000200e4fcf4000
  GPR28: c000200e4fcf4288 c008000010624480 0000000000000000 c000000fbb633ea0
  NIP [c00000000077f080] free_msi_irqs+0xa0/0x260
  LR [c00000000077f070] free_msi_irqs+0x90/0x260
  Call Trace:
  [c000000ff1deb820] [c00000000077f070] free_msi_irqs+0x90/0x260 (unreliable)
  [c000000ff1deb880] [c00000000077fa68] pci_disable_msix+0x128/0x170
  [c000000ff1deb8c0] [c00800001060b5c8] ixgbe_reset_interrupt_capability+0x90/0xd0 [ixgbe]
  [c000000ff1deb8f0] [c0080000105d52f4] ixgbe_remove+0xec/0x240 [ixgbe]
  [c000000ff1deb990] [c0000000007670ec] pci_device_remove+0x6c/0x110
  [c000000ff1deb9d0] [c00000000085d194] device_release_driver_internal+0x224/0x310
  [c000000ff1deba20] [c00000000075b398] pci_stop_bus_device+0x98/0xe0
  [c000000ff1deba60] [c00000000075b588] pci_stop_and_remove_bus_device+0x28/0x40
  [c000000ff1deba90] [c00000000005e1d0] pci_hp_remove_devices+0x90/0x130
  [c000000ff1debb20] [c00000000005e184] pci_hp_remove_devices+0x44/0x130
  [c000000ff1debbb0] [c00000000003ec04] eeh_handle_normal_event+0x134/0x580
  [c000000ff1debc60] [c00000000003f160] eeh_handle_event+0x30/0x338
  [c000000ff1debd10] [c00000000003f830] eeh_event_handler+0x140/0x200
  [c000000ff1debdc0] [c000000000138d88] kthread+0x1a8/0x1b0
  [c000000ff1debe30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
  Instruction dump:
  419effe0 3bc00000 4800000c 60420000 807f0010 7c7e1a14 78630020 4ba0cd3d
  60000000 e9430158 312affff 7d295110 <0b090000> 813f0014 395e0001 7d5e07b4
  ---[ end trace 23c446a470e60864 ]---
  ixgbe 0000:01:00.0: Adapter removed

  Sending IPI to other CPUs
  OPAL: Switch to big-endian OS
  OPAL: Switch to little-endian OS
  PHB#0000[0:0]: eeh_freeze_clear on fenced PHB

   
  ---uname output---
  Linux ltciofvtr-bostonlc1 4.15.0-10-generic #11-Ubuntu SMP Tue Feb 13 18:21:52 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
   
  Machine Type = Boston-LC 

  0000:00:00.0 PCI bridge [0604]: IBM Device [1014:04c1]
  0000:01:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
  0000:01:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)

  # ethtool  -i enp1s0f0
  driver: ixgbe
  version: 5.1.0-k
  firmware-version: 0x800006da
  expansion-rom-version: 
  bus-info: 0000:01:00.0
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: yes
  supports-register-dump: yes
  supports-priv-flags: yes

   Userspace tool common name: EEH

  == Comment: #6 - Mauro Rodrigues <> - 2018-03-19 11:54:03 ==
  Although it probably will not be accepted as is, I'll send a solution
  upstream.

  Long story short: we add ixgbe_free_irq right before
  ixgbe_clear_interrupt_scheme in ixgbe_remove.
  That has a side effect: the fix targets the hotplug remove path, but with
  the patch applied the usual removal path (for instance, unbind via sysfs)
  frees the interrupts twice.
  To avoid that, I'll send a patch that integrates the free_irq call into the
  clear-interrupt-scheme code path.
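
  The ordering problem described above can be sketched with a toy model. This
  is not the ixgbe driver and not the submitted patch: ToyAdapter and every
  method name below are hypothetical, and the model assumes that the BUG at
  msi.c:352 is the check that no handler is still registered on a vector
  being freed. It only illustrates why folding an idempotent free step into
  the clear-scheme path would cover both removal paths.

```python
class ToyAdapter:
    """Hypothetical stand-in for a NIC's interrupt state (not ixgbe code)."""

    def __init__(self, nvec):
        self.msix_enabled = True
        # One flag per MSI-X vector: True while a handler is still registered.
        self.has_action = [True] * nvec

    def free_irq(self):
        """Release all handlers; returns False if there was nothing to free."""
        if not any(self.has_action):
            return False  # already freed, so a second call is a no-op
        self.has_action = [False] * len(self.has_action)
        return True

    def disable_msix(self):
        """Assumed analogue of the failing check: no vector may keep a handler."""
        if any(self.has_action):
            raise RuntimeError("BUG: vector freed while a handler is registered")
        self.msix_enabled = False

    def clear_interrupt_scheme(self):
        """Integrated path: free handlers first, then tear down the vectors."""
        self.free_irq()
        self.disable_msix()


def remove_after_pci_error(adapter):
    # Error-recovery removal: nothing released the handlers beforehand.
    adapter.clear_interrupt_scheme()


def remove_after_unbind(adapter):
    # Usual removal: the close path has already released the handlers.
    adapter.free_irq()
    adapter.clear_interrupt_scheme()
```

  In this model, calling disable_msix() on a fresh ToyAdapter raises, which
  mirrors the crash in the trace, while both remove_* helpers finish cleanly
  because the second free is skipped.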

  == Comment: #8 - Mauro Rodrigues <> - 2018-04-18 12:23:34 ==
  Waiting for upstream feedback at:
  http://patchwork.ozlabs.org/patch/900279/

  which reads: "ixgbe: Fix free irq process when removing device due to PCI Errors"

  == Comment: #9 - Mauro Rodrigues <> - 2018-05-03 11:56:49 ==
  The v3 of the patch is going through Intel's queue for further testing:
  http://patchwork.ozlabs.org/patch/907695/
  which reads: "ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"

  
  == Comment: #11 - Mauro Rodrigues <> - 2018-06-11 10:06:35 ==
  This got merged into Torvalds' tree last week and I didn't notice until now.

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/intel/ixgbe?id=b212d815e77c72be921979119c715166cc8987b1

  which reads:
  "ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"

  I'll submit it to the Canonical mailing list today.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1776389/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
