A v5.4 test kernel (the kernel series in which the problem was found)
has been tested by the Field Engineer, and here is the outcome:

"
- Extensive testing, about 4/5 failovers: both the HWE (v5.11) and the patched
(v5.4) kernels seem stable.

Thank you, this unblocks us for deployment of this cloud.
"

- Eric

** Description changed:

  [Impact]
  
  The following has been brought to my attention:
  
  "
  We have been experiencing node lockups and degradation when testing fiber
  channel failover for multipath PURESTORAGE drives.
  
  Testing usually consists of either failing over the fabric or the local
  I/O module for the Cisco chassis which houses a number of individual
  blades.
  
  After rebooting a local chassis I/O module we see commands like
  multipath -ll hanging.
  Resetting the blade's individual fiber channel interface results in the
  following messages.
  "
  
  [6051160.241383]  rport-9:0-1: blocked FC remote port time out: removing
target and saving binding
  [6051160.252901] BUG: kernel NULL pointer dereference, address: 
0000000000000040
  [6051160.262267] #PF: supervisor read access in kernel mode
  [6051160.269314] #PF: error_code(0x0000) - not-present page
  [6051160.276016] PGD 0 P4D 0
  [6051160.279807] Oops: 0000 [#1] SMP NOPTI
  [6051160.284642] CPU: 10 PID: 49346 Comm: kworker/10:2 Tainted: P           O 
     5.4.0-77-generic #86-Ubuntu
  [6051160.295967] Hardware name: Cisco Systems Inc UCSB-B200-M5/UCSB-B200-M5, 
BIOS B200M5.4.1.1d.0.0609200543 06/09/2020
  [6051160.308199] Workqueue: fc_dl_9 fc_timeout_deleted_rport 
[scsi_transport_fc]
  [6051160.316640] RIP: 0010:fnic_terminate_rport_io+0x10f/0x510 [fnic]
  [6051160.324050] Code: 48 89 c3 48 85 c0 0f 84 7b 02 00 00 48 05 20 01 00 00 
48 89 45 b0 0f 84 6b 02 00 00 48 8b 83 58 01 00 00 48 8b 80 b8 01 00 00 <48> 8b 
78 40 e8 68 e6 06 00 85 c0 0f 84 4c 02 00 00 48 8b 83 58 01
  [6051160.346553] RSP: 0018:ffffbc224f297d90 EFLAGS: 00010082
  [6051160.353115] RAX: 0000000000000000 RBX: ffff90abdd4c4b00 RCX: 
ffff90d8ab2c2bb0
  [6051160.361983] RDX: ffff90d8b5467400 RSI: 0000000000000000 RDI: 
ffff90d8ab3b4b40
  [6051160.370812] RBP: ffffbc224f297df8 R08: ffff90d8c08978c8 R09: 
ffff90d8b8850800
  [6051160.379518] R10: ffff90d8a59d64c0 R11: 0000000000000001 R12: 
ffff90d8ab2c31f8
  [6051160.388242] R13: 0000000000000000 R14: 0000000000000246 R15: 
ffff90d8ab2c27b8
  [6051160.396953] FS:  0000000000000000(0000) GS:ffff90d8c0880000(0000) 
knlGS:0000000000000000
  [6051160.406838] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [6051160.414168] CR2: 0000000000000040 CR3: 0000000fc1c0a004 CR4: 
00000000007626e0
  [6051160.423146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
  [6051160.431884] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
  [6051160.440615] PKRU: 55555554
  [6051160.444337] Call Trace:
  [6051160.447841]  fc_terminate_rport_io+0x56/0x70 [scsi_transport_fc]
  [6051160.455263]  fc_timeout_deleted_rport.cold+0x1bc/0x2c7 
[scsi_transport_fc]
  [6051160.463623]  process_one_work+0x1eb/0x3b0
  [6051160.468784]  worker_thread+0x4d/0x400
  [6051160.473660]  kthread+0x104/0x140
  [6051160.478102]  ? process_one_work+0x3b0/0x3b0
  [6051160.483439]  ? kthread_park+0x90/0x90
  [6051160.488213]  ret_from_fork+0x1f/0x40
  [6051160.492901] Modules linked in: dm_service_time zfs(PO) zunicode(PO) 
zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) ebtable_filter 
ebtables ip6table_raw ip6table_mangle ip6table_nat iptable_raw iptable_mangle 
iptable_nat nf_nat vhost_vsock vmw_vsock_virtio_transport_common vsock 
unix_diag nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
vhost_net vhost tap 8021q garp mrp bluetooth ecdh_generic ecc tcp_diag 
inet_diag sctp nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter 
bpfilter bridge stp llc nls_iso8859_1 dm_queue_length dm_multipath scsi_dh_rdac 
scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common isst_if_common 
skx_edac nfit x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp 
kvm_intel kvm rapl input_leds joydev intel_cstate mei_me ioatdma mei dca 
ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid 
sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 
async_raid6_recov async_memcpy async_pq async_xor
  [6051160.492928]  async_tx xor raid6_pq libcrc32c raid1 raid0 multipath 
linear fnic mgag200 drm_vram_helper i2c_algo_bit ttm drm_kms_helper 
crct10dif_pclmul syscopyarea hid_generic crc32_pclmul libfcoe sysfillrect 
ghash_clmulni_intel sysimgblt aesni_intel fb_sys_fops crypto_simd libfc usbhid 
cryptd scsi_transport_fc hid drm glue_helper enic ahci lpc_ich libahci wmi
  [6051160.632623] CR2: 0000000000000040
  [6051160.637043] ---[ end trace 236e6f4850146477 ]---
  
  [Test Plan]
  
+ <sbparke> ???
+ 
  [Where problems could occur]
+ 
+ The Cisco "fNIC" driver enables FCoE support for the Cisco UCS Virtual
+ Interface Card family of products.
+ 
+ If a problem arises, it would be limited to these VICs, which are designed
+ specifically for Cisco UCS blade and rack servers, and in the worst case
+ would affect the commands that terminate I/O (again, only on the Cisco UCS
+ hardware family).
  
  [Other information]
  
  https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2792832_1.html#FIX
  https://www.spinics.net/lists/linux-scsi/msg142179.html

** Description changed:

  [Impact]
  
  The following has been brought to my attention:
  
  "
  We have been experiencing node lockups and degradation when testing fiber
  channel failover for multipath PURESTORAGE drives.
  
  Testing usually consists of either failing over the fabric or the local
  I/O module for the Cisco chassis which houses a number of individual
  blades.
  
  After rebooting a local chassis I/O module we see commands like
  multipath -ll hanging.
  Resetting the blade's individual fiber channel interface results in the
  following messages.
  "
  
  [6051160.241383]  rport-9:0-1: blocked FC remote port time out: removing
target and saving binding
  [6051160.252901] BUG: kernel NULL pointer dereference, address: 
0000000000000040
  [6051160.262267] #PF: supervisor read access in kernel mode
  [6051160.269314] #PF: error_code(0x0000) - not-present page
  [6051160.276016] PGD 0 P4D 0
  [6051160.279807] Oops: 0000 [#1] SMP NOPTI
  [6051160.284642] CPU: 10 PID: 49346 Comm: kworker/10:2 Tainted: P           O 
     5.4.0-77-generic #86-Ubuntu
  [6051160.295967] Hardware name: Cisco Systems Inc UCSB-B200-M5/UCSB-B200-M5, 
BIOS B200M5.4.1.1d.0.0609200543 06/09/2020
  [6051160.308199] Workqueue: fc_dl_9 fc_timeout_deleted_rport 
[scsi_transport_fc]
  [6051160.316640] RIP: 0010:fnic_terminate_rport_io+0x10f/0x510 [fnic]
  [6051160.324050] Code: 48 89 c3 48 85 c0 0f 84 7b 02 00 00 48 05 20 01 00 00 
48 89 45 b0 0f 84 6b 02 00 00 48 8b 83 58 01 00 00 48 8b 80 b8 01 00 00 <48> 8b 
78 40 e8 68 e6 06 00 85 c0 0f 84 4c 02 00 00 48 8b 83 58 01
  [6051160.346553] RSP: 0018:ffffbc224f297d90 EFLAGS: 00010082
  [6051160.353115] RAX: 0000000000000000 RBX: ffff90abdd4c4b00 RCX: 
ffff90d8ab2c2bb0
  [6051160.361983] RDX: ffff90d8b5467400 RSI: 0000000000000000 RDI: 
ffff90d8ab3b4b40
  [6051160.370812] RBP: ffffbc224f297df8 R08: ffff90d8c08978c8 R09: 
ffff90d8b8850800
  [6051160.379518] R10: ffff90d8a59d64c0 R11: 0000000000000001 R12: 
ffff90d8ab2c31f8
  [6051160.388242] R13: 0000000000000000 R14: 0000000000000246 R15: 
ffff90d8ab2c27b8
  [6051160.396953] FS:  0000000000000000(0000) GS:ffff90d8c0880000(0000) 
knlGS:0000000000000000
  [6051160.406838] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [6051160.414168] CR2: 0000000000000040 CR3: 0000000fc1c0a004 CR4: 
00000000007626e0
  [6051160.423146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
  [6051160.431884] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
  [6051160.440615] PKRU: 55555554
  [6051160.444337] Call Trace:
  [6051160.447841]  fc_terminate_rport_io+0x56/0x70 [scsi_transport_fc]
  [6051160.455263]  fc_timeout_deleted_rport.cold+0x1bc/0x2c7 
[scsi_transport_fc]
  [6051160.463623]  process_one_work+0x1eb/0x3b0
  [6051160.468784]  worker_thread+0x4d/0x400
  [6051160.473660]  kthread+0x104/0x140
  [6051160.478102]  ? process_one_work+0x3b0/0x3b0
  [6051160.483439]  ? kthread_park+0x90/0x90
  [6051160.488213]  ret_from_fork+0x1f/0x40
  [6051160.492901] Modules linked in: dm_service_time zfs(PO) zunicode(PO) 
zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) ebtable_filter 
ebtables ip6table_raw ip6table_mangle ip6table_nat iptable_raw iptable_mangle 
iptable_nat nf_nat vhost_vsock vmw_vsock_virtio_transport_common vsock 
unix_diag nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
vhost_net vhost tap 8021q garp mrp bluetooth ecdh_generic ecc tcp_diag 
inet_diag sctp nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter 
bpfilter bridge stp llc nls_iso8859_1 dm_queue_length dm_multipath scsi_dh_rdac 
scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common isst_if_common 
skx_edac nfit x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp 
kvm_intel kvm rapl input_leds joydev intel_cstate mei_me ioatdma mei dca 
ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid 
sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 
async_raid6_recov async_memcpy async_pq async_xor
  [6051160.492928]  async_tx xor raid6_pq libcrc32c raid1 raid0 multipath 
linear fnic mgag200 drm_vram_helper i2c_algo_bit ttm drm_kms_helper 
crct10dif_pclmul syscopyarea hid_generic crc32_pclmul libfcoe sysfillrect 
ghash_clmulni_intel sysimgblt aesni_intel fb_sys_fops crypto_simd libfc usbhid 
cryptd scsi_transport_fc hid drm glue_helper enic ahci lpc_ich libahci wmi
  [6051160.632623] CR2: 0000000000000040
  [6051160.637043] ---[ end trace 236e6f4850146477 ]---
  
  [Test Plan]
  
  <sbparke> ???
  
  [Where problems could occur]
  
  The Cisco "fNIC" driver enables FCoE support for the Cisco UCS Virtual
  Interface Card family of products.
  
  If a problem arises, it would be limited to these VICs, which are designed
  specifically for Cisco UCS blade and rack servers, and in the worst case
  would affect the commands that terminate I/O (again, only on the Cisco UCS
  hardware family).
  
+ Note that the Field Engineer and I did test the patch on Cisco UCS hw, and
+ the problem did not reproduce with the patch.
+ 
  [Other information]
  
  https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2792832_1.html#FIX
  https://www.spinics.net/lists/linux-scsi/msg142179.html

** Changed in: linux (Ubuntu)
       Status: Incomplete => In Progress

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu)
       Status: In Progress => Fix Released

** Changed in: linux (Ubuntu Focal)
       Status: New => In Progress

** Changed in: linux (Ubuntu Focal)
     Assignee: (unassigned) => Eric Desrochers (slashd)

** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => Critical

** Changed in: linux (Ubuntu Focal)
   Importance: Critical => High

** Description changed:

  [Impact]
  
  The following has been brought to my attention:
  
  "
  We have been experiencing node lockups and degradation when testing fiber
  channel failover for multipath PURESTORAGE drives.
  
  Testing usually consists of either failing over the fabric or the local
  I/O module for the Cisco chassis which houses a number of individual
  blades.
  
  After rebooting a local chassis I/O module we see commands like
  multipath -ll hanging.
  Resetting the blade's individual fiber channel interface results in the
  following messages.
  "
  
  [6051160.241383]  rport-9:0-1: blocked FC remote port time out: removing
target and saving binding
  [6051160.252901] BUG: kernel NULL pointer dereference, address: 
0000000000000040
  [6051160.262267] #PF: supervisor read access in kernel mode
  [6051160.269314] #PF: error_code(0x0000) - not-present page
  [6051160.276016] PGD 0 P4D 0
  [6051160.279807] Oops: 0000 [#1] SMP NOPTI
  [6051160.284642] CPU: 10 PID: 49346 Comm: kworker/10:2 Tainted: P           O 
     5.4.0-77-generic #86-Ubuntu
  [6051160.295967] Hardware name: Cisco Systems Inc UCSB-B200-M5/UCSB-B200-M5, 
BIOS B200M5.4.1.1d.0.0609200543 06/09/2020
  [6051160.308199] Workqueue: fc_dl_9 fc_timeout_deleted_rport 
[scsi_transport_fc]
  [6051160.316640] RIP: 0010:fnic_terminate_rport_io+0x10f/0x510 [fnic]
  [6051160.324050] Code: 48 89 c3 48 85 c0 0f 84 7b 02 00 00 48 05 20 01 00 00 
48 89 45 b0 0f 84 6b 02 00 00 48 8b 83 58 01 00 00 48 8b 80 b8 01 00 00 <48> 8b 
78 40 e8 68 e6 06 00 85 c0 0f 84 4c 02 00 00 48 8b 83 58 01
  [6051160.346553] RSP: 0018:ffffbc224f297d90 EFLAGS: 00010082
  [6051160.353115] RAX: 0000000000000000 RBX: ffff90abdd4c4b00 RCX: 
ffff90d8ab2c2bb0
  [6051160.361983] RDX: ffff90d8b5467400 RSI: 0000000000000000 RDI: 
ffff90d8ab3b4b40
  [6051160.370812] RBP: ffffbc224f297df8 R08: ffff90d8c08978c8 R09: 
ffff90d8b8850800
  [6051160.379518] R10: ffff90d8a59d64c0 R11: 0000000000000001 R12: 
ffff90d8ab2c31f8
  [6051160.388242] R13: 0000000000000000 R14: 0000000000000246 R15: 
ffff90d8ab2c27b8
  [6051160.396953] FS:  0000000000000000(0000) GS:ffff90d8c0880000(0000) 
knlGS:0000000000000000
  [6051160.406838] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [6051160.414168] CR2: 0000000000000040 CR3: 0000000fc1c0a004 CR4: 
00000000007626e0
  [6051160.423146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
  [6051160.431884] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
  [6051160.440615] PKRU: 55555554
  [6051160.444337] Call Trace:
  [6051160.447841]  fc_terminate_rport_io+0x56/0x70 [scsi_transport_fc]
  [6051160.455263]  fc_timeout_deleted_rport.cold+0x1bc/0x2c7 
[scsi_transport_fc]
  [6051160.463623]  process_one_work+0x1eb/0x3b0
  [6051160.468784]  worker_thread+0x4d/0x400
  [6051160.473660]  kthread+0x104/0x140
  [6051160.478102]  ? process_one_work+0x3b0/0x3b0
  [6051160.483439]  ? kthread_park+0x90/0x90
  [6051160.488213]  ret_from_fork+0x1f/0x40
  [6051160.492901] Modules linked in: dm_service_time zfs(PO) zunicode(PO) 
zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) ebtable_filter 
ebtables ip6table_raw ip6table_mangle ip6table_nat iptable_raw iptable_mangle 
iptable_nat nf_nat vhost_vsock vmw_vsock_virtio_transport_common vsock 
unix_diag nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
vhost_net vhost tap 8021q garp mrp bluetooth ecdh_generic ecc tcp_diag 
inet_diag sctp nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter 
bpfilter bridge stp llc nls_iso8859_1 dm_queue_length dm_multipath scsi_dh_rdac 
scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common isst_if_common 
skx_edac nfit x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp 
kvm_intel kvm rapl input_leds joydev intel_cstate mei_me ioatdma mei dca 
ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid 
sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 
async_raid6_recov async_memcpy async_pq async_xor
  [6051160.492928]  async_tx xor raid6_pq libcrc32c raid1 raid0 multipath 
linear fnic mgag200 drm_vram_helper i2c_algo_bit ttm drm_kms_helper 
crct10dif_pclmul syscopyarea hid_generic crc32_pclmul libfcoe sysfillrect 
ghash_clmulni_intel sysimgblt aesni_intel fb_sys_fops crypto_simd libfc usbhid 
cryptd scsi_transport_fc hid drm glue_helper enic ahci lpc_ich libahci wmi
  [6051160.632623] CR2: 0000000000000040
  [6051160.637043] ---[ end trace 236e6f4850146477 ]---
  
  [Test Plan]
  
  <sbparke> ???
  
  [Where problems could occur]
  
  The Cisco "fNIC" driver enables FCoE support for the Cisco UCS Virtual
  Interface Card family of products.
  
  If a problem arises, it would be limited to these VICs, which are designed
  specifically for Cisco UCS blade and rack servers, and in the worst case
  would affect the commands that terminate I/O (again, only on the Cisco UCS
  hardware family).
  
  Note that the Field Engineer and I did test the patch on Cisco UCS hw, and
- the problem did not reproduce with the patch.
+ the problem did not reproduce with the patch, nor did we observe any
+ subsequent issues or regressions.
  
  [Other information]
  
  https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2792832_1.html#FIX
  https://www.spinics.net/lists/linux-scsi/msg142179.html

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1944586

Title:
  kernel bug found when disconnecting one fiber channel interface on
  Cisco Chassis with fnic DRV_VERSION "1.6.0.47"
