------- Comment From naf...@cn.ibm.com 2022-08-30 12:28 EDT-------
@Canonical,  I'd like to know that when will the fix be back ported to hwe-5.4 
kernel? Will it be a new hotpatch like this one ? 
https://launchpad.net/~canonical-support-eng/+archive/ubuntu/sf334332 or 
something else. Thanks so much. Sorry I am not familiar with the process, and 
just like to know how can I include the fix in our cloud environment. Thanks.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1987287

Title:
  [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes
  during recovery

Status in Ubuntu on IBM z Systems:
  New
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  New

Bug description:
  We recently got a bug report for systems running Ubuntu 20.04 that were
  crashing with backtraces pointing at the mlx5 driver's handling of 
mlx5_ethtool_get_link_ksettings()
  when this is called through the sysfs (going through ethtool might have 
different checks).
  I managed to find a reliable way to reproduce the issue that I believe isn't 
tied to IBM Z at all.

  The procedure to reproduce is as follows. I created a script to read
  the sysfs attributes for the link's speed and duplex mode in a loop:

  #!/usr/bin/env bash

  if [ $# -lt 1 ]; then
          echo "Usage: $0 <netif>"
          exit 1
  fi

  while true; do
          cat /sys/class/net/$1/duplex > /dev/null
          cat /sys/class/net/$1/speed > /dev/null
  done

  Executed with:

  # ./script.sh enP10p0s0

  I ran this in one bash session and then in another one I triggered a PCI 
reset with
  the follwoing command where one needs to replace <dev> with the PCI address 
of the NIC:

  echo 1 > /sys/bus/pci/devices/<dev>/reset

  Then first I got a lot of the following messages:

   mlx5_core 0010:00:00.0 enP16p0s0: mlx5e_ethtool_get_link_ksettings:
  query port ptys failed: -5

  And then as the mlx5 driver's recovery kicks in the oops as below:

  [  659.103947] mlx5_core 0010:00:00.0: wait vital counter value 0x7b399f 
after 1 iterations
  [  659.103947] mlx5_core 0010:00:00.0: mlx5_pci_resume was called
  [  659.103966] mlx5_core 0010:00:00.0: firmware version: 14.32.1010
  [  659.104169] Unable to handle kernel pointer dereference in virtual kernel 
address space
  [  659.104171] Failing address: 0000000000000000 TEID: 0000000000000483
  [  659.104172] Fault in home space mode while using kernel ASCE.
  [  659.104173] AS:000000003d29c007 R3:00000000fffd0007 S:00000000fffd5800 
P:000000000000003d
  [  659.104200] Oops: 0004 ilc:2 [#1] SMP
  [  659.104202] Modules linked in: s390_trng ism smc pnet chsc_sch eadm_sch 
vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio sch_fq_codel drm 
drm_panel_orientation_quirks i2c_core ip_tables x_tables btrfs zstd_compress 
zlib_deflate raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx xor raid6_pq libcrc32c raid1 raid0 linear mlx5_ib dm_service_time pkey 
zcrypt crc32_vx_s390 ib_uverbs ghash_s390 ib_core qeth_l2 prng aes_s390 
des_s390 nvme libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 
sha1_s390 sha_common mlx5_core tls mlxfw ptp nvme_core pps_core dasd_eckd_mod 
dasd_mod zfcp scsi_transport_fc qeth qdio ccwgroup scsi_dh_emc scsi_dh_rdac 
scsi_dh_alua dm_multipath
  [  659.104232] CPU: 6 PID: 438216 Comm: cat Not tainted 5.4.0-124-generic 
#140-Ubuntu
  [  659.104233] Hardware name: IBM 3931 XYZ XXXX (LPAR)
  [  659.104234] Krnl PSW : 0404c00180000000 000000003bfa661e 
(__queue_work+0xfe/0x520)
  [  659.104241]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 
RI:0 EA:3
  [  659.104242] Krnl GPRS: 000000003c291570 0000000000000000 0000000000000007 
000000007fffffff
  [  659.104243]            00000000e2fe46e0 0000000fffffffe0 0000000000000006 
000000003d039588
  [  659.104244]            0000000000000000 0000000000000000 00000000e2fe46e0 
00000000bfb3e000
  [  659.104245]            00000000e194c400 000003e007d6fb78 000000003bfa6602 
000003e007d6f860
  [  659.104251] Krnl Code: 000000003bfa6612: a77400e5            brc     
7,000000003bfa67dc
                            000000003bfa6616: 582003ac            l       
%r2,940
                           #000000003bfa661a: a7180000            lhi     %r1,0
                           >000000003bfa661e: ba129000            cs      
%r1,%r2,0(%r9)
                            000000003bfa6622: a77401a7            brc     
7,000000003bfa6970
                            000000003bfa6626: e310b0180012        lt      
%r1,24(%r11)
                            000000003bfa662c: a78400ff            brc     
8,000000003bfa682a
                            000000003bfa6630: c0040000004b        brcl    
0,000000003bfa66c6
  [  659.104261] Call Trace:
  [  659.104263] ([<0000000000000000>] 0x0)
  [  659.104265]  [<000000003bfa6aa2>] queue_work_on+0x62/0x70
  [  659.104329]  [<000003ff80a2920a>] cmd_exec+0x4ea/0x840 [mlx5_core]
  [  659.104349]  [<000003ff80a29680>] mlx5_cmd_exec+0x40/0x70 [mlx5_core]
  [  659.104369]  [<000003ff80a334a8>] mlx5_core_access_reg+0x108/0x150 
[mlx5_core]
  [  659.104387]  [<000003ff80a3354e>] mlx5_query_port_ptys+0x5e/0x70 
[mlx5_core]
  [  659.104407]  [<000003ff80a5b928>] 
mlx5e_ethtool_get_link_ksettings+0x58/0x460 [mlx5_core]
  [  659.104410]  [<000000003c662a68>] duplex_show+0x78/0xe0
  [  659.104414]  [<000000003c538ecc>] dev_attr_show+0x2c/0x70
  [  659.104417]  [<000000003c293386>] sysfs_kf_seq_show+0xa6/0x150
  [  659.104420]  [<000000003c207470>] seq_read+0xe0/0x4f0
  [  659.104422]  [<000000003c1d4cf4>] vfs_read+0x94/0x160
  [  659.104423]  [<000000003c1d4ea8>] ksys_read+0x68/0x100
  [  659.104426]  [<000000003c7f2034>] system_call+0xd8/0x2c8
  [  659.104427] Last Breaking-Event-Address:
  [  659.104428]  [<000000003bfa65d6>] __queue_work+0xb6/0x520
  [  659.104430] ---[ end trace 9fc1a6358b456876 ]---

  Digging into the code and git history I found the following upstream commit 
added in v5.7
  which besides being part of the 5.6.x stable patches somehow didn't make it 
into
  the 5.4.x stable queue nor Ubuntu 20.04, possibly because there is a 
(trivial) merge conflict:

  commit f7936ddd35d8b849daf0372770c7c9dbe7910fca
  Author: Eran Ben Elisha <era...@mellanox.com>
  Date:   Thu Mar 19 21:43:13 2020 +0200

      net/mlx5: Avoid processing commands before cmdif is ready

      When driver is reloading during recovery flow, it can't get new commands
      till command interface is up again. Otherwise we may get to null pointer
      trying to access non initialized command structures.

      Add cmdif state to avoid processing commands while cmdif is not
  ready.

      Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <era...@mellanox.com>
      Signed-off-by: Moshe Shemesh <mo...@mellanox.com>
      Signed-off-by: Saeed Mahameed <sae...@mellanox.com>

  With a quick backport onto 5.4.0-124.140 (patch attached below) the issue is 
gone
  and the system no longer crashes but instead recovers successfully. I believe 
the
  crash we saw is then exactly the null pointer access mentioned in the commit 
description.

  == Comment: #4 - Niklas Schnelle - 2022-08-22 06:41:41 ==
  There was only a trivial merge conflict where the context
  of the struct decleration changed.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1987287/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to