Public bug reported:

TLDR: system with rare LSI controller stopped booting when we upgraded
to kernel 4.15.0-66-generic.

The controller is a Cisco UCS C3000 RAID Controller for M4 Server Blade
with 4G RAID Cache, which Cisco tell me is a dual chip Broadcom 3316
ROC. To my knowledge, it is only used in their 56 disk 4U storage
server, so is probably not the most common of devices.

The controller is running the latest firmware from Cisco (29.00.1-0110,
from 2015). It has 1 RAID1 VD and 56 drives exposed as JBODs.

Two different systems (16.04) with this controller went down for
automatic reboots last week and failed to come back up. The BMC notified
us that the RAID controller was in a faulted state, so we initially
assumed a hardware failure, but this turned out not to be the case.

The previous kernel version (4.15.0-65) boots fine. The failure also
occurs with kernel 5.0.0-23 after we upgraded one of the systems to
18.04.

On booting, we get dumped back to an initramfs prompt, and dmesg shows
some concerning stuff from the megaraid_sas driver:

[    1.845268] megaraid_sas 0000:12:00.0: FW now in Ready state
[    1.845270] megaraid_sas 0000:12:00.0: 32 bit DMA mask and 32 bit consistent mask
[    1.845933] megaraid_sas 0000:12:00.0: firmware supports msix        : (16)
[    1.845935] megaraid_sas 0000:12:00.0: current msix/online cpus      : (16/40)
[    1.845936] megaraid_sas 0000:12:00.0: RDPQ mode     : (disabled)
[    1.845937] megaraid_sas 0000:12:00.0: Current firmware supports maximum commands: 928        LDIO threshold: 0
[    1.846204] megaraid_sas 0000:12:00.0: Configured max firmware commands: 927
[    1.849914] megaraid_sas 0000:12:00.0: FW supports sync cache        : No
[    1.931975] megaraid_sas 0000:12:00.0: firmware type : Extended VD(240 VD)firmware
[    1.931977] megaraid_sas 0000:12:00.0: controller type       : MR(4095MB)
[    1.931978] megaraid_sas 0000:12:00.0: Online Controller Reset(OCR)  : Enabled
[    1.931979] megaraid_sas 0000:12:00.0: Secure JBOD support   : Yes
[    1.964209] megaraid_sas 0000:12:00.0: INIT adapter done
[    2.210527] megaraid_sas 0000:12:00.0: pci id                : (0x1000)/(0x00ce)/(0x1137)/(0x0197)
[    2.210528] megaraid_sas 0000:12:00.0: unevenspan support    : no
[    2.210529] megaraid_sas 0000:12:00.0: firmware crash dump   : no
[    2.210530] megaraid_sas 0000:12:00.0: jbod sync map         : yes
[    2.588209] megaraid_sas 0000:12:00.0: Iop2SysDoorbellIntfor scsi0
[    2.588222] megaraid_sas 0000:12:00.0: Found FW in FAULT state, will reset adapter scsi0.
[    2.588223] megaraid_sas 0000:12:00.0: resetting fusion adapter scsi0.
[    2.588367] megaraid_sas 0000:12:00.0: Reset not supported, killing adapter scsi0.

At this point, booting fails, presumably because the boot device has
disappeared.
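
For anyone who wants to check other hosts for the same symptom, here is
a rough sketch that scans saved dmesg output for the last two messages
in the excerpt above. The function name killed_adapters and the two
regular expressions are my own illustration; they only match the message
text as it appears in this report and may need adjusting for other
driver versions.

```python
import re

# Signature strings copied from the dmesg excerpt in this report.
# Assumption: other driver versions may word these messages differently.
FAULT_RE = re.compile(r"megaraid_sas (\S+): Found FW in FAULT state")
KILL_RE = re.compile(r"megaraid_sas (\S+): Reset not supported, killing adapter")


def killed_adapters(dmesg_text):
    """Return PCI addresses of adapters that faulted and were then killed."""
    faulted = set()
    killed = []
    for line in dmesg_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faulted.add(match.group(1))
            continue
        match = KILL_RE.search(line)
        if match and match.group(1) in faulted and match.group(1) not in killed:
            killed.append(match.group(1))
    return killed
```

Passing the text of a captured dmesg log to killed_adapters() yields a
list of PCI addresses; an empty list means the signature was not seen.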

I’m not sure whether the “FAULT state” message is accurate. The device
comes up fine without error when we reboot on the previous kernel
version, and the device logs (storcli /c0 show termlog) don’t indicate
any issues when the system successfully boots. The battery is fine. The
only hint we see in the terminal logs is:

T0: C0:supported dgbflags:
T0: C0:    biosDisable: 0
T0: C0:    ddrDisable: 0
T0: C0: *** HW Encryption Disabled : dcrReg=0
T0: C0:Reading Detroit Cache enable at DCR cache config register: 0x0
T0: C0:Reading Detroit Cache init at DCR cache control/status register: 0x0
T0: C0:TreeVelleInit Complete (Velle Config register 103ff) 
T0: C0:RegionLockMaroInit Complete (Maro config register c00103ff 
T0: C0:DRAM_LOCAL_BASE: 40000000
T0: C0:MEM_FIXED_SIZE: 1800000
T0: C0:MEM_FIXED_END: 41800000
T0: C0:FW_DRAM_REGION_START: 41800000
T0: C0:FW_DRAM_REGION_SIZE: 2900000
T0: C0:MEM_POOL_BASE: 43b25ca0
T0: C0:Initializing memory pool size=005DA360 bytes
T0: C0:I2Chandle obtained for MUX [0]0x0 
T0: C0:I2Chandle obtained for MUX [1]0x10 
T0: C0:I2Chandle obtained for MUX [5]0x50 
T0: C0:I2Chandle obtained for MUX [2]0x20 
T0: C0:I2Chandle obtained for MUX [3]0x30 
T0: C0:I2Chandle obtained for MUX [4]0x40 
T0: C0:I2Chandle obtained for MUX [9]0x90 
T0: C0:LogInit: Flushing events from previous boot
T0: C0:EVT#19168-10/29/19 12:24:06:  15=Fatal firmware error: Line 1711 in ../../raid/2108vI2o.c

We noticed the following changes in that kernel update:

    - scsi: megaraid_sas: Fix combined reply queue mode detection
    - scsi: megaraid_sas: Add check for reset adapter bit

I’m not sure if this is a firmware issue, or a bug in the driver, but
we’re effectively stuck on 4.15.0-65 for now.

** Affects: linux-hwe (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1850550

Title:
  megaraid_sas driver in linux-modules-4.15.0-66-generic prevents server
  with "Cisco UCS C3000 RAID Controller for M4 Server Blade"  from
  booting

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1850550/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
