Public bug reported:

TLDR: A system with a rare LSI controller stopped booting after we upgraded to kernel 4.15.0-66-generic.

The controller is a Cisco UCS C3000 RAID Controller for M4 Server Blade with 4G RAID Cache, which Cisco tell me is a dual chip Broadcom 3316 ROC. To my knowledge, it is only used in their 56 disk 4U storage server, so it is probably not the most common of devices. The controller is running the latest firmware from Cisco (29.00.1-0110, from 2015). It has 1 RAID1 VD and 56 drives exposed as JBODs.

Two different systems (16.04) with this controller went down for automatic reboots last week and failed to come back up. The BMC notified us that the RAID controller was in a faulted state, so we initially assumed a hardware failure, but this turned out not to be the case. The previous kernel version (4.15.0-65) boots fine. The failure also happens with kernel 5.0.0-23, which we tried after upgrading one of the systems to 18.04.

On booting, we get dumped back to an initramfs prompt, and dmesg shows some concerning stuff from the megaraid_sas driver:

[ 1.845268] megaraid_sas 0000:12:00.0: FW now in Ready state
[ 1.845270] megaraid_sas 0000:12:00.0: 32 bit DMA mask and 32 bit consistent mask
[ 1.845933] megaraid_sas 0000:12:00.0: firmware supports msix : (16)
[ 1.845935] megaraid_sas 0000:12:00.0: current msix/online cpus : (16/40)
[ 1.845936] megaraid_sas 0000:12:00.0: RDPQ mode : (disabled)
[ 1.845937] megaraid_sas 0000:12:00.0: Current firmware supports maximum commands: 928 LDIO threshold: 0
[ 1.846204] megaraid_sas 0000:12:00.0: Configured max firmware commands: 927
[ 1.849914] megaraid_sas 0000:12:00.0: FW supports sync cache : No
[ 1.931975] megaraid_sas 0000:12:00.0: firmware type : Extended VD(240 VD)firmware
[ 1.931977] megaraid_sas 0000:12:00.0: controller type : MR(4095MB)
[ 1.931978] megaraid_sas 0000:12:00.0: Online Controller Reset(OCR) : Enabled
[ 1.931979] megaraid_sas 0000:12:00.0: Secure JBOD support : Yes
[ 1.964209] megaraid_sas 0000:12:00.0: INIT adapter done
[ 2.210527] megaraid_sas 0000:12:00.0: pci id : (0x1000)/(0x00ce)/(0x1137)/(0x0197)
[ 2.210528] megaraid_sas 0000:12:00.0: unevenspan support : no
[ 2.210529] megaraid_sas 0000:12:00.0: firmware crash dump : no
[ 2.210530] megaraid_sas 0000:12:00.0: jbod sync map : yes
[ 2.588209] megaraid_sas 0000:12:00.0: Iop2SysDoorbellIntfor scsi0
[ 2.588222] megaraid_sas 0000:12:00.0: Found FW in FAULT state, will reset adapter scsi0.
[ 2.588223] megaraid_sas 0000:12:00.0: resetting fusion adapter scsi0.
[ 2.588367] megaraid_sas 0000:12:00.0: Reset not supported, killing adapter scsi0.

At this point, booting fails, presumably because the boot device has disappeared.

I’m not sure whether the “FAULT state” message is true. The device comes up fine without error when we reboot on the previous kernel version, and the device logs (storcli /c0 show termlog) don’t indicate any issues when the system boots successfully. The battery is fine. The only hint we see in the terminal logs is:

T0: C0:supported dgbflags:
T0: C0: biosDisable: 0
T0: C0: ddrDisable: 0
T0: C0: *** HW Encryption Disabled : dcrReg=0
T0: C0:Reading Detroit Cache enable at DCR cache config register: 0x0
T0: C0:Reading Detroit Cache init at DCR cache control/status register: 0x0
T0: C0:TreeVelleInit Complete (Velle Config register 103ff)
T0: C0:RegionLockMaroInit Complete (Maro config register c00103ff
T0: C0:DRAM_LOCAL_BASE: 40000000
T0: C0:MEM_FIXED_SIZE: 1800000
T0: C0:MEM_FIXED_END: 41800000
T0: C0:FW_DRAM_REGION_START: 41800000
T0: C0:FW_DRAM_REGION_SIZE: 2900000
T0: C0:MEM_POOL_BASE: 43b25ca0
T0: C0:Initializing memory pool size=005DA360 bytes
T0: C0:I2Chandle obtained for MUX [0]0x0
T0: C0:I2Chandle obtained for MUX [1]0x10
T0: C0:I2Chandle obtained for MUX [5]0x50
T0: C0:I2Chandle obtained for MUX [2]0x20
T0: C0:I2Chandle obtained for MUX [3]0x30
T0: C0:I2Chandle obtained for MUX [4]0x40
T0: C0:I2Chandle obtained for MUX [9]0x90
T0: C0:LogInit: Flushing events from previous boot
T0: C0:EVT#19168-10/29/19 12:24:06: 15=Fatal firmware error: Line 1711 in ../../raid/2108vI2o.c

We noticed the following megaraid_sas changes in that kernel update:

- scsi: megaraid_sas: Fix combined reply queue mode detection
- scsi: megaraid_sas: Add check for reset adapter bit

I’m not sure if this is a firmware issue or a bug in the driver, but we’re effectively stuck on 4.15.0-65 for now.

** Affects: linux-hwe (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1850550

Title:
  megaraid_sas driver in linux-modules-4.15.0-66-generic prevents server with "Cisco UCS C3000 RAID Controller for M4 Server Blade" from booting
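
To check whether another machine hit the same failure, saved dmesg output can be searched for the two driver messages quoted in this report ("Found FW in FAULT state" followed by "Reset not supported, killing adapter"). The following is only a sketch under stated assumptions: the function name find_killed_adapters is hypothetical, and the patterns are copied from the log above, so other driver versions may word these messages differently.

```python
import re

# Message fragments copied from the dmesg output in this report.
# Assumption: other megaraid_sas versions may phrase these differently.
FAULT_RE = re.compile(r"megaraid_sas (\S+): Found FW in FAULT state")
KILL_RE = re.compile(r"megaraid_sas (\S+): Reset not supported, killing adapter")


def find_killed_adapters(dmesg_text):
    """Return PCI addresses of adapters reported in FAULT state and then killed."""
    faulted = set()
    killed = []
    for line in dmesg_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            # Remember the adapter; the kill message is expected later in the log.
            faulted.add(match.group(1))
            continue
        match = KILL_RE.search(line)
        if match and match.group(1) in faulted and match.group(1) not in killed:
            killed.append(match.group(1))
    return killed
```

Calling the function with the text of a saved dmesg capture returns the PCI addresses of affected adapters, or an empty list if the sequence does not appear. It only matches log text and says nothing about whether the firmware or the driver is at fault.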