[Bug 1850550] Re: megaraid_sas driver in linux-modules-4.15.0-66-generic prevents server with "Cisco UCS C3000 RAID Controller for M4 Server Blade" from booting

2019-11-01 Thread Steffen Higel
We were pointed at newer firmware for the RAID controller, 29.00.1-0356
(it wasn't immediately clear that this was available for our generation
of blade). With it, the system boots properly on both 4.15.0-66-generic
and 5.0.0-23-generic.

I think that this should be closed.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1850550

Title:
  megaraid_sas driver in linux-modules-4.15.0-66-generic prevents server
  with "Cisco UCS C3000 RAID Controller for M4 Server Blade"  from
  booting

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1850550/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1850550] [NEW] megaraid_sas driver in linux-modules-4.15.0-66-generic prevents server with "Cisco UCS C3000 RAID Controller for M4 Server Blade" from booting

2019-10-29 Thread Steffen Higel
Public bug reported:

TLDR: system with rare LSI controller stopped booting when we upgraded
to kernel 4.15.0-66-generic.

The controller is a Cisco UCS C3000 RAID Controller for M4 Server Blade
with 4G RAID Cache, which Cisco tell me is a dual chip Broadcom 3316
ROC. To my knowledge, it is only used in their 56 disk 4U storage
server, so is probably not the most common of devices.

The controller is running the latest firmware from Cisco (29.00.1-0110,
from 2015). It has 1 RAID1 VD and 56 drives exposed as JBODs.

Two different systems (16.04) with this controller went down for
automatic reboots last week and failed to come back up. The BMC notified
us that the RAID controller was in a faulted state, so we initially
assumed a hardware failure, but this turned out not to be the case.

The previous kernel version (4.15.0-65) boots fine. The boot failure also
occurs with kernel 5.0.0-23, which we tried after moving one of the
systems up to 18.04.

On booting, we get dumped back to an initramfs prompt, and dmesg shows
some concerning stuff from the megaraid_sas driver:

[1.845268] megaraid_sas :12:00.0: FW now in Ready state
[1.845270] megaraid_sas :12:00.0: 32 bit DMA mask and 32 bit consistent mask
[1.845933] megaraid_sas :12:00.0: firmware supports msix: (16)
[1.845935] megaraid_sas :12:00.0: current msix/online cpus  : (16/40)
[1.845936] megaraid_sas :12:00.0: RDPQ mode : (disabled)
[1.845937] megaraid_sas :12:00.0: Current firmware supports maximum commands: 928 LDIO threshold: 0
[1.846204] megaraid_sas :12:00.0: Configured max firmware commands: 927
[1.849914] megaraid_sas :12:00.0: FW supports sync cache: No
[1.931975] megaraid_sas :12:00.0: firmware type : Extended VD(240 VD)firmware
[1.931977] megaraid_sas :12:00.0: controller type   : MR(4095MB)
[1.931978] megaraid_sas :12:00.0: Online Controller Reset(OCR)  : Enabled
[1.931979] megaraid_sas :12:00.0: Secure JBOD support   : Yes
[1.964209] megaraid_sas :12:00.0: INIT adapter done
[2.210527] megaraid_sas :12:00.0: pci id: (0x1000)/(0x00ce)/(0x1137)/(0x0197)
[2.210528] megaraid_sas :12:00.0: unevenspan support: no
[2.210529] megaraid_sas :12:00.0: firmware crash dump   : no
[2.210530] megaraid_sas :12:00.0: jbod sync map : yes
[2.588209] megaraid_sas :12:00.0: Iop2SysDoorbellIntfor scsi0
[2.588222] megaraid_sas :12:00.0: Found FW in FAULT state, will reset adapter scsi0.
[2.588223] megaraid_sas :12:00.0: resetting fusion adapter scsi0.
[2.588367] megaraid_sas :12:00.0: Reset not supported, killing adapter scsi0.

At this point, booting fails, presumably because the boot device has
disappeared.
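
For anyone who wants to check a saved dmesg capture for the same
sequence, here is a minimal sketch in Python. The function name
find_megaraid_faults is made up for illustration, and the two message
fragments are copied from the log above, so other driver versions may
word them differently.

```python
import re

# Message fragments copied from the dmesg output above. Assumption: other
# megaraid_sas versions may phrase these differently.
FAULT_PATTERNS = (
    "Found FW in FAULT state",
    "Reset not supported, killing adapter",
)

# Matches lines such as "[2.588222] megaraid_sas :12:00.0: <message>".
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+megaraid_sas\s+(?P<dev>\S+):\s+(?P<msg>.*)$"
)


def find_megaraid_faults(lines):
    """Return (timestamp, device, message) for each fault-related line."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and any(p in m.group("msg") for p in FAULT_PATTERNS):
            hits.append((float(m.group("ts")), m.group("dev"), m.group("msg")))
    return hits
```

It takes any iterable of lines, for example an open file holding the
output of dmesg, and ignores lines that do not come from megaraid_sas.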

I’m not sure whether the “FAULT state” message is accurate. The device
comes up fine without error when we reboot on the previous kernel
version, and the device logs (storcli /c0 show termlog) don’t indicate
any issues when the system successfully boots. The battery is fine. The
only hint we see in the terminal logs is:

T0: C0:supported dgbflags:
T0: C0:biosDisable: 0
T0: C0:ddrDisable: 0
T0: C0: *** HW Encryption Disabled : dcrReg=0
T0: C0:Reading Detroit Cache enable at DCR cache config register: 0x0
T0: C0:Reading Detroit Cache init at DCR cache control/status register: 0x0
T0: C0:TreeVelleInit Complete (Velle Config register 103ff) 
T0: C0:RegionLockMaroInit Complete (Maro config register c00103ff 
T0: C0:DRAM_LOCAL_BASE: 4000
T0: C0:MEM_FIXED_SIZE: 180
T0: C0:MEM_FIXED_END: 4180
T0: C0:FW_DRAM_REGION_START: 4180
T0: C0:FW_DRAM_REGION_SIZE: 290
T0: C0:MEM_POOL_BASE: 43b25ca0
T0: C0:Initializing memory pool size=005DA360 bytes
T0: C0:I2Chandle obtained for MUX [0]0x0 
T0: C0:I2Chandle obtained for MUX [1]0x10 
T0: C0:I2Chandle obtained for MUX [5]0x50 
T0: C0:I2Chandle obtained for MUX [2]0x20 
T0: C0:I2Chandle obtained for MUX [3]0x30 
T0: C0:I2Chandle obtained for MUX [4]0x40 
T0: C0:I2Chandle obtained for MUX [9]0x90 
T0: C0:LogInit: Flushing events from previous boot
T0: C0:EVT#19168-10/29/19 12:24:06:  15=Fatal firmware error: Line 1711 in ../../raid/2108vI2o.c

We noticed the following changes in that kernel update:

- scsi: megaraid_sas: Fix combined reply queue mode detection
- scsi: megaraid_sas: Add check for reset adapter bit

I’m not sure if this is a firmware issue, or a bug in the driver, but
we’re effectively stuck on 4.15.0-65 for now.

** Affects: linux-hwe (Ubuntu)
 Importance: Undecided
 Status: New

To manage notifications about this bug go to: