Re: 9305-16i fault 0x5853 (regression from 4.17.2 to 4.19.2) and EEH fence errors
losure logical id(0x500062b203842300), slot(5)
[ 1727.534641] scsi target0:0:4: handle(0x001e), sas_addr(0x443322110600)
[ 1727.534643] scsi target0:0:4: enclosure logical id(0x500062b203842300), slot(4)
[ 1727.534733] scsi target0:0:6: handle(0x001f), sas_addr(0x443322111000)
[ 1727.534735] scsi target0:0:6: enclosure logical id(0x500062b203842300), slot(11)
[ 1727.534823] scsi target0:0:7: handle(0x0020), sas_addr(0x44332200)
[ 1727.534825] scsi target0:0:7: enclosure logical id(0x500062b203842300), slot(10)
[ 1727.534915] scsi target0:0:9: handle(0x0021), sas_addr(0x443322111300)
[ 1727.534918] scsi target0:0:9: enclosure logical id(0x500062b203842300), slot(9)
[ 1727.535004] scsi target0:0:8: handle(0x0022), sas_addr(0x443322111200)
[ 1727.535006] scsi target0:0:8: enclosure logical id(0x500062b203842300), slot(8)
[ 1727.535588] mpt3sas_cm0: search for end-devices: complete
[ 1727.535589] mpt3sas_cm0: search for end-devices: start
[ 1727.535590] mpt3sas_cm0: search for PCIe end-devices: complete
[ 1727.535591] mpt3sas_cm0: search for expanders: start
[ 1727.535592] mpt3sas_cm0: search for expanders: complete
[ 1727.535598] mpt3sas_cm0: hard reset: success
[ 1727.535601] EEH: PE#fd (PCI 0031:01:00.0): mpt3sas driver reports: 'recovered'
[ 1727.535602] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
[ 1727.535604] EEH: Notify device driver to resume
[ 1727.535606] EEH: Beginning: 'resume'
[ 1727.535608] EEH: PE#fd (PCI 0031:01:00.0): Invoking mpt3sas->resume()
[ 1727.535610] mpt3sas_cm0: PCI error: resume callback!!
[ 1727.536115] mpt3sas_cm0: removing unresponding devices: start
[ 1727.536118] mpt3sas_cm0: removing unresponding devices: end-devices
[ 1727.536121] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
[ 1727.536123] mpt3sas_cm0: removing unresponding devices: expanders
[ 1727.536124] mpt3sas_cm0: removing unresponding devices: complete
[ 1727.536132] mpt3sas_cm0: scan devices: start
[ 1727.536357] EEH: PE#fd (PCI 0031:01:00.0): mpt3sas driver reports: 'none'
[ 1727.536359] EEH: Finished:'resume'
[ 1727.536360] EEH: Recovery successful.
[ 1727.551703] mpt3sas_cm0: scan devices: expanders start
[ 1727.552312] mpt3sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
[ 1727.552313] mpt3sas_cm0: scan devices: expanders complete
[ 1727.552314] mpt3sas_cm0: scan devices: end devices start
[ 1727.561673] mpt3sas_cm0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[ 1727.561675] mpt3sas_cm0: scan devices: end devices complete
[ 1727.561676] mpt3sas_cm0: scan devices: pcie end devices start
[ 1727.561706] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[ 1727.561755] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[ 1727.561759] mpt3sas_cm0: break from pcie end device scan: ioc_status(0x0021), loginfo(0x3003011d)
[ 1727.561760] mpt3sas_cm0: pcie devices: pcie end devices complete
[ 1727.561761] mpt3sas_cm0: scan devices: complete
[ 1728.031230] sd 0:0:0:0: Power-on or device reset occurred
[ 1728.031235] sd 0:0:1:0: Power-on or device reset occurred
[ 1728.031236] sd 0:0:4:0: Power-on or device reset occurred
[ 1728.031238] sd 0:0:2:0: Power-on or device reset occurred
[ 1728.031240] sd 0:0:7:0: Power-on or device reset occurred
[ 1728.031243] sd 0:0:6:0: Power-on or device reset occurred
[ 1728.031245] sd 0:0:8:0: Power-on or device reset occurred
[ 1728.031260] sd 0:0:9:0: Power-on or device reset occurred
[ 1728.031719] sd 0:0:5:0: Power-on or device reset occurred
[ 1728.819966] mpt3sas_cm0: log_info(0x31120320): originator(PL), code(0x12), sub_code(0x0320)
[ 1729.281480] sd 0:0:8:0: Power-on or device reset occurred
[ 1729.443974] sd 0:0:3:0: Power-on or device reset occurred
[ 1730.031108] sd 0:0:8:0: Power-on or device reset occurred

On 11/18/18 7:52 PM, Matt Corallo wrote:
(not subscribed to lists, please keep me on CC)

When upgrading from 4.17.2 to 4.19.2, my 9305-16i started faulting on load (well, not so much load given it's all spinning disks, but once every hour or five during a btrfs rebalance). Mostly the disks would all come back online and it would just result in 30 seconds of disks offline, but occasionally it would fall offline completely. dmesg is below, though it's not so useful. Downgrading to 4.17.2 again fixed the issue completely.

[19983.155887] mpt3sas_cm0: fault_state(0x5853)!
[19983.155932] mpt3sas_cm0: sending diag reset !!
[19984.093087] mpt3sas_cm0: diag reset: SUCCESS
[19984.155056] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[19984.301067] mpt3sas_cm0: _base_display_fwpkg_version: complete
[19984.301400] mpt3sas_cm0: LSISAS3224: FWVersion(09.00.100.00), ChipRevision(0x01), BiosVersion(00.00.00.00)
[19984.301403] mpt3sas_cm0: Protocol=(
[19984.3014
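
For anyone trying to read the bare loginfo(0x310f0400) values in the "break from ... scan" lines: the driver already prints two log_info words in decoded form above (0x3003011d and 0x31120320), and both are consistent with a split of bits 31-28 bus type, 27-24 originator, 23-16 code, 15-0 sub_code. The following is only a rough sketch under that assumption, not the driver's code. The function names are made up for illustration, and the originator value 2 = IR is not shown anywhere in this log, so treat that entry as an assumption.

```python
# Sketch: split a 32-bit mpt3sas log_info word into its fields.
# Assumed layout (consistent with the two decoded lines in the dmesg above):
#   bits 31-28 bus type, 27-24 originator, 23-16 code, 15-0 sub_code.
# Originators 0 (IOP) and 1 (PL) appear in the log; 2 (IR) is an assumption.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value):
    """Return the fields of a log_info word as a dict."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


def format_log_info(value):
    """Format a log_info word the way the decoded dmesg lines look."""
    f = decode_log_info(value)
    return "log_info(0x%08x): originator(%s), code(0x%02x), sub_code(0x%04x)" % (
        value, f["originator"], f["code"], f["sub_code"])


if __name__ == "__main__":
    # The first two words are printed decoded in the log; the third only
    # appears as a bare loginfo(...) in the scan messages.
    for word in (0x3003011D, 0x31120320, 0x310F0400):
        print(format_log_info(word))
```

Under that layout, 0x310f0400 would come from the PL originator with code 0x0f and sub_code 0x0400. What that code means is not something this log tells us.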
9305-16i fault 0x5853 (regression from 4.17.2 to 4.19.2)
(not subscribed to lists, please keep me on CC)

When upgrading from 4.17.2 to 4.19.2, my 9305-16i started faulting on load (well, not so much load given it's all spinning disks, but once every hour or five during a btrfs rebalance). Mostly the disks would all come back online and it would just result in 30 seconds of disks offline, but occasionally it would fall offline completely. dmesg is below, though it's not so useful. Downgrading to 4.17.2 again fixed the issue completely.

[19983.155887] mpt3sas_cm0: fault_state(0x5853)!
[19983.155932] mpt3sas_cm0: sending diag reset !!
[19984.093087] mpt3sas_cm0: diag reset: SUCCESS
[19984.155056] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[19984.301067] mpt3sas_cm0: _base_display_fwpkg_version: complete
[19984.301400] mpt3sas_cm0: LSISAS3224: FWVersion(09.00.100.00), ChipRevision(0x01), BiosVersion(00.00.00.00)
[19984.301403] mpt3sas_cm0: Protocol=(
[19984.301404] Initiator
[19984.301405] ,Target
[19984.301406] ),
[19984.301407] Capabilities=(
[19984.301408] TLR
[19984.301410] ,EEDP
[19984.301411] ,Snapshot Buffer
[19984.301412] ,Diag Trace Buffer
[19984.301413] ,Task Set Full
[19984.301414] ,NCQ
[19984.301415] )
[19984.301473] mpt3sas_cm0: sending port enable !!
[19995.149962] mpt3sas_cm0: port enable: SUCCESS
[19995.150077] mpt3sas_cm0: search for end-devices: start
[19995.151143] scsi target0:0:0: handle(0x0019), sas_addr(0x44332211)
[19995.151147] scsi target0:0:0: enclosure logical id(0x500062b203842300), slot(3)
[19995.151196] scsi target0:0:1: handle(0x001a), sas_addr(0x443322110300)
[19995.151199] scsi target0:0:1: enclosure logical id(0x500062b203842300), slot(1)
[19995.151245] scsi target0:0:3: handle(0x001b), sas_addr(0x443322110500)
[19995.151247] scsi target0:0:3: enclosure logical id(0x500062b203842300), slot(6)
[19995.151293] scsi target0:0:2: handle(0x001c), sas_addr(0x443322110400)
[19995.151296] scsi target0:0:2: enclosure logical id(0x500062b203842300), slot(7)
[19995.151342] scsi target0:0:4: handle(0x001d), sas_addr(0x443322110600)
[19995.151345] scsi target0:0:4: enclosure logical id(0x500062b203842300), slot(4)
[19995.151391] scsi target0:0:6: handle(0x001e), sas_addr(0x443322111000)
[19995.151393] scsi target0:0:6: enclosure logical id(0x500062b203842300), slot(11)
[19995.151439] scsi target0:0:7: handle(0x001f), sas_addr(0x44332200)
[19995.151441] scsi target0:0:7: enclosure logical id(0x500062b203842300), slot(10)
[19995.151487] scsi target0:0:9: handle(0x0020), sas_addr(0x443322111300)
[19995.151490] scsi target0:0:9: enclosure logical id(0x500062b203842300), slot(9)
[19995.151535] scsi target0:0:10: handle(0x0021), sas_addr(0x443322111200)
[19995.151537] scsi target0:0:10: enclosure logical id(0x500062b203842300), slot(8)
[19995.151607] mpt3sas_cm0: search for end-devices: complete
[19995.151609] mpt3sas_cm0: search for end-devices: start
[19995.151611] mpt3sas_cm0: search for PCIe end-devices: complete
[19995.151613] mpt3sas_cm0: search for expanders: start
[19995.151614] mpt3sas_cm0: search for expanders: complete
[19995.151624] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[19995.151630] mpt3sas_cm0: removing unresponding devices: start
[19995.151631] mpt3sas_cm0: removing unresponding devices: end-devices
[19995.151633] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
[19995.151635] mpt3sas_cm0: removing unresponding devices: expanders
[19995.151636] mpt3sas_cm0: removing unresponding devices: complete
[19995.151642] mpt3sas_cm0: scan devices: start
[19995.152075] mpt3sas_cm0: scan devices: expanders start
[19995.152139] mpt3sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
[19995.152141] mpt3sas_cm0: scan devices: expanders complete
[19995.152142] mpt3sas_cm0: scan devices: end devices start
[19995.156007] mpt3sas_cm0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[19995.156009] mpt3sas_cm0: scan devices: end devices complete
[19995.156010] mpt3sas_cm0: scan devices: pcie end devices start
[19995.156028] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[19995.156047] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[19995.156053] mpt3sas_cm0: break from pcie end device scan: ioc_status(0x0021), loginfo(0x3003011d)
[19995.156054] mpt3sas_cm0: pcie devices: pcie end devices complete
[19995.156055] mpt3sas_cm0: scan devices: complete
[19995.650024] sd 0:0:0:0: Power-on or device reset occurred
[19995.650052] sd 0:0:6:0: Power-on or device reset occurred
[19996.239565] sd 0:0:10:0: Power-on or device reset occurred
[19996.341924] sd 0:0:9:0: Power-on or device reset occurred
[19996.650155] sd 0:0:1:0: Power-on or device reset occurred
[19996.650184] sd 0:0:3:0: Power-on or device reset occurred
[19996.650197] sd 0:0:2:0: Power-on or device reset occurred
[19996.65