Re: 9305-16i fault 0x5853 (regression from 4.17.2 to 4.19.2) and EEH fence errors

2018-12-06 Thread Matt Corallo
losure logical 
id(0x500062b203842300), slot(5)
[ 1727.534641] scsi target0:0:4: handle(0x001e), 
sas_addr(0x443322110600)
[ 1727.534643] scsi target0:0:4: enclosure logical 
id(0x500062b203842300), slot(4)
[ 1727.534733] scsi target0:0:6: handle(0x001f), 
sas_addr(0x443322111000)
[ 1727.534735] scsi target0:0:6: enclosure logical 
id(0x500062b203842300), slot(11)
[ 1727.534823] scsi target0:0:7: handle(0x0020), 
sas_addr(0x44332200)
[ 1727.534825] scsi target0:0:7: enclosure logical 
id(0x500062b203842300), slot(10)
[ 1727.534915] scsi target0:0:9: handle(0x0021), 
sas_addr(0x443322111300)
[ 1727.534918] scsi target0:0:9: enclosure logical 
id(0x500062b203842300), slot(9)
[ 1727.535004] scsi target0:0:8: handle(0x0022), 
sas_addr(0x443322111200)
[ 1727.535006] scsi target0:0:8: enclosure logical 
id(0x500062b203842300), slot(8)

[ 1727.535588] mpt3sas_cm0: search for end-devices: complete
[ 1727.535589] mpt3sas_cm0: search for end-devices: start
[ 1727.535590] mpt3sas_cm0: search for PCIe end-devices: complete
[ 1727.535591] mpt3sas_cm0: search for expanders: start
[ 1727.535592] mpt3sas_cm0: search for expanders: complete
[ 1727.535598] mpt3sas_cm0: hard reset: success
[ 1727.535601] EEH: PE#fd (PCI 0031:01:00.0): mpt3sas driver reports: 
'recovered'
[ 1727.535602] EEH: Finished:'slot_reset' with aggregate recovery 
state:'recovered'

[ 1727.535604] EEH: Notify device driver to resume
[ 1727.535606] EEH: Beginning: 'resume'
[ 1727.535608] EEH: PE#fd (PCI 0031:01:00.0): Invoking mpt3sas->resume()
[ 1727.535610] mpt3sas_cm0: PCI error: resume callback!!
[ 1727.536115] mpt3sas_cm0: removing unresponding devices: start
[ 1727.536118] mpt3sas_cm0: removing unresponding devices: end-devices
[ 1727.536121] mpt3sas_cm0:  Removing unresponding devices: pcie end-devices
[ 1727.536123] mpt3sas_cm0: removing unresponding devices: expanders
[ 1727.536124] mpt3sas_cm0: removing unresponding devices: complete
[ 1727.536132] mpt3sas_cm0: scan devices: start
[ 1727.536357] EEH: PE#fd (PCI 0031:01:00.0): mpt3sas driver reports: 'none'
[ 1727.536359] EEH: Finished:'resume'
[ 1727.536360] EEH: Recovery successful.
[ 1727.551703] mpt3sas_cm0: scan devices: expanders start
[ 1727.552312] mpt3sas_cm0: 	break from expander scan: 
ioc_status(0x0022), loginfo(0x310f0400)

[ 1727.552313] mpt3sas_cm0: scan devices: expanders complete
[ 1727.552314] mpt3sas_cm0: scan devices: end devices start
[ 1727.561673] mpt3sas_cm0: 	break from end device scan: 
ioc_status(0x0022), loginfo(0x310f0400)

[ 1727.561675] mpt3sas_cm0: scan devices: end devices complete
[ 1727.561676] mpt3sas_cm0: scan devices: pcie end devices start
[ 1727.561706] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), 
code(0x03), sub_code(0x011d)
[ 1727.561755] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), 
code(0x03), sub_code(0x011d)
[ 1727.561759] mpt3sas_cm0: 	break from pcie end device scan: 
ioc_status(0x0021), loginfo(0x3003011d)

[ 1727.561760] mpt3sas_cm0: pcie devices: pcie end devices complete
[ 1727.561761] mpt3sas_cm0: scan devices: complete
[ 1728.031230] sd 0:0:0:0: Power-on or device reset occurred
[ 1728.031235] sd 0:0:1:0: Power-on or device reset occurred
[ 1728.031236] sd 0:0:4:0: Power-on or device reset occurred
[ 1728.031238] sd 0:0:2:0: Power-on or device reset occurred
[ 1728.031240] sd 0:0:7:0: Power-on or device reset occurred
[ 1728.031243] sd 0:0:6:0: Power-on or device reset occurred
[ 1728.031245] sd 0:0:8:0: Power-on or device reset occurred
[ 1728.031260] sd 0:0:9:0: Power-on or device reset occurred
[ 1728.031719] sd 0:0:5:0: Power-on or device reset occurred
[ 1728.819966] mpt3sas_cm0: log_info(0x31120320): originator(PL), 
code(0x12), sub_code(0x0320)

[ 1729.281480] sd 0:0:8:0: Power-on or device reset occurred
[ 1729.443974] sd 0:0:3:0: Power-on or device reset occurred
[ 1730.031108] sd 0:0:8:0: Power-on or device reset occurred
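
For reading the log_info words in the output above, here is a minimal Python sketch that splits a value into the fields the driver prints. The bit layout (4-bit bus type, 4-bit originator, 8-bit code, 16-bit sub_code) is inferred from the decoded lines in this log, for example log_info(0x3003011d) giving originator(IOP), code(0x03), sub_code(0x011d). The function name, the "bus_type" key and the third originator name "IR" are assumptions and do not appear in the log.

```python
# Originator names: IOP and PL appear in the log above; IR is an assumption.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value):
    """Split a 32-bit log_info word into its fields (layout inferred from the log)."""
    return {
        "bus_type": (value >> 28) & 0xF,    # top nibble; 3 in every value logged here
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    # The three distinct values that show up in this thread's dmesg output.
    for word in (0x3003011D, 0x31120320, 0x310F0400):
        fields = decode_log_info(word)
        print(
            f"0x{word:08x}: originator({fields['originator']}), "
            f"code(0x{fields['code']:02x}), sub_code(0x{fields['sub_code']:04x})"
        )
```

This only reproduces the field split; it does not say what a given code or sub_code means.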



On 11/18/18 7:52 PM, Matt Corallo wrote:

(not subscribed to lists, please keep me on CC)

After upgrading from 4.17.2 to 4.19.2, my 9305-16i started faulting under 
load (well, not so much load given it's all spinning disks, but once 
every hour or five during a btrfs rebalance). Mostly the disks would 
all come back online and it would just result in 30 seconds of disks 
offline, but occasionally it would fall offline completely. dmesg is 
below, though it's not so useful. Downgrading to 4.17.2 again fixed the 
issue completely.


[19983.155887] mpt3sas_cm0: fault_state(0x5853)!
[19983.155932] mpt3sas_cm0: sending diag reset !!
[19984.093087] mpt3sas_cm0: diag reset: SUCCESS
[19984.155056] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default 
host page size to 4k

[19984.301067] mpt3sas_cm0: _base_display_fwpkg_version: complete
[19984.301400] mpt3sas_cm0: LSISAS3224: FWVersion(09.00.100.00), 
ChipRevision(0x01), BiosVersion(00.00.00.00)

[19984.301403] mpt3sas_cm0: Protocol=(
[19984.3014

9305-16i fault 0x5853 (regression from 4.17.2 to 4.19.2)

2018-11-18 Thread Matt Corallo

(not subscribed to lists, please keep me on CC)

After upgrading from 4.17.2 to 4.19.2, my 9305-16i started faulting under 
load (well, not so much load given it's all spinning disks, but once 
every hour or five during a btrfs rebalance). Mostly the disks would all 
come back online and it would just result in 30 seconds of disks 
offline, but occasionally it would fall offline completely. dmesg is 
below, though it's not so useful. Downgrading to 4.17.2 again fixed the 
issue completely.


[19983.155887] mpt3sas_cm0: fault_state(0x5853)!
[19983.155932] mpt3sas_cm0: sending diag reset !!
[19984.093087] mpt3sas_cm0: diag reset: SUCCESS
[19984.155056] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default 
host page size to 4k

[19984.301067] mpt3sas_cm0: _base_display_fwpkg_version: complete
[19984.301400] mpt3sas_cm0: LSISAS3224: FWVersion(09.00.100.00), 
ChipRevision(0x01), BiosVersion(00.00.00.00)

[19984.301403] mpt3sas_cm0: Protocol=(
[19984.301404] Initiator
[19984.301405] ,Target
[19984.301406] ),
[19984.301407] Capabilities=(
[19984.301408] TLR
[19984.301410] ,EEDP
[19984.301411] ,Snapshot Buffer
[19984.301412] ,Diag Trace Buffer
[19984.301413] ,Task Set Full
[19984.301414] ,NCQ
[19984.301415] )
[19984.301473] mpt3sas_cm0: sending port enable !!
[19995.149962] mpt3sas_cm0: port enable: SUCCESS
[19995.150077] mpt3sas_cm0: search for end-devices: start
[19995.151143] scsi target0:0:0: handle(0x0019), 
sas_addr(0x44332211)
[19995.151147] scsi target0:0:0: enclosure logical 
id(0x500062b203842300), slot(3)
[19995.151196] scsi target0:0:1: handle(0x001a), 
sas_addr(0x443322110300)
[19995.151199] scsi target0:0:1: enclosure logical 
id(0x500062b203842300), slot(1)
[19995.151245] scsi target0:0:3: handle(0x001b), 
sas_addr(0x443322110500)
[19995.151247] scsi target0:0:3: enclosure logical 
id(0x500062b203842300), slot(6)
[19995.151293] scsi target0:0:2: handle(0x001c), 
sas_addr(0x443322110400)
[19995.151296] scsi target0:0:2: enclosure logical 
id(0x500062b203842300), slot(7)
[19995.151342] scsi target0:0:4: handle(0x001d), 
sas_addr(0x443322110600)
[19995.151345] scsi target0:0:4: enclosure logical 
id(0x500062b203842300), slot(4)
[19995.151391] scsi target0:0:6: handle(0x001e), 
sas_addr(0x443322111000)
[19995.151393] scsi target0:0:6: enclosure logical 
id(0x500062b203842300), slot(11)
[19995.151439] scsi target0:0:7: handle(0x001f), 
sas_addr(0x44332200)
[19995.151441] scsi target0:0:7: enclosure logical 
id(0x500062b203842300), slot(10)
[19995.151487] scsi target0:0:9: handle(0x0020), 
sas_addr(0x443322111300)
[19995.151490] scsi target0:0:9: enclosure logical 
id(0x500062b203842300), slot(9)
[19995.151535] scsi target0:0:10: handle(0x0021), 
sas_addr(0x443322111200)
[19995.151537] scsi target0:0:10: enclosure logical 
id(0x500062b203842300), slot(8)

[19995.151607] mpt3sas_cm0: search for end-devices: complete
[19995.151609] mpt3sas_cm0: search for end-devices: start
[19995.151611] mpt3sas_cm0: search for PCIe end-devices: complete
[19995.151613] mpt3sas_cm0: search for expanders: start
[19995.151614] mpt3sas_cm0: search for expanders: complete
[19995.151624] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[19995.151630] mpt3sas_cm0: removing unresponding devices: start
[19995.151631] mpt3sas_cm0: removing unresponding devices: end-devices
[19995.151633] mpt3sas_cm0:  Removing unresponding devices: pcie end-devices
[19995.151635] mpt3sas_cm0: removing unresponding devices: expanders
[19995.151636] mpt3sas_cm0: removing unresponding devices: complete
[19995.151642] mpt3sas_cm0: scan devices: start
[19995.152075] mpt3sas_cm0: scan devices: expanders start
[19995.152139] mpt3sas_cm0: 	break from expander scan: 
ioc_status(0x0022), loginfo(0x310f0400)

[19995.152141] mpt3sas_cm0: scan devices: expanders complete
[19995.152142] mpt3sas_cm0: scan devices: end devices start
[19995.156007] mpt3sas_cm0: 	break from end device scan: 
ioc_status(0x0022), loginfo(0x310f0400)

[19995.156009] mpt3sas_cm0: scan devices: end devices complete
[19995.156010] mpt3sas_cm0: scan devices: pcie end devices start
[19995.156028] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), 
code(0x03), sub_code(0x011d)
[19995.156047] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), 
code(0x03), sub_code(0x011d)
[19995.156053] mpt3sas_cm0: 	break from pcie end device scan: 
ioc_status(0x0021), loginfo(0x3003011d)

[19995.156054] mpt3sas_cm0: pcie devices: pcie end devices complete
[19995.156055] mpt3sas_cm0: scan devices: complete
[19995.650024] sd 0:0:0:0: Power-on or device reset occurred
[19995.650052] sd 0:0:6:0: Power-on or device reset occurred
[19996.239565] sd 0:0:10:0: Power-on or device reset occurred
[19996.341924] sd 0:0:9:0: Power-on or device reset occurred
[19996.650155] sd 0:0:1:0: Power-on or device reset occurred
[19996.650184] sd 0:0:3:0: Power-on or device reset occurred
[19996.650197] sd 0:0:2:0: Power-on or device reset occurred
[19996.65
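
The dmesg output above is wrapped by the mail client, so a record such as "scsi target0:0:4: enclosure logical" continues on the next line. A minimal Python sketch for rejoining those records and listing which target ended up in which enclosure slot after a reset follows. The function names and the regular expression are illustrative assumptions, not part of any existing tool.

```python
import re

# Matches a rejoined "enclosure logical id(...), slot(N)" record.
SLOT_RE = re.compile(
    r"scsi target(\d+:\d+:\d+): enclosure logical id\((0x[0-9a-f]+)\), slot\((\d+)\)"
)


def unwrap(lines):
    """Join mail-wrapped continuation lines onto the preceding dmesg record."""
    records = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # blank lines left behind by the wrapping
        if line.startswith("[") or not records:
            records.append(line)  # a new record starts with its timestamp
        else:
            records[-1] += " " + line.lstrip()
    return records


def target_slots(lines):
    """Map each SCSI target (e.g. '0:0:4') to its enclosure slot number."""
    slots = {}
    for record in unwrap(lines):
        match = SLOT_RE.search(record)
        if match:
            slots[match.group(1)] = int(match.group(3))
    return slots
```

Running target_slots() over the text of each message separately and comparing the two dictionaries shows whether targets kept their slots across the resets. Truncated records, such as the cut-off first and last lines of these messages, simply fail to match and are skipped.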