Re: 9211 (LSI/SAS) issues on 11.2-STABLE

2019-02-07 Thread Borja Marcos


> On 6 Feb 2019, at 16:34, Karl Denninger  wrote:
> 
> On 2/6/2019 09:18, Borja Marcos wrote:
 Number of Hardware Resets has incremented.  There are no other errors 
 shown:
>> What is _exactly_ that value? Is it related to the number of resets sent 
>> from the HBA
>> _or_ the device resetting by itself?
> Good question.  What counts as a "reset"; UNIT ATTENTION is what the
> controller receives but whether that's a power reset, a reset *command*
> from the HBA or a firmware crash (yikes!) in the disk I'm not certain.

In my youth I wrote software for tape drives. After a reset, no matter how it 
was
initiated (the device itself or the HBA) the device will give you a UNIT 
ATTENTION
if I remember well (25 years ago). 


 I'd throw possible shade at the backplane or cable /but I have already
 swapped both out for spares without any change in behavior./
>> What about the power supply? 
>> 
> There are multiple other devices and the system board on that supply
> (and thus voltage rails) but it too has been swapped out without
> change.  In fact at this point other than the system board and RAM
> (which is ECC, and is showing no errors in the system's BMC log)
> /everything /in the server case (HBA, SATA expander, backplane, power
> supply and cables) has been swapped for spares.  No change in behavior.
> 
> Note that with 20.0.7.0 firmware in the HBA instead of a unit attention
> I get a *controller* reset (!!!) which detaches some random number of
> devices from ZFS's point of view before it comes back up (depending on
> what's active at the time) which is potentially catastrophic if it hits
> the system pool.  I immediately went back to 19.0.0.0 firmware on the
> HBA; I had upgraded to 20.0.7.0 since there had been good reports of
> stability with it when I first saw this, thinking there was a drive
> change that might have resulted in issues with it when running 19.0
> firmware on the card.

I have a system running 12.0-RELEASE-p1 with a LSI2008, 15 SAS disks and a SATA 
SSD
and I haven’t seen any problems. This is heavily loaded with just 8 GB of 
memory and a lot
of stuff running. 

mps0:  port 0x9000-0x90ff mem 0xdfff-0xdff
f,0xdff8-0xdffb irq 17 at device 0.0 numa-domain 0 on pci4
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c

ses0:
Enclosure Name: LSILOGIC SASX28 A.0 5021

> This system was completely stable for over a year on 11.1-STABLE and in
> fact hadn't been rebooted or logged a single "event" in over six months;
> the problems started immediately upon upgrade to 11.2-STABLE and
> persists on 12.0-STABLE.  The disks in question haven't changed either
> (so it can't be a difference in firmware that is in a newer purchased
> disk, for example.)

But you are right, a panic because of a disk problem points to a bug. As long 
as the ZFS
pool is usable, trouble with one of its disks should just be logged. Unless of 
course
the disk is used for swap or the disk failure leads to the system being unable 
to 
complete a page in. Again, it shouldn’t happen.

> I'm thinking perhaps *something* in the codebase change made the HBA ->
> SAS Expander combination trouble where it wasn't before; I've got a
> couple of 16i HBAs on the way which will allow me to remove the SAS
> expander to see if that causes the problem to disappear.  I've got a
> bunch of these Lenovo expanders and have been using them without any
> sort of trouble in multiple machines; it's only when I went beyond 11.1
> that I started having trouble with them.

It might be some backplane misbehavior triggering a bug, complicated.





Borja.



___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: 9211 (LSI/SAS) issues on 11.2-STABLE

2019-02-06 Thread Karl Denninger
On 2/6/2019 09:18, Borja Marcos wrote:
>> On 5 Feb 2019, at 23:49, Karl Denninger  wrote:
>>
>> BTW under 12.0-STABLE (built this afternoon after the advisories came
>> out, with the patches) it's MUCH worse.  I get the same device resets
>> BUT it's followed by an immediate panic which I cannot dump as it
>> generates a page-fault (supervisor read data, page not present) in the
>> mps *driver* at mpssas_send_abort+0x21.
>> This precludes a dump of course since attempting to do so gives you a
>> double-panic (I was wondering why I didn't get a crash dump!); I'll
>> re-jigger the box to stick a dump device on an internal SATA device so I
>> can successfully get the dump when it happens and see if I can obtain a
>> proper crash dump on this.
>>
>> I think it's fair to assume that 12.0-STABLE should not panic on a disk
>> problem (unless of course the problem is trying to page something back
>> in -- it's not, the drive that aborts and resets is on a data pack doing
>> a scrub)
> It shouldn’t panic I imagine.
>
> mps0: Sending reset from mpssas_send_abort for target ID 37
>
>>> 0x06  =  =   =  ===  == Transport Statistics (rev 1) ==
>>> 0x06  0x008  4   6  ---  Number of Hardware Resets
>>> 0x06  0x010  4   0  ---  Number of ASR Events
>>> 0x06  0x018  4   0  ---  Number of Interface CRC Errors
>>> |||_ C monitored condition met
>>> ||__ D supports DSN
>>> |___ N normalized value
>>>
>>> 0x06  0x008  4   7  ---  Number of Hardware Resets
>>> 0x06  0x010  4   0  ---  Number of ASR Events
>>> 0x06  0x018  4   0  ---  Number of Interface CRC Errors
>>> |||_ C monitored condition met
>>> ||__ D supports DSN
>>> |___ N normalized value
>>>
>>> Number of Hardware Resets has incremented.  There are no other errors shown:
> What is _exactly_ that value? Is it related to the number of resets sent from 
> the HBA
> _or_ the device resetting by itself?
Good question.  What counts as a "reset"; UNIT ATTENTION is what the
controller receives but whether that's a power reset, a reset *command*
from the HBA or a firmware crash (yikes!) in the disk I'm not certain.
>>> I'd throw possible shade at the backplane or cable /but I have already
>>> swapped both out for spares without any change in behavior./
> What about the power supply? 
>
There are multiple other devices and the system board on that supply
(and thus voltage rails) but it too has been swapped out without
change.  In fact at this point other than the system board and RAM
(which is ECC, and is showing no errors in the system's BMC log)
/everything /in the server case (HBA, SATA expander, backplane, power
supply and cables) has been swapped for spares.  No change in behavior.

Note that with 20.0.7.0 firmware in the HBA instead of a unit attention
I get a *controller* reset (!!!) which detaches some random number of
devices from ZFS's point of view before it comes back up (depending on
what's active at the time) which is potentially catastrophic if it hits
the system pool.  I immediately went back to 19.0.0.0 firmware on the
HBA; I had upgraded to 20.0.7.0 since there had been good reports of
stability with it when I first saw this, thinking there was a drive
change that might have resulted in issues with it when running 19.0
firmware on the card.

This system was completely stable for over a year on 11.1-STABLE and in
fact hadn't been rebooted or logged a single "event" in over six months;
the problems started immediately upon upgrade to 11.2-STABLE and
persists on 12.0-STABLE.  The disks in question haven't changed either
(so it can't be a difference in firmware that is in a newer purchased
disk, for example.)

I'm thinking perhaps *something* in the codebase change made the HBA ->
SAS Expander combination trouble where it wasn't before; I've got a
couple of 16i HBAs on the way which will allow me to remove the SAS
expander to see if that causes the problem to disappear.  I've got a
bunch of these Lenovo expanders and have been using them without any
sort of trouble in multiple machines; it's only when I went beyond 11.1
that I started having trouble with them.

-- 
Karl Denninger
k...@denninger.net 
/The Market Ticker/
/[S/MIME encrypted email preferred]/


smime.p7s
Description: S/MIME Cryptographic Signature


Re: 9211 (LSI/SAS) issues on 11.2-STABLE

2019-02-06 Thread Borja Marcos


> On 5 Feb 2019, at 23:49, Karl Denninger  wrote:
> 
> BTW under 12.0-STABLE (built this afternoon after the advisories came
> out, with the patches) it's MUCH worse.  I get the same device resets
> BUT it's followed by an immediate panic which I cannot dump as it
> generates a page-fault (supervisor read data, page not present) in the
> mps *driver* at mpssas_send_abort+0x21.

> This precludes a dump of course since attempting to do so gives you a
> double-panic (I was wondering why I didn't get a crash dump!); I'll
> re-jigger the box to stick a dump device on an internal SATA device so I
> can successfully get the dump when it happens and see if I can obtain a
> proper crash dump on this.
> 
> I think it's fair to assume that 12.0-STABLE should not panic on a disk
> problem (unless of course the problem is trying to page something back
> in -- it's not, the drive that aborts and resets is on a data pack doing
> a scrub)

It shouldn’t panic I imagine.

 mps0: Sending reset from mpssas_send_abort for target ID 37


>> 0x06  =  =   =  ===  == Transport Statistics (rev 1) ==
>> 0x06  0x008  4   6  ---  Number of Hardware Resets
>> 0x06  0x010  4   0  ---  Number of ASR Events
>> 0x06  0x018  4   0  ---  Number of Interface CRC Errors
>> |||_ C monitored condition met
>> ||__ D supports DSN
>> |___ N normalized value
>> 
>> 0x06  0x008  4   7  ---  Number of Hardware Resets
>> 0x06  0x010  4   0  ---  Number of ASR Events
>> 0x06  0x018  4   0  ---  Number of Interface CRC Errors
>> |||_ C monitored condition met
>> ||__ D supports DSN
>> |___ N normalized value
>> 
>> Number of Hardware Resets has incremented.  There are no other errors shown:

What is _exactly_ that value? Is it related to the number of resets sent from 
the HBA
_or_ the device resetting by itself?

>> I'd throw possible shade at the backplane or cable /but I have already
>> swapped both out for spares without any change in behavior./

What about the power supply? 





Borja.


___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: 9211 (LSI/SAS) issues on 11.2-STABLE

2019-02-05 Thread Karl Denninger
BTW under 12.0-STABLE (built this afternoon after the advisories came
out, with the patches) it's MUCH worse.  I get the same device resets
BUT it's followed by an immediate panic which I cannot dump as it
generates a page-fault (supervisor read data, page not present) in the
mps *driver* at mpssas_send_abort+0x21.

This precludes a dump of course since attempting to do so gives you a
double-panic (I was wondering why I didn't get a crash dump!); I'll
re-jigger the box to stick a dump device on an internal SATA device so I
can successfully get the dump when it happens and see if I can obtain a
proper crash dump on this.

I think it's fair to assume that 12.0-STABLE should not panic on a disk
problem (unless of course the problem is trying to page something back
in -- it's not, the drive that aborts and resets is on a data pack doing
a scrub)

On 2/5/2019 10:26, Karl Denninger wrote:
> On 2/5/2019 09:22, Karl Denninger wrote:
>> On 2/2/2019 12:02, Karl Denninger wrote:
>>> I recently started having some really oddball things  happening under
>>> stress.  This coincided with the machine being updated to 11.2-STABLE
>>> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>>>
>>> Specifically, I get "errors" like this:
>>>
>>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>>> length 131072 SMID 269 Aborting command 0xfe0001179110
>>> mps0: Sending reset from mpssas_send_abort for target ID 37
>>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>>> length 131072 SMID 924 terminated ioc 804b loginfo 3114 scsi 0 state
>>> c xfer 0
>>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> length 131072 SMID 161 terminated ioc 804b loginfo 3114 scsi 0 state
>>> c xfer 0
>>> mps0: Unfreezing devq for target ID 37
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: Command timeout
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: SCSI Status Error
>>> (da12:mps0:0:37:0): SCSI status: Check Condition
>>> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
>>> reset, or bus device reset occurred)
>>> (da12:mps0:0:37:0): Retrying command (per sense data)
>>>
>>> The "Unit Attention" implies the drive reset.  It only occurs on certain
>>> drives under very heavy load (e.g. a scrub.)  I've managed to provoke it
>>> on two different brands of disk across multiple firmware and capacities,
>>> however, which tends to point away from a drive firmware problem.
>>>
>>> A look at the pool data shows /no /errors (e.g. no checksum problems,
>>> etc) and a look at the disk itself (using smartctl) shows no problems
>>> either -- whatever is going on here the adapter is recovering from it
>>> without any data corruption or loss registered on *either end*!
>>>
>>> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>>>
>>> mps0:  port 0xc000-0xc0ff mem
>>> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
>>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 1285c
>> After considerable additional work this looks increasingly like either a
>> missed interrupt or a command is getting lost between the host adapter
>> and the expander.
>>
>> I'm going to turn the driver debug level up and see if I can capture
>> more information. whatever is behind this, however, it is
>> almost-certainly related to something that changed between 11.1 and
>> 11.2, as I never saw these on the 11.1-STABLE build.
>>
>> --
>> Karl Denninger
>> k...@denninger.net 
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
> Pretty decent trace here -- any ideas?
>
> mps0: timedout cm 0xfe00011b5020 allocated tm 0xfe00011812a0
>     (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
> length 131072 SMID 634 Aborting command 0xfe00011b5020
> mps0: Sending reset from mpssas_send_abort for target ID 37
> mps0: queued timedout cm 0xfe00011c2760 for processing by tm
> 0xfe00011812a0
> mps0: queued timedout cm 0xfe00011a74f0 for processing by tm
> 0xfe00011812a0
> mps0: queued timedout cm 0xfe00011cfd50 for processing by tm
> 0xfe00011812a0
> mps0: EventReply    :
>     EventDataLength: 2
>     AckRequired: 0
>     Event: SasDiscovery (0x16)
>     EventContext: 0x0
>     Flags: 1
>     ReasonCode: Discovery Started
>     PhysicalPort: 0
>     DiscoveryStatus: 0
> mps0: 

Re: 9211 (LSI/SAS) issues on 11.2-STABLE

2019-02-05 Thread Karl Denninger
On 2/5/2019 09:22, Karl Denninger wrote:
> On 2/2/2019 12:02, Karl Denninger wrote:
>> I recently started having some really oddball things  happening under
>> stress.  This coincided with the machine being updated to 11.2-STABLE
>> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>>
>> Specifically, I get "errors" like this:
>>
>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>> length 131072 SMID 269 Aborting command 0xfe0001179110
>> mps0: Sending reset from mpssas_send_abort for target ID 37
>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>> length 131072 SMID 924 terminated ioc 804b loginfo 3114 scsi 0 state
>> c xfer 0
>>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> length 131072 SMID 161 terminated ioc 804b loginfo 3114 scsi 0 state
>> c xfer 0
>> mps0: Unfreezing devq for target ID 37
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: Command timeout
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>> (da12:mps0:0:37:0): Retrying command
>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>> (da12:mps0:0:37:0): CAM status: SCSI Status Error
>> (da12:mps0:0:37:0): SCSI status: Check Condition
>> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
>> reset, or bus device reset occurred)
>> (da12:mps0:0:37:0): Retrying command (per sense data)
>>
>> The "Unit Attention" implies the drive reset.  It only occurs on certain
>> drives under very heavy load (e.g. a scrub.)  I've managed to provoke it
>> on two different brands of disk across multiple firmware and capacities,
>> however, which tends to point away from a drive firmware problem.
>>
>> A look at the pool data shows /no /errors (e.g. no checksum problems,
>> etc) and a look at the disk itself (using smartctl) shows no problems
>> either -- whatever is going on here the adapter is recovering from it
>> without any data corruption or loss registered on *either end*!
>>
>> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>>
>> mps0:  port 0xc000-0xc0ff mem
>> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities:
>> 1285c
> After considerable additional work this looks increasingly like either a
> missed interrupt or a command is getting lost between the host adapter
> and the expander.
>
> I'm going to turn the driver debug level up and see if I can capture
> more information. whatever is behind this, however, it is
> almost-certainly related to something that changed between 11.1 and
> 11.2, as I never saw these on the 11.1-STABLE build.
>
> --
> Karl Denninger
> k...@denninger.net 
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/
Pretty decent trace here -- any ideas?

mps0: timedout cm 0xfe00011b5020 allocated tm 0xfe00011812a0
    (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
length 131072 SMID 634 Aborting command 0xfe00011b5020
mps0: Sending reset from mpssas_send_abort for target ID 37
mps0: queued timedout cm 0xfe00011c2760 for processing by tm
0xfe00011812a0
mps0: queued timedout cm 0xfe00011a74f0 for processing by tm
0xfe00011812a0
mps0: queued timedout cm 0xfe00011cfd50 for processing by tm
0xfe00011812a0
mps0: EventReply    :
    EventDataLength: 2
    AckRequired: 0
    Event: SasDiscovery (0x16)
    EventContext: 0x0
    Flags: 1
    ReasonCode: Discovery Started
    PhysicalPort: 0
    DiscoveryStatus: 0
mps0: (0)->(mpssas_fw_work) Working on  Event: [16]
mps0: (1)->(mpssas_fw_work) Event Free: [16]
    (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00
length 131072 SMID 961 completed timedout cm 0xfe00011cfd50 ccb
0xf8019458e000 during recovery ioc 804b scsi 0 state c
(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00 length
131072 SMID 961 terminated ioc 804b loginfo 3114 scsi 0 state c xfer 0
    (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
length 131072 SMID 634 completed timedout cm
0xfe00011b5(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00
01 00 00
020 ccb 0xf80056fb5000 during recovery ioc 8048 scsi 0 state
c(da11:mps0:0:37:0): CAM status: CCB request completed with an error
(da11:mps0:0:37:0): Retrying command
    (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
length 131072 SMID 798 completed timedout cm 0xfe00011c2760 ccb
0xf80054e86000 during recovery ioc 804b scsi 0 state 

Re: 9211 (LSI/SAS) issues on 11.2-STABLE

2019-02-05 Thread Karl Denninger

On 2/2/2019 12:02, Karl Denninger wrote:
> I recently started having some really oddball things  happening under
> stress.  This coincided with the machine being updated to 11.2-STABLE
> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>
> Specifically, I get "errors" like this:
>
>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
> length 131072 SMID 269 Aborting command 0xfe0001179110
> mps0: Sending reset from mpssas_send_abort for target ID 37
>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
> length 131072 SMID 924 terminated ioc 804b loginfo 3114 scsi 0 state
> c xfer 0
>     (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> length 131072 SMID 161 terminated ioc 804b loginfo 3114 scsi 0 state
> c xfer 0
> mps0: Unfreezing devq for target ID 37
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: Command timeout
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
> (da12:mps0:0:37:0): Retrying command
> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
> (da12:mps0:0:37:0): CAM status: SCSI Status Error
> (da12:mps0:0:37:0): SCSI status: Check Condition
> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
> reset, or bus device reset occurred)
> (da12:mps0:0:37:0): Retrying command (per sense data)
>
> The "Unit Attention" implies the drive reset.  It only occurs on certain
> drives under very heavy load (e.g. a scrub.)  I've managed to provoke it
> on two different brands of disk across multiple firmware and capacities,
> however, which tends to point away from a drive firmware problem.
>
> A look at the pool data shows /no /errors (e.g. no checksum problems,
> etc) and a look at the disk itself (using smartctl) shows no problems
> either -- whatever is going on here the adapter is recovering from it
> without any data corruption or loss registered on *either end*!
>
> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>
> mps0:  port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 1285c

After considerable additional work this looks increasingly like either a
missed interrupt or a command is getting lost between the host adapter
and the expander.

I'm going to turn the driver debug level up and see if I can capture
more information. whatever is behind this, however, it is
almost-certainly related to something that changed between 11.1 and
11.2, as I never saw these on the 11.1-STABLE build.

--
Karl Denninger
k...@denninger.net 
/The Market Ticker/
/[S/MIME encrypted email preferred]/


smime.p7s
Description: S/MIME Cryptographic Signature


9211 (LSI/SAS) issues on 11.2-STABLE

2019-02-02 Thread Karl Denninger
I recently started having some really oddball things  happening under
stress.  This coincided with the machine being updated to 11.2-STABLE
(FreeBSD 11.2-STABLE #1 r342918:) from 11.1.

Specifically, I get "errors" like this:

    (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
length 131072 SMID 269 Aborting command 0xfe0001179110
mps0: Sending reset from mpssas_send_abort for target ID 37
    (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
length 131072 SMID 924 terminated ioc 804b loginfo 3114 scsi 0 state
c xfer 0
    (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
length 131072 SMID 161 terminated ioc 804b loginfo 3114 scsi 0 state
c xfer 0
mps0: Unfreezing devq for target ID 37
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: CCB request completed with an error
(da12:mps0:0:37:0): Retrying command
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: Command timeout
(da12:mps0:0:37:0): Retrying command
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: CCB request completed with an error
(da12:mps0:0:37:0): Retrying command
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: SCSI Status Error
(da12:mps0:0:37:0): SCSI status: Check Condition
(da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
reset, or bus device reset occurred)
(da12:mps0:0:37:0): Retrying command (per sense data)

The "Unit Attention" implies the drive reset.  It only occurs on certain
drives under very heavy load (e.g. a scrub.)  I've managed to provoke it
on two different brands of disk across multiple firmware and capacities,
however, which tends to point away from a drive firmware problem.

A look at the pool data shows /no /errors (e.g. no checksum problems,
etc) and a look at the disk itself (using smartctl) shows no problems
either -- whatever is going on here the adapter is recovering from it
without any data corruption or loss registered on *either end*!

The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:

mps0:  port 0xc000-0xc0ff mem
0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities:
1285c

There is also a SAS expander connected to that with all but the boot
drives on it (the LSI card will not boot from the expander so the boot
mirror is directly connected to the adapter.)

Thinking this might be a firmware/driver compatibility related problem I
flashed the card to 20.00.07.00, which is the latest available.  That
made the situation **MUCH** worse; now instead of getting unit attention
issues I got *controller* resets (!!) which invariably some random
device (and sometimes more than one) in one of the pools to get
detached, as the controller didn't come back up fast enough for ZFS and
it declares the device(s) in question "removed".

Needless to say I immediately flashed the card back to 19.00.00.00!

This configuration has been completely stable on 11.1 for upwards of a
year, and only started misbehaving when I updated the OS to 11.2.  I've
pounded the living daylights out of this box for a very long time on a
succession of FreeBSD OS builds and up to 11.1 have never seen anything
like this; if I had a bad drive, it was clearly the drive.

Looking at the commit logs for the mps driver it appears there isn't
much here that *could* be involved, unless there's an interrupt issue
with some of the MSI changes that is interacting with my specific
motherboard line.

Any ideas on running this down would be appreciated; it's not easy to
trigger it on the 19.0 firmware but on 20. I can force a controller
reset and detach within a few minutes by running scrubs so if there are
things I can try (I have a sandbox machine with the same hardware in it
that won't make me cry much if I blow it up) that would great.

Thanks!

-- 
Karl Denninger
k...@denninger.net 
/The Market Ticker/
/[S/MIME encrypted email preferred]/


smime.p7s
Description: S/MIME Cryptographic Signature