Hello list,

I should apologise: technically speaking, we are still running Solaris
10 u10, which isn't illumos. We would love to move to an illumos kernel,
because of the problems we are encountering. More on that in a sec...

So, what appears to happen is that each time a device dies in our
Supermicro+LSI SAS2008 NFS servers, it takes out the whole server. The last
three failures, all in December, were SSDs dying (three separate times, on
three separate servers), and each time we had to power-cycle the server.

Since we have about 50 of these storage servers, changing the OS would mean
a 3am maintenance window for each one. It would be nice if I could show
that all those sleepless nights would be worth it, but I'm having a hard
time replicating the issue.

I used a SATA extension cable and cut one of the data lines during
transfers to see if that would trigger the problem, but the damned thing
ended up being a dream advertisement for how well ZFS handles failures:
the error count went up, the SSD was marked faulty, and a spare kicked in.
I have repeated this a number of times, and each time ZFS handles it
beautifully. (Typical.)
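
For reference, this is roughly what I watch while injecting the fault;
the pool name ("tank") is just a placeholder:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ZFS's view: error counters, the FAULTED device, and the spare
zpool status -v tank

# FMA's view of the same event: the faulted resource list, then
# the raw error telemetry that led to the diagnosis
fmadm faulty
fmdump -eV | tail -50
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~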

Any great ideas on how to simulate failed disks? Pulling them out doesn't
generally work, since the controller is notified of the disconnect, as
opposed to the device simply no longer responding.
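
One thing I have been meaning to try is zinject, which injects faults at
the ZFS layer. Two caveats: as far as I know it isn't delivered on stock
Solaris 10 (it comes with the ZFS source, and illumos builds have it), and
since it sits above the HBA driver it probably exercises only the ZFS-side
handling, not the mpt_sas timeout path that is wedging our boxes. A rough
sketch, with the pool and device names as placeholders:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Make every I/O to one leaf vdev return an error
zinject -d c4t50015179596FD400d0 -e io -T all tank

# Or make the device look like it has vanished entirely
zinject -d c4t50015179596FD400d0 -e nxio tank

# List the active injection handlers, then clear them all
zinject
zinject -c all
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~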

Now, there HAVE been some changes to the mpt_sas driver in illumos, most noticeably

https://www.illumos.org/issues/3195
https://www.illumos.org/issues/4310
https://www.illumos.org/issues/5306
https://www.illumos.org/issues/5483

so I am hoping the problem has perhaps been addressed. Anyone dare venture a guess?
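
For what it's worth, once we have an illumos tree handy, checking whether
a given source tree carries those fixes should be straightforward;
something like this, assuming a git clone of illumos-gate and that the
commit messages start with the issue number (as they conventionally do):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# In a clone of illumos-gate, look for the issue numbers above in
# commits touching the mpt_sas driver
cd illumos-gate
for bug in 3195 4310 5306 5483; do
    git log --oneline --grep="^$bug " -- \
        usr/src/uts/common/io/scsi/adapters/mpt_sas
done
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~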


The log entries for one of the SSDs dying and taking out the server look
like this (and again, this is Solaris 10):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340f@8/pci1000,3020@0 (mpt_sas0):

Disconnected command timeout for Target 30

{
mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31140000
} * 8

mptsas_check_task_mgt: IOCStatus=0x4a

mptsas_check_task_mgt: Task 0x3 failed. Target=30

mptsas_ioc_task_management failed try to reset ioc to recovery!

mpt0 Firmware version v12.0.0.0 (?)

{
 /scsi_vhci/disk@g50015179596fd400 (sd2): Command Timeout on path
mpt_sas1/disk@w50015179596fd400,0

SCSI transport failed: reason 'timeout': retrying command

 /scsi_vhci/disk@g50015179596fa188 (sd16): Command failed to complete (4)
on path mpt_sas1/disk@w50015179596fa188,0

SCSI transport failed: reason 'reset': retrying command

} * 8

mptsas_restart_ioc failed

Target 30 reset for command timeout recovery failed!

MPT Firmware Fault, code: 1500

mpt0 Firmware version v12.0.0.0 (?)

mpt0: IOC Operational.

{
SCSI transport failed: reason 'reset': retrying command
} * 16

mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116

Error for Command: read(10)                Error Level: Retryable
Requested Block: 85734615                  Error Block: 85734615
Vendor: ATA                                Serial Number: CVPR132407CH
Sense Key: Unit Attention
ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

genunix: [ID 408114 kern.info] /pci@0,0/pci8086,340f@8/pci1000,3020@0
(mpt_sas0) down


~~~~~~~~~~~~~~~~~~~~~~~~

I would assume that last message, about the controller going "down", is somewhat... undesirable.
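
Until we have a fix, I would at least like to catch these events the
moment they start instead of waiting for the NFS clients to complain. A
crude watcher along these lines, assuming the message wording is stable
across our boxes (the mail address is a placeholder):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/sh
# Follow the system log and alert on the telltale mpt_sas messages
# seen above. Runs forever; kill it to stop.
tail -0f /var/adm/messages | while read line; do
    case "$line" in
    *"Disconnected command timeout"*|*"MPT Firmware Fault"*|*"mptsas_restart_ioc failed"*)
        echo "$line" | mailx -s "mpt_sas trouble on `hostname`" [email protected]
        ;;
    esac
done
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~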



Garrett D'Amore does make a good point about SATA devices in the "mpt_sas
wedge" thread: when the driver tries to reset the one failing drive, all
devices on the controller get reset. But should/would that lead to a
complete halt of all I/O? If that is the case, is there much we can do
besides replacing all the hardware?

Lund


-- 
Jorgen Lundman       | <[email protected]>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)

