I'm by no means an illumos expert or dev, but in my time as an AppLogic support
engineer I've seen these same issues on Linux when the following is true:

0. older kernel
+
1. long uptime
+
2. high load average
+
3. SATA NCQ enabled

That combination causes SCSI interfaces to disconnect randomly, which leads to 
major disk issues.

On Linux, disabling NCQ by setting queue_depth to 1 (the minimum; a depth of 1 
means no command queuing) in /sys/block/sda/device/queue_depth resolves the issue.
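For reference, a rough sketch of that toggle as a small shell function (sda is 
just an example device name, and I'm assuming a reasonably recent sysfs layout; 
the sysfs root is parameterised only so it can be tried out safely):

```shell
# disable_ncq DEV [SYSFS_ROOT]
# Drop DEV's queue depth to 1, i.e. one outstanding command at a time,
# which effectively turns NCQ off. Needs root against the real /sys.
disable_ncq() {
    dev=$1
    root=${2:-/sys}
    qd="$root/block/$dev/device/queue_depth"
    [ -w "$qd" ] || { echo "no writable queue_depth for $dev" >&2; return 1; }
    echo 1 > "$qd"    # depth 1 = NCQ effectively disabled
}
```

Run it as e.g. `disable_ncq sda`; note the setting does not survive a reboot, so 
you'd want it in an init script or udev rule if it helps.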

I don't know if it's the disk or the controller module that these machines 
share in common, but the logs and symptoms are almost identical.

Again, I could be way off because I'm no illumos expert; it's just eerily familiar.

I'd find out how to disable NCQ and then test it.

Sometimes you have to disable it in both the HBA/RAID card and the OS to 
disable it completely.
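To confirm the OS side actually took, you can list the current depth of every 
disk that exposes one; a rough sketch against the Linux sysfs layout (the 
HBA/RAID firmware side has to be checked in its own BIOS or CLI tool):

```shell
# list_queue_depths [SYSFS_ROOT]
# Print "<device> <depth>" for every block device with a queue_depth
# attribute; a value of 1 means NCQ is effectively off at the OS level.
list_queue_depths() {
    root=${1:-/sys}
    for qd in "$root"/block/*/device/queue_depth; do
        [ -r "$qd" ] || continue
        dev=$(basename "$(dirname "$(dirname "$qd")")")
        printf '%s\t%s\n' "$dev" "$(cat "$qd")"
    done
}
```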


Thanks and have a great day,

John Barfield

> On Jan 20, 2015, at 8:53 PM, Jorgen Lundman via illumos-discuss 
> <[email protected]> wrote:
> 
> 
> Hello list,
> 
> I should apologise, technically speaking, we are still running Solaris
> 10/u10, which isn't IllumOS.  We would love to go to IllumOS kernel, due to
> the problems we are encountering. More on that in a sec...
> 
> So, what appears to happen, each time a device dies in our Supermicro+LSI
> SAS2008 NFS servers, it takes out the whole server. The last three that
> happened, in December, were all SSDs dying (3 separate times/servers) and
> each time we had to power cycle the server.
> 
> Since we have about 50 of these storage servers, and to change the OS, we
> would have to do 3am maintenances for each one. It would be nice if I could
> show that all those sleepless nights would be worth it. But I'm having a
> hard time replicating the issue.
> 
> I used a SATA extension cable, and cut one of the data lines during
> transfers to see if it would trigger the problem, but the damned thing
> ended up being a dream advertisement for how well ZFS handles failures.
> Error count went up, SSD was marked faulty, and spare kicked in. I have
> repeated this a number of times but each time ZFS handles it beautifully.
> (typical).
> 
> Any great ideas on how to simulate failed disks? Pulling them out doesn't
> generally work, since the controller gets notified of disconnect, as
> opposed to the device no longer communicating.
> 
> Now, there HAVE been some changes in mpt_sas.c in IllumOS, most notably
> 
> https://www.illumos.org/issues/3195
> https://www.illumos.org/issues/4310
> https://www.illumos.org/issues/5306
> https://www.illumos.org/issues/5483
> 
> so I am hoping it perhaps has been addressed. Anyone dare venture a guess?
> 
> 
> The log entries for one of the SSDs dying and taking out the server looks
> like (and again, this is Solaris 10):
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> scsi: [ID 107833 kern.warning] WARNING:
> /pci@0,0/pci8086,340f@8/pci1000,3020@0 (mpt_sas0):
> 
> Disconnected command timeout for Target 30
> 
> {
> mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31140000
> } * 8
> 
> mptsas_check_task_mgt: IOCStatus=0x4a
> 
> mptsas_check_task_mgt: Task 0x3 failed. Target=30
> 
> mptsas_ioc_task_management failed try to reset ioc to recovery!
> 
> mpt0 Firmware version v12.0.0.0 (?)
> 
> {
> /scsi_vhci/disk@g50015179596fd400 (sd2): Command Timeout on path
> mpt_sas1/disk@w50015179596fd400,0
> 
> SCSI transport failed: reason 'timeout': retrying command
> 
> /scsi_vhci/disk@g50015179596fa188 (sd16): Command failed to complete (4)
> on path mpt_sas1/disk@w50015179596fa188,0
> 
> SCSI transport failed: reason 'reset': retrying command
> 
> } * 8
> 
> mptsas_restart_ioc failed
> 
> Target 30 reset for command timeout recovery failed!
> 
> MPT Firmware Fault, code: 1500
> 
> mpt0 Firmware version v12.0.0.0 (?)
> 
> mpt0: IOC Operational.
> 
> {
> SCSI transport failed: reason 'reset': retrying command
> } * 16
> 
> mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
> 
> Error for Command: read(10)                Error Level: Retryable
> Requested Block: 85734615                  Error Block: 85734615
> Vendor: ATA                                Serial Number: CVPR132407CH
> Sense Key: Unit Attention
> ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
> 
> genunix: [ID 408114 kern.info] /pci@0,0/pci8086,340f@8/pci1000,3020@0
> (mpt_sas0) down
> 
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~
> 
> I would assume that last message about "down" is somewhat .. undesirable.
> 
> 
> 
> Garrett D'Amore does make a good point about SATA devices in the "mpt_sas
> wedge" thread, and that all devices get reset when it tries to reset the
> one drive. But should/would that lead to a complete halt of all IO? If that
> is the case, there is not much we can do besides replacing all the hardware?
> 
> Lund
> 
> 
> -- 
> Jorgen Lundman       | <[email protected]>
> Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
> Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
> Japan                | +81 (0)3 -3375-1767          (home)
> 
> 
> -------------------------------------------
> illumos-discuss
> Archives: https://www.listbox.com/member/archive/182180/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/182180/26677440-40b316d8
> Modify Your Subscription: https://www.listbox.com/member/?&;
> Powered by Listbox: http://www.listbox.com


