Sadly, the code for mpt was never opened, so we can't really be much help. But
do you have SATA expanders in the system? Those are known to be toxic.

Sent from my iPhone

> On Jan 20, 2015, at 8:07 PM, John Barfield via illumos-discuss 
> <[email protected]> wrote:
> 
> Yeah, it looks like I spoke too soon... I just realized you said the SSDs are 
> actually failing, not just going offline and appearing to fail.
> 
> Sorry, I don't know what could be causing the failure to affect your entire 
> server/zpool. Man, those logs and symptoms sure looked like a good 
> old-fashioned NCQ issue. 
> 
> Thanks and have a great day,
> 
> John Barfield
> 
>> On Jan 20, 2015, at 9:10 PM, John Barfield via illumos-discuss 
>> <[email protected]> wrote:
>> 
>> I'm by no means an illumos expert or dev, but in my time as an AppLogic 
>> support engineer I've seen these same issues on Linux when all of the 
>> following are true:
>> 
>> 0. older kernel
>> 1. long uptime
>> 2. high load average
>> 3. SATA NCQ enabled
>> 
>> Together these cause SCSI interfaces to disconnect randomly and lead to major disk issues.
>> 
>> On Linux, disabling NCQ for the affected disk, by dropping queue_depth to 1
>> under /sys/block/<disk>/device/, resolves the issue.
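>>
>> For example (a rough sketch; the sysfs path and the device name "sda" are
>> just placeholders and vary by kernel and by which disk you are targeting):
>>
>>   # show the current queue depth (anything > 1 means NCQ/tagged queuing is in use)
>>   cat /sys/block/sda/device/queue_depth
>>
>>   # drop it to 1 so only one command is ever outstanding,
>>   # which effectively disables NCQ for that disk
>>   echo 1 > /sys/block/sda/device/queue_depth
>>
>> The setting does not survive a reboot, so it would need to go in an rc
>> script or similar if it turns out to help.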
>> 
>> I don't know whether it is the disk or the controller module that these 
>> machines have in common, but the logs and symptoms are almost identical.
>> 
>> Again, I could be way off because I'm no illumos expert... it's just eerily 
>> familiar.
>> 
>> I'd find out how to disable NCQ and then test it.
>> 
>> Sometimes you have to disable it in both the HBA/RAID card and the OS to 
>> disable it completely.
>> 
>> 
>> Thanks and have a great day,
>> 
>> John Barfield
>> 
>>> On Jan 20, 2015, at 8:53 PM, Jorgen Lundman via illumos-discuss 
>>> <[email protected]> wrote:
>>> 
>>> 
>>> Hello list,
>>> 
>>> I should apologise: technically speaking, we are still running Solaris
>>> 10 u10, which isn't illumos. We would love to move to an illumos kernel
>>> because of the problems we are encountering. More on that in a sec...
>>> 
>>> So, what appears to happen is that each time a device dies in our
>>> Supermicro + LSI SAS2008 NFS servers, it takes out the whole server. The
>>> last three failures, in December, were all SSDs dying (three separate
>>> times/servers), and each time we had to power-cycle the server.
>>> 
>>> We have about 50 of these storage servers, so changing the OS would mean a
>>> 3am maintenance window for each one. It would be nice if I could show that
>>> all those sleepless nights would be worth it, but I'm having a hard time
>>> replicating the issue.
>>> 
>>> I used a SATA extension cable and cut one of the data lines during a
>>> transfer to see if it would trigger the problem, but the damned thing
>>> ended up being a dream advertisement for how well ZFS handles failures.
>>> The error count went up, the SSD was marked faulty, and the spare kicked
>>> in. I have repeated this a number of times, but each time ZFS handles it
>>> beautifully (typical).
>>> 
>>> Any great ideas on how to simulate failed disks? Pulling them out doesn't
>>> generally work, since the controller gets notified of disconnect, as
>>> opposed to the device no longer communicating.
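>>> 
>>> One thing I have not tried yet is zinject, assuming it is even shipped on
>>> Solaris 10. It injects faults at the ZFS layer rather than at the HBA, so
>>> it may well not reproduce the controller hang, but roughly (pool and
>>> device names here are made up):
>>> 
>>>   # make ZFS treat the device as gone (I/O returns ENXIO)
>>>   zinject -d c4t1d0 -e nxio tank
>>> 
>>>   # or have it return plain I/O errors instead
>>>   zinject -d c4t1d0 -e io tank
>>> 
>>>   # list the active handlers, then clear them when done
>>>   zinject
>>>   zinject -c all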
>>> 
>>> Now, there HAVE been some changes to mpt_sas.c in illumos, most notably:
>>> 
>>> https://www.illumos.org/issues/3195
>>> https://www.illumos.org/issues/4310
>>> https://www.illumos.org/issues/5306
>>> https://www.illumos.org/issues/5483
>>> 
>>> so I am hoping it perhaps has been addressed. Anyone dare venture a guess?
>>> 
>>> 
>>> The log entries for one of the SSDs dying and taking out the server look
>>> like this (and again, this is Solaris 10):
>>> 
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> scsi: [ID 107833 kern.warning] WARNING:
>>> /pci@0,0/pci8086,340f@8/pci1000,3020@0 (mpt_sas0):
>>> 
>>> Disconnected command timeout for Target 30
>>> 
>>> {
>>> mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31140000
>>> } * 8
>>> 
>>> mptsas_check_task_mgt: IOCStatus=0x4a
>>> 
>>> mptsas_check_task_mgt: Task 0x3 failed. Target=30
>>> 
>>> mptsas_ioc_task_management failed try to reset ioc to recovery!
>>> 
>>> mpt0 Firmware version v12.0.0.0 (?)
>>> 
>>> {
>>> /scsi_vhci/disk@g50015179596fd400 (sd2): Command Timeout on path
>>> mpt_sas1/disk@w50015179596fd400,0
>>> 
>>> SCSI transport failed: reason 'timeout': retrying command
>>> 
>>> /scsi_vhci/disk@g50015179596fa188 (sd16): Command failed to complete (4)
>>> on path mpt_sas1/disk@w50015179596fa188,0
>>> 
>>> SCSI transport failed: reason 'reset': retrying command
>>> 
>>> } * 8
>>> 
>>> mptsas_restart_ioc failed
>>> 
>>> Target 30 reset for command timeout recovery failed!
>>> 
>>> MPT Firmware Fault, code: 1500
>>> 
>>> mpt0 Firmware version v12.0.0.0 (?)
>>> 
>>> mpt0: IOC Operational.
>>> 
>>> {
>>> SCSI transport failed: reason 'reset': retrying command
>>> } * 16
>>> 
>>> mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
>>> 
>>> Error for Command: read(10)                Error Level: Retryable
>>> Requested Block: 85734615                  Error Block: 85734615
>>> Vendor: ATA                                Serial Number: CVPR132407CH
>>> Sense Key: Unit Attention
>>> ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
>>> 
>>> genunix: [ID 408114 kern.info] /pci@0,0/pci8086,340f@8/pci1000,3020@0
>>> (mpt_sas0) down
>>> 
>>> 
>>> ~~~~~~~~~~~~~~~~~~~~~~~~
>>> 
>>> I would assume that last message about "down" is somewhat... undesirable.
>>> 
>>> 
>>> 
>>> Garrett D'Amore does make a good point about SATA devices in the "mpt_sas
>>> wedge" thread: all devices get reset when the controller tries to reset the
>>> one drive. But should/would that lead to a complete halt of all I/O? If that
>>> is the case, is there much we can do besides replacing all the hardware?
>>> 
>>> Lund
>>> 
>>> 
>>> -- 
>>> Jorgen Lundman       | <[email protected]>
>>> Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
>>> Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
>>> Japan                | +81 (0)3 -3375-1767          (home)