Yeah, it looks like I spoke too soon... I just realized you said the SSDs are actually failing, not just going offline and appearing to fail.

Sorry, I don't know what could be causing the failure to affect your entire server/zpool. Man, those logs and symptoms sure looked like a good old-fashioned NCQ issue.

Thanks and have a great day,

John Barfield
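P.S. In case it is still useful for comparison testing, here is a minimal sketch of the Linux-side NCQ workaround I mention below. The sysfs path and the minimum value are from memory of Linux boxes, not illumos, so verify them on your own kernel; /dev/sda is just a placeholder:

    # Check the current queue depth for the placeholder disk /dev/sda;
    # SATA drives with NCQ active typically report 31 or 32 here.
    cat /sys/block/sda/device/queue_depth

    # Drop the depth to 1. Many kernels reject 0, and a depth of 1
    # effectively disables NCQ: only one command outstanding at a time.
    echo 1 > /sys/block/sda/device/queue_depth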
> On Jan 20, 2015, at 9:10 PM, John Barfield via illumos-discuss
> <[email protected]> wrote:
>
> I'm by no means an illumos expert or dev... but in my time as an AppLogic
> support engineer I've seen these same issues on Linux when the following is
> true:
>
> 0. older kernel
> +
> 1. long uptime
> +
> 2. high load average
> +
> 3. SATA NCQ enabled
>
> That combination causes SCSI interfaces to disconnect randomly and causes
> major disk issues.
>
> On Linux, disabling NCQ in /sys/block/sda/device/ by setting queue_depth
> to 0 resolves the issue.
>
> I don't know if it could be the disk or controller module that these
> machines share in common, but the logs and symptoms are almost identical.
>
> Again, I could be way off, because I'm no illumos expert... it's just
> eerily familiar.
>
> I'd find out how to disable NCQ and then test it.
>
> Sometimes you have to disable it in both the HBA/RAID card and the OS to
> disable it completely.
>
> Thanks and have a great day,
>
> John Barfield
>
>> On Jan 20, 2015, at 8:53 PM, Jorgen Lundman via illumos-discuss
>> <[email protected]> wrote:
>>
>> Hello list,
>>
>> I should apologise: technically speaking, we are still running Solaris
>> 10/u10, which isn't illumos. We would love to move to the illumos kernel,
>> because of the problems we are encountering. More on that in a sec...
>>
>> What appears to happen is that each time a device dies in our
>> Supermicro+LSI SAS2008 NFS servers, it takes out the whole server. The
>> last three failures, in December, were all SSDs dying (three separate
>> times/servers), and each time we had to power cycle the server.
>>
>> Since we have about 50 of these storage servers, changing the OS would
>> mean a 3am maintenance window for each one. It would be nice if I could
>> show that all those sleepless nights would be worth it, but I'm having a
>> hard time replicating the issue.
>>
>> I used a SATA extension cable and cut one of the data lines during
>> transfers to see if it would trigger the problem, but the damned thing
>> ended up being a dream advertisement for how well ZFS handles failures.
>> The error count went up, the SSD was marked faulty, and the spare kicked
>> in. I have repeated this a number of times, but each time ZFS handles it
>> beautifully. (Typical.)
>>
>> Any great ideas on how to simulate failed disks? Pulling them out doesn't
>> generally work, since the controller gets notified of the disconnect,
>> rather than the device simply no longer communicating.
>>
>> Now, there HAVE been some changes to mpt_sas.c in illumos, most notably:
>>
>> https://www.illumos.org/issues/3195
>> https://www.illumos.org/issues/4310
>> https://www.illumos.org/issues/5306
>> https://www.illumos.org/issues/5483
>>
>> so I am hoping it has perhaps been addressed. Anyone dare venture a guess?
>>
>> The log entries for one of the SSDs dying and taking out the server look
>> like this (and again, this is Solaris 10):
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> scsi: [ID 107833 kern.warning] WARNING:
>> /pci@0,0/pci8086,340f@8/pci1000,3020@0 (mpt_sas0):
>>
>> Disconnected command timeout for Target 30
>>
>> {
>> mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31140000
>> } * 8
>>
>> mptsas_check_task_mgt: IOCStatus=0x4a
>>
>> mptsas_check_task_mgt: Task 0x3 failed. Target=30
>>
>> mptsas_ioc_task_management failed try to reset ioc to recovery!
>>
>> mpt0 Firmware version v12.0.0.0 (?)
>>
>> {
>> /scsi_vhci/disk@g50015179596fd400 (sd2): Command Timeout on path
>> mpt_sas1/disk@w50015179596fd400,0
>>
>> SCSI transport failed: reason 'timeout': retrying command
>>
>> /scsi_vhci/disk@g50015179596fa188 (sd16): Command failed to complete (4)
>> on path mpt_sas1/disk@w50015179596fa188,0
>>
>> SCSI transport failed: reason 'reset': retrying command
>> } * 8
>>
>> mptsas_restart_ioc failed
>>
>> Target 30 reset for command timeout recovery failed!
>>
>> MPT Firmware Fault, code: 1500
>>
>> mpt0 Firmware version v12.0.0.0 (?)
>>
>> mpt0: IOC Operational.
>>
>> {
>> SCSI transport failed: reason 'reset': retrying command
>> } * 16
>>
>> mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
>>
>> Error for Command: read(10)    Error Level: Retryable
>> Requested Block: 85734615      Error Block: 85734615
>> Vendor: ATA                    Serial Number: CVPR132407CH
>> Sense Key: Unit Attention
>> ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
>>
>> genunix: [ID 408114 kern.info] /pci@0,0/pci8086,340f@8/pci1000,3020@0
>> (mpt_sas0) down
>> ~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> I would assume that last message about "down" is somewhat... undesirable.
>>
>> Garrett D'Amore does make a good point about SATA devices in the "mpt_sas
>> wedge" thread, namely that all devices get reset when the driver tries to
>> reset the one drive. But should/would that lead to a complete halt of all
>> IO? If that is the case, there is not much we can do besides replacing all
>> the hardware?
>>
>> Lund
>>
>> --
>> Jorgen Lundman       | <[email protected]>
>> Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
>> Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
>> Japan                | +81 (0)3-3375-1767 (home)
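One more thought on Jorgen's question above about simulating failed disks: ZFS ships an undocumented fault-injection tool, zinject, which can make a vdev return I/O errors without touching the hardware. This is only a sketch; it assumes zinject is present on your Solaris 10 build, and the pool name "tank" and device c0t30d0 are placeholders:

    # Make ZFS see persistent I/O errors on one vdev of pool "tank";
    # c0t30d0 stands in for the SSD under test.
    zinject -d c0t30d0 -e io tank

    # List the active injection handlers.
    zinject

    # Remove all injection handlers once the test is done.
    zinject -c all

One caveat: zinject injects errors above the driver, so it exercises the ZFS failure path but probably not the mpt_sas/firmware reset path shown in the log above; the cut-cable trick is still the closer simulation of that.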
