Sadly, the code for mpt was never opened, so we can't really be much help. But do you have SATA expanders in the system? Those are known to be toxic.
Sent from my iPhone

> On Jan 20, 2015, at 8:07 PM, John Barfield via illumos-discuss
> <[email protected]> wrote:
>
> Yeah, it looks like I spoke too soon... I just realized you said the SSDs
> are actually failing, not just going offline and appearing to fail.
>
> Sorry, I don't know what could be causing the failure to affect your entire
> server/zpool. Man, those logs and symptoms sure looked like a good
> old-fashioned NCQ issue.
>
> Thanks and have a great day,
>
> John Barfield
>
>> On Jan 20, 2015, at 9:10 PM, John Barfield via illumos-discuss
>> <[email protected]> wrote:
>>
>> I'm by no means an illumos expert or dev... but in my time as an AppLogic
>> support engineer I've seen these same issues on Linux when the following
>> is true:
>>
>> 0. older kernel
>> +
>> 1. long uptime
>> +
>> 2. high load average
>> +
>> 3. SATA NCQ enabled
>>
>> That combination causes SCSI interfaces to disconnect randomly and causes
>> major disk issues.
>>
>> On Linux, disabling NCQ in /sys/devices/block/sda/ by setting queue_depth
>> to 0 resolves the issue.
>>
>> I don't know if it's the disk or controller module that these machines
>> share in common, but the logs and symptoms are almost identical.
>>
>> Again, I could be way off because I'm no illumos expert... it's just
>> eerily familiar.
>>
>> I'd find out how to disable NCQ and then test it.
>>
>> Sometimes you have to disable it in both the HBA/RAID card and the OS to
>> disable it completely.
>>
>> Thanks and have a great day,
>>
>> John Barfield
>>
>>> On Jan 20, 2015, at 8:53 PM, Jorgen Lundman via illumos-discuss
>>> <[email protected]> wrote:
>>>
>>> Hello list,
>>>
>>> I should apologise: technically speaking, we are still running Solaris
>>> 10/u10, which isn't illumos. We would love to move to an illumos kernel,
>>> due to the problems we are encountering. More on that in a sec...
>>>
>>> What appears to happen is that each time a device dies in our
>>> Supermicro+LSI SAS2008 NFS servers, it takes out the whole server. The
>>> last three failures, in December, were all SSDs dying (3 separate
>>> times/servers), and each time we had to power cycle the server.
>>>
>>> Since we have about 50 of these storage servers, changing the OS would
>>> mean a 3am maintenance window for each one. It would be nice if I could
>>> show that all those sleepless nights would be worth it, but I'm having a
>>> hard time replicating the issue.
>>>
>>> I used a SATA extension cable and cut one of the data lines during
>>> transfers to see if it would trigger the problem, but the damned thing
>>> ended up being a dream advertisement for how well ZFS handles failures.
>>> The error count went up, the SSD was marked faulty, and the spare kicked
>>> in. I have repeated this a number of times, but each time ZFS handles it
>>> beautifully. (Typical.)
>>>
>>> Any great ideas on how to simulate failed disks? Pulling them out doesn't
>>> generally work, since the controller gets notified of the disconnect, as
>>> opposed to the device simply no longer communicating.
>>>
>>> Now, there HAVE been changes to mpt_sas.c in illumos, most notably
>>>
>>> https://www.illumos.org/issues/3195
>>> https://www.illumos.org/issues/4310
>>> https://www.illumos.org/issues/5306
>>> https://www.illumos.org/issues/5483
>>>
>>> so I am hoping it perhaps has been addressed. Anyone dare venture a guess?
>>>
>>> The log entries for one of the SSDs dying and taking out the server look
>>> like this (and again, this is Solaris 10):
>>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> scsi: [ID 107833 kern.warning] WARNING:
>>> /pci@0,0/pci8086,340f@8/pci1000,3020@0 (mpt_sas0):
>>>
>>> Disconnected command timeout for Target 30
>>>
>>> {
>>> mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31140000
>>> } * 8
>>>
>>> mptsas_check_task_mgt: IOCStatus=0x4a
>>>
>>> mptsas_check_task_mgt: Task 0x3 failed. Target=30
>>>
>>> mptsas_ioc_task_management failed try to reset ioc to recovery!
>>>
>>> mpt0 Firmware version v12.0.0.0 (?)
>>>
>>> {
>>> /scsi_vhci/disk@g50015179596fd400 (sd2): Command Timeout on path
>>> mpt_sas1/disk@w50015179596fd400,0
>>>
>>> SCSI transport failed: reason 'timeout': retrying command
>>>
>>> /scsi_vhci/disk@g50015179596fa188 (sd16): Command failed to complete (4)
>>> on path mpt_sas1/disk@w50015179596fa188,0
>>>
>>> SCSI transport failed: reason 'reset': retrying command
>>>
>>> } * 8
>>>
>>> mptsas_restart_ioc failed
>>>
>>> Target 30 reset for command timeout recovery failed!
>>>
>>> MPT Firmware Fault, code: 1500
>>>
>>> mpt0 Firmware version v12.0.0.0 (?)
>>>
>>> mpt0: IOC Operational.
>>>
>>> {
>>> SCSI transport failed: reason 'reset': retrying command
>>> } * 16
>>>
>>> mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
>>>
>>> Error for Command: read(10)      Error Level: Retryable
>>> Requested Block: 85734615        Error Block: 85734615
>>> Vendor: ATA                      Serial Number: CVPR132407CH
>>> Sense Key: Unit Attention
>>> ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
>>>
>>> genunix: [ID 408114 kern.info] /pci@0,0/pci8086,340f@8/pci1000,3020@0
>>> (mpt_sas0) down
>>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> I would assume that last message about "down" is somewhat undesirable.
>>>
>>> Garrett D'Amore does make a good point about SATA devices in the "mpt_sas
>>> wedge" thread: all devices get reset when the driver tries to reset the
>>> one drive. But should/would that lead to a complete halt of all I/O? If
>>> that is the case, is there much we can do besides replacing all the
>>> hardware?
>>>
>>> Lund
>>>
>>> --
>>> Jorgen Lundman       | <[email protected]>
>>> Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
>>> Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
>>> Japan                | +81 (0)3-3375-1767 (home)
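
For anyone who wants to try John's NCQ suggestion on a Linux box, here is a
minimal sketch. It assumes the drive shows up as sda; note that on most
current kernels the knob lives under /sys/block/<dev>/device/ rather than
/sys/devices/block/, and the lowest value the kernel accepts is 1, which is
what effectively turns NCQ off for that drive:

~~~~~~~~~~~~~~~~~~~~~~~~
# show the current queue depth for sda
cat /sys/block/sda/device/queue_depth

# drop it to 1 to effectively disable NCQ for this drive (run as root;
# path and value assume the usual libata sysfs layout -- adjust sdX to match)
echo 1 > /sys/block/sda/device/queue_depth
~~~~~~~~~~~~~~~~~~~~~~~~

The setting does not persist across reboots, so it would need to be reapplied
from an init script or udev rule. And as John says, the HBA/RAID firmware may
need NCQ disabled separately for it to be off end to end.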
