Yeah, it looks like I spoke too soon... I just realized you said the SSDs are actually failing, not just going offline and appearing to fail.

Sorry, I don't know what could be causing the failure to affect your entire server/zpool. Man, those logs and symptoms sure looked like a good old-fashioned NCQ issue.

Thanks and have a great day,

John Barfield
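P.S. In case it is still useful for comparison testing, here is a minimal sketch of the Linux-side NCQ workaround I mention below. The sysfs path and the minimum value are from memory of Linux boxes, not illumos, so verify them on your own kernel; /dev/sda is just a placeholder:

    # Check the current queue depth for the placeholder disk /dev/sda;
    # SATA drives with NCQ active typically report 31 or 32 here.
    cat /sys/block/sda/device/queue_depth

    # Drop the depth to 1. Many kernels reject 0, and a depth of 1
    # effectively disables NCQ: only one command outstanding at a time.
    echo 1 > /sys/block/sda/device/queue_depth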
> On Jan 20, 2015, at 9:10 PM, John Barfield via illumos-discuss
> <[email protected]> wrote:
>
> I'm by no means an illumos expert or dev... but in my time as an AppLogic
> support engineer I've seen these same issues on Linux when the following is
> true:
>
> 0. older kernel
> +
> 1. long uptime
> +
> 2. high load average
> +
> 3. SATA NCQ enabled
>
> That combination causes SCSI interfaces to disconnect randomly and causes
> major disk issues.
>
> On Linux, disabling NCQ in /sys/block/sda/device/ by setting queue_depth
> to 0 resolves the issue.
>
> I don't know if it could be the disk or controller module that these
> machines share in common, but the logs and symptoms are almost identical.
>
> Again, I could be way off, because I'm no illumos expert... it's just
> eerily familiar.
>
> I'd find out how to disable NCQ and then test it.
>
> Sometimes you have to disable it in both the HBA/RAID card and the OS to
> disable it completely.
>
> Thanks and have a great day,
>
> John Barfield
>
>> On Jan 20, 2015, at 8:53 PM, Jorgen Lundman via illumos-discuss
>> <[email protected]> wrote:
>>
>> Hello list,
>>
>> I should apologise: technically speaking, we are still running Solaris
>> 10/u10, which isn't illumos. We would love to move to the illumos kernel,
>> because of the problems we are encountering. More on that in a sec...
>>
>> What appears to happen is that each time a device dies in our
>> Supermicro+LSI SAS2008 NFS servers, it takes out the whole server. The
>> last three failures, in December, were all SSDs dying (three separate
>> times/servers), and each time we had to power cycle the server.
>>
>> Since we have about 50 of these storage servers, changing the OS would
>> mean a 3am maintenance window for each one. It would be nice if I could
>> show that all those sleepless nights would be worth it, but I'm having a
>> hard time replicating the issue.
>>
>> I used a SATA extension cable and cut one of the data lines during
>> transfers to see if it would trigger the problem, but the damned thing
>> ended up being a dream advertisement for how well ZFS handles failures.
>> The error count went up, the SSD was marked faulty, and the spare kicked
>> in. I have repeated this a number of times, but each time ZFS handles it
>> beautifully. (Typical.)
>>
>> Any great ideas on how to simulate failed disks? Pulling them out doesn't
>> generally work, since the controller gets notified of the disconnect,
>> rather than the device simply no longer communicating.
>>
>> Now, there HAVE been some changes to mpt_sas.c in illumos, most notably:
>>
>> https://www.illumos.org/issues/3195
>> https://www.illumos.org/issues/4310
>> https://www.illumos.org/issues/5306
>> https://www.illumos.org/issues/5483
>>
>> so I am hoping it has perhaps been addressed. Anyone dare venture a guess?
>>
>> The log entries for one of the SSDs dying and taking out the server look
>> like this (and again, this is Solaris 10):
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> scsi: [ID 107833 kern.warning] WARNING:
>> /pci@0,0/pci8086,340f@8/pci1000,3020@0 (mpt_sas0):
>>
>> Disconnected command timeout for Target 30
>>
>> {
>> mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31140000
>> } * 8
>>
>> mptsas_check_task_mgt: IOCStatus=0x4a
>>
>> mptsas_check_task_mgt: Task 0x3 failed. Target=30
>>
>> mptsas_ioc_task_management failed try to reset ioc to recovery!
>>
>> mpt0 Firmware version v12.0.0.0 (?)
>>
>> {
>> /scsi_vhci/disk@g50015179596fd400 (sd2): Command Timeout on path
>> mpt_sas1/disk@w50015179596fd400,0
>>
>> SCSI transport failed: reason 'timeout': retrying command
>>
>> /scsi_vhci/disk@g50015179596fa188 (sd16): Command failed to complete (4)
>> on path mpt_sas1/disk@w50015179596fa188,0
>>
>> SCSI transport failed: reason 'reset': retrying command
>> } * 8
>>
>> mptsas_restart_ioc failed
>>
>> Target 30 reset for command timeout recovery failed!
>>
>> MPT Firmware Fault, code: 1500
>>
>> mpt0 Firmware version v12.0.0.0 (?)
>>
>> mpt0: IOC Operational.
>>
>> {
>> SCSI transport failed: reason 'reset': retrying command
>> } * 16
>>
>> mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
>>
>> Error for Command: read(10)    Error Level: Retryable
>> Requested Block: 85734615      Error Block: 85734615
>> Vendor: ATA                    Serial Number: CVPR132407CH
>> Sense Key: Unit Attention
>> ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
>>
>> genunix: [ID 408114 kern.info] /pci@0,0/pci8086,340f@8/pci1000,3020@0
>> (mpt_sas0) down
>> ~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> I would assume that last message about "down" is somewhat... undesirable.
>>
>> Garrett D'Amore does make a good point about SATA devices in the "mpt_sas
>> wedge" thread, namely that all devices get reset when the driver tries to
>> reset the one drive. But should/would that lead to a complete halt of all
>> IO? If that is the case, there is not much we can do besides replacing all
>> the hardware?
>>
>> Lund
>>
>> --
>> Jorgen Lundman       | <[email protected]>
>> Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
>> Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
>> Japan                | +81 (0)3-3375-1767 (home)
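One more thought on Jorgen's question above about simulating failed disks: ZFS ships an undocumented fault-injection tool, zinject, which can make a vdev return I/O errors without touching the hardware. This is only a sketch; it assumes zinject is present on your Solaris 10 build, and the pool name "tank" and device c0t30d0 are placeholders:

    # Make ZFS see persistent I/O errors on one vdev of pool "tank";
    # c0t30d0 stands in for the SSD under test.
    zinject -d c0t30d0 -e io tank

    # List the active injection handlers.
    zinject

    # Remove all injection handlers once the test is done.
    zinject -c all

One caveat: zinject injects errors above the driver, so it exercises the ZFS failure path but probably not the mpt_sas/firmware reset path shown in the log above; the cut-cable trick is still the closer simulation of that.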
