Hello list. I should apologise: technically speaking, we are still running Solaris 10/u10, which isn't IllumOS. We would love to move to an IllumOS kernel, given the problems we are encountering. More on that in a sec...
So, what appears to happen is that each time a device dies in one of our Supermicro + LSI SAS2008 NFS servers, it takes out the whole server. The last three incidents, all in December, were SSDs dying (three separate times, on three separate servers), and each time we had to power cycle the server.

Since we have about 50 of these storage servers, changing the OS would mean a 3am maintenance window for each one, so it would be nice if I could show that all those sleepless nights would be worth it. But I'm having a hard time replicating the issue. I used a SATA extension cable and cut one of the data lines during transfers to see if that would trigger the problem, but the damned thing ended up being a dream advertisement for how well ZFS handles failures: the error count went up, the SSD was marked faulty, and the spare kicked in. I have repeated this a number of times and each time ZFS handles it beautifully. (Typical.)

Any great ideas on how to simulate failed disks? Pulling them out doesn't generally work, since the controller gets notified of the disconnect, as opposed to the device simply no longer communicating. (There is a rough fault-injection sketch at the bottom of this mail.)

Now, there HAVE been some changes to mpt_sas.c in IllumOS, most notably

https://www.illumos.org/issues/3195
https://www.illumos.org/issues/4310
https://www.illumos.org/issues/5306
https://www.illumos.org/issues/5483

so I am hoping the problem has perhaps been addressed there. Anyone dare venture a guess?

The log entries from one of the SSDs dying and taking out the server look like this (and again, this is Solaris 10):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340f@8/pci1000,3020@0 (mpt_sas0):
    Disconnected command timeout for Target 30
{
  mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31140000
} * 8
mptsas_check_task_mgt: IOCStatus=0x4a
mptsas_check_task_mgt: Task 0x3 failed. Target=30
mptsas_ioc_task_management failed try to reset ioc to recovery!
mpt0 Firmware version v12.0.0.0 (?)
{
  /scsi_vhci/disk@g50015179596fd400 (sd2):
      Command Timeout on path mpt_sas1/disk@w50015179596fd400,0
  SCSI transport failed: reason 'timeout': retrying command
  /scsi_vhci/disk@g50015179596fa188 (sd16):
      Command failed to complete (4) on path mpt_sas1/disk@w50015179596fa188,0
  SCSI transport failed: reason 'reset': retrying command
} * 8
mptsas_restart_ioc failed
Target 30 reset for command timeout recovery failed!
MPT Firmware Fault, code: 1500
mpt0 Firmware version v12.0.0.0 (?)
mpt0: IOC Operational.
{
  SCSI transport failed: reason 'reset': retrying command
} * 16
mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
Error for Command: read(10)        Error Level: Retryable
Requested Block: 85734615          Error Block: 85734615
Vendor: ATA                        Serial Number: CVPR132407CH
Sense Key: Unit Attention
ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
genunix: [ID 408114 kern.info] /pci@0,0/pci8086,340f@8/pci1000,3020@0 (mpt_sas0) down
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I would assume that last message about the controller being "down" is somewhat... undesirable.

Garrett D'Amore does make a good point about SATA devices in the "mpt_sas wedge" thread: all devices get reset when the driver tries to reset the one drive. But should/would that lead to a complete halt of all IO? If that is the case, is there anything we can do short of replacing all the hardware?
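Coming back to the question of simulating failures: one thing I could try without touching the cabling is ZFS's own fault-injection tool, zinject. It exists on illumos (I am not sure it ships on stock Solaris 10), and since it injects errors above the HBA it exercises ZFS's error handling rather than the mpt_sas/firmware reset path that seems to be wedging here. The sketch below is only a rough idea, not something we have run; the pool name "testpool" and the vdev name are made up.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/usr/bin/env python
# Rough sketch only: "testpool" and the vdev name are hypothetical, and
# zinject may not exist (or may differ) on stock Solaris 10; it does exist
# on illumos. It injects errors at the ZFS layer, above the HBA/firmware.
import subprocess

POOL = "testpool"                   # hypothetical scratch pool
VDEV = "c0t5000C500A1B2C3D4d0"      # hypothetical disk in that pool

def run(cmd):
    print("# " + " ".join(cmd))
    subprocess.check_call(cmd)

# Make I/O to the vdev fail as if the device had disappeared (ENXIO).
run(["zinject", "-d", VDEV, "-e", "nxio", POOL])

# Push some traffic through the pool so the injected errors are actually hit
# (assumes the pool is mounted at /testpool).
run(["dd", "if=/dev/zero", "of=/%s/zinject-test" % POOL,
     "bs=1024k", "count=256"])
run(["zpool", "status", "-v", POOL])

# An alternative is to fault the vdev outright so the hot spare kicks in:
#   run(["zinject", "-d", VDEV, "-A", "fault", POOL])

# Clear all injection handlers when done.
run(["zinject", "-c", "all"])
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Of course, if zinject can't reproduce the wedge either, that at least points the finger at something below ZFS, i.e. the driver or the firmware.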
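And whatever the fix ends up being, the failure signature above is distinctive enough that we could at least get paged before the box wedges completely. The following is just a sketch of the kind of watcher I mean, not something we actually run: the log path is the standard Solaris /var/adm/messages, the match strings are lifted from the messages quoted above, and the mailx alert is a placeholder.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/usr/bin/env python
# Sketch of a watcher for the mpt_sas failure signature quoted above.
# Assumptions: kernel messages go to /var/adm/messages (log rotation is
# not handled here), and the alert is a plain mailx to root.
import subprocess
import time

LOG = "/var/adm/messages"
PATTERNS = (
    "Disconnected command timeout",       # first sign of the target going away
    "mptsas_ioc_task_management failed",  # driver giving up on the target reset
    "MPT Firmware Fault",                 # controller firmware has faulted
)

def alert(line):
    # Placeholder notification: mail the offending line to root.
    p = subprocess.Popen(["mailx", "-s", "mpt_sas failure signature", "root"],
                         stdin=subprocess.PIPE)
    p.communicate(line)

def watch():
    f = open(LOG)
    f.seek(0, 2)                          # start at the end, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)
            continue
        if any(pat in line for pat in PATTERNS):
            alert(line)

if __name__ == "__main__":
    watch()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~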
Lund

--
Jorgen Lundman       | <[email protected]>
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
