Re: [OmniOS-discuss] zpool degraded while smart says disks are OK
Hi Richard,

On Mar 23, Richard Elling wrote:

> On Mar 21, 2014, at 10:13 PM, Tobias Oetiker t...@oetiker.ch wrote:
>
>> Yesterday Richard Elling wrote:
>>
>>> On Mar 21, 2014, at 3:23 PM, Tobias Oetiker t...@oetiker.ch wrote:
>>>
>>>> [...] it happened over time as you can see from the timestamps in
>>>> the log. The errors from zfs's point of view were 1 read and about
>>>> 30 write, but according to smart the disks are without flaw.
>>>
>>> Actually, SMART is pretty dumb. In most cases, it only looks for
>>> uncorrectable errors that are related to media or heads. For a clue
>>> to more permanent errors, you will want to look at the read/write
>>> error reports for errors that are corrected with possible delays.
>>> You can also look at the grown defects list. This behaviour is
>>> expected for drives with errors that are not being quickly corrected
>>> or have firmware bugs (horrors!) and where the disk does not do TLER
>>> (or its vendor's equivalent)
>>> -- richard
>>
>> the error counters look like this:
>>
>> Error counter log:
>>            Errors Corrected by           Total   Correction     Gigabytes    Total
>>                ECC          rereads/    errors   algorithm      processed    uncorrected
>>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
>> read:       3494        0         0      3494     449045         30.879           0
>> write:         0        0         0         0      39111       1793.323           0
>> verify:        0        0         0         0       8133          0.000           0
>
> Errors corrected without delay looks good. The problem lies elsewhere.
>
>> the disk vendor is HGST in case anyone has further ideas ... the
>> system has 20 of these disks and the problems occurred with three of
>> them. The system has been running fine for two months previously.
>
> ...and yet there are aborted commands, likely due to a reset after a
> timeout. Resets aren't issued without cause. There are two different
> resets issued by the sd driver: LU and bus. If the LU reset doesn't
> work, the resets are escalated to bus. This is, of course, tunable, but
> is rarely tuned. A bus reset for SAS is a questionable practice, since
> SAS is a fabric, not a bus. But the effect of a device in the fabric
> being reset could be seen as aborted commands by more than one target.
> To troubleshoot these cases, you need to look at all of the devices in
> the data path and map the common causes: HBAs, expanders, enclosures,
> etc. Traverse the devices looking for errors, as you did with the
> disks. Useful tools: sasinfo, lsiutil/sas2ircu, smp_utils, sg3_utils,
> mpathadm, fmtopo.

thanks for the hints ... after detaching/attaching the 'failed' disks,
they got resilvered and a subsequent scrub did not detect any errors ...
all a bit mysterious ... will keep an eye on the box to see how it fares
in the future ...

cheers
tobi

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
www.oetiker.ch t...@oetiker.ch +41 62 775 9902
*** We are hiring IT staff: www.oetiker.ch/jobs ***

___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss
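[Editor's note] The SCSI error counter log quoted above can be checked mechanically rather than by eye. A minimal sketch, assuming smartmontools is installed; the device path is a hypothetical placeholder, and the sample text is simply the table from this thread:

```shell
# Pull the per-row counters out of smartctl's SCSI "Error counter log"
# and summarize the columns worth watching (delayed corrections and
# uncorrected errors). On a live system you would feed in real output,
# e.g.:  smartctl -a /dev/rdsk/c0t11d0   # hypothetical device name
smartctl_output='
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:       3494        0         0      3494     449045         30.879           0
write:         0        0         0         0      39111       1793.323           0
verify:        0        0         0         0       8133          0.000           0
'
# Columns per row: label, fast, delayed, rewrites, total corrected,
# algorithm invocations, GB processed, uncorrected.
summary=$(printf '%s\n' "$smartctl_output" |
  awk '/^(read|write|verify):/ {
         printf "%s fast=%s delayed=%s uncorrected=%s\n", $1, $2, $3, $8
       }')
printf '%s\n' "$summary"
```

On this sample the delayed and uncorrected columns are all zero, which is Richard's point: these counters look healthy, so the fault lies elsewhere in the path.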
Re: [OmniOS-discuss] zpool degraded while smart says disks are OK
On Mar 21, 2014, at 10:13 PM, Tobias Oetiker t...@oetiker.ch wrote:

> Yesterday Richard Elling wrote:
>
>> On Mar 21, 2014, at 3:23 PM, Tobias Oetiker t...@oetiker.ch wrote:
>>
>>> [...] it happened over time as you can see from the timestamps in
>>> the log. The errors from zfs's point of view were 1 read and about
>>> 30 write, but according to smart the disks are without flaw.
>>
>> Actually, SMART is pretty dumb. In most cases, it only looks for
>> uncorrectable errors that are related to media or heads. For a clue
>> to more permanent errors, you will want to look at the read/write
>> error reports for errors that are corrected with possible delays.
>> You can also look at the grown defects list. This behaviour is
>> expected for drives with errors that are not being quickly corrected
>> or have firmware bugs (horrors!) and where the disk does not do TLER
>> (or its vendor's equivalent)
>> -- richard
>
> the error counters look like this:
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:       3494        0         0      3494     449045         30.879           0
> write:         0        0         0         0      39111       1793.323           0
> verify:        0        0         0         0       8133          0.000           0

Errors corrected without delay looks good. The problem lies elsewhere.

> the disk vendor is HGST in case anyone has further ideas ... the
> system has 20 of these disks and the problems occurred with three of
> them. The system has been running fine for two months previously.

...and yet there are aborted commands, likely due to a reset after a
timeout. Resets aren't issued without cause. There are two different
resets issued by the sd driver: LU and bus. If the LU reset doesn't
work, the resets are escalated to bus. This is, of course, tunable, but
is rarely tuned. A bus reset for SAS is a questionable practice, since
SAS is a fabric, not a bus. But the effect of a device in the fabric
being reset could be seen as aborted commands by more than one target.
To troubleshoot these cases, you need to look at all of the devices in
the data path and map the common causes: HBAs, expanders, enclosures,
etc. Traverse the devices looking for errors, as you did with the
disks. Useful tools: sasinfo, lsiutil/sas2ircu, smp_utils, sg3_utils,
mpathadm, fmtopo.
-- richard

> Vendor:               HGST
> Product:              HUS724030ALS640
> Revision:             A152
> User Capacity:        3,000,592,982,016 bytes [3.00 TB]
> Logical block size:   512 bytes
> Serial number:        P8J20SNV
> Device type:          disk
> Transport protocol:   SAS
>
> cheers
> tobi
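[Editor's note] The "tunable, but rarely tuned" timeout Richard mentions is, on illumos/OmniOS, commonly adjusted through the sd driver's per-command timeout. A hedged sketch, assuming the classic `sd_io_time` tunable (default 0x3C = 60 seconds, matching the 60-second abort cadence discussed in this thread); verify the tunable exists on your release before applying:

```
* /etc/system fragment -- a sketch, not a recommendation.
* sd_io_time is the assumed per-command timeout that elapses before
* the sd driver starts the abort / LU-reset / bus-reset escalation.
set sd:sd_io_time = 0x3C
```

Lowering it makes a hung disk fail faster but also makes slow-but-recoverable disks fail more often, so treat any change with care.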
Re: [OmniOS-discuss] zpool degraded while smart says disks are OK
On Mar 21, 2014, at 9:48 AM, Tobias Oetiker t...@oetiker.ch wrote:

> a zpool on one of our boxes has been degraded with several disks
> faulted ...
>
> * the disks are all sas direct attached
> * according to smartctl the offending disks have no faults.
> * zfs decided to fault the disks after the events below.
>
> I have now told the pool to clear the errors and it is resilvering the
> disks ... (in progress)
>
> any idea what is happening here ?
>
> Mar  2 22:21:51 foo scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3c04@2/pci1000,3020@0 (mpt_sas0):
> Mar  2 22:21:51 foo   mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x3117
> Mar  2 22:21:51 foo scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3c04@2/pci1000,3020@0 (mpt_sas0):
> Mar  2 22:21:51 foo   mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x3117
> Mar  2 22:21:51 foo scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c04@2/pci1000,3020@0 (mpt_sas0):
> Mar  2 22:21:51 foo   Log info 0x3117 received for target 11.
> Mar  2 22:21:51 foo   scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
> Mar  2 22:21:51 foo scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c04@2/pci1000,3020@0 (mpt_sas0):
> Mar  2 22:21:51 foo   Log info 0x3117 received for target 11.
> Mar  2 22:21:51 foo   scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc

These are command aborted reports from the target device. You will see
these every 60 seconds if the disk is not responding and the subsequent
reset of the disk aborts the commands that are not responding.
-- richard

> Mar  5 02:20:53 foo scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3c06@2,2/pci1000,3020@0 (mpt_sas1):
> Mar  5 02:20:53 foo   mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x3117
> Mar  5 02:20:53 foo scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3c06@2,2/pci1000,3020@0 (mpt_sas1):
> Mar  5 02:20:53 foo   mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x3117
> Mar  5 02:20:53 foo scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c06@2,2/pci1000,3020@0 (mpt_sas1):
> Mar  5 02:20:53 foo   Log info 0x3117 received for target 10.
> Mar  5 02:20:53 foo   scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
> Mar  5 02:20:53 foo scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c06@2,2/pci1000,3020@0 (mpt_sas1):
> Mar  5 02:20:53 foo   Log info 0x3117 received for target 10.
> Mar  5 02:20:53 foo   scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc

--
richard.ell...@richardelling.com +1-760-896-4422
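[Editor's note] Abort reports like the ones above can be tallied per target to see whether one disk or several are involved. A minimal sketch; the sample lines are reproduced from the log excerpt in this thread, and on a live box you would read /var/adm/messages instead:

```shell
# Count mpt_sas "Log info 0x3117 received for target N." reports per
# SCSI target. The sample input below is taken from this thread;
# in practice, pipe /var/adm/messages through the same awk program.
messages_sample='
Mar  2 22:21:51 foo   Log info 0x3117 received for target 11.
Mar  2 22:21:51 foo   Log info 0x3117 received for target 11.
Mar  5 02:20:53 foo   Log info 0x3117 received for target 10.
Mar  5 02:20:53 foo   Log info 0x3117 received for target 10.
'
abort_counts=$(printf '%s\n' "$messages_sample" |
  awk '/Log info 0x3117 received for target/ {
         t = $NF            # last field, e.g. "11."
         sub(/\.$/, "", t)  # strip the trailing dot
         n[t]++
       }
       END { for (t in n) printf "target %s: %d aborts\n", t, n[t] }' |
  sort)
printf '%s\n' "$abort_counts"
```

Several distinct targets racking up aborts points at a shared component (HBA, expander, cabling, power) rather than at the disks themselves, which matches Richard's advice to walk the whole data path.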
Re: [OmniOS-discuss] zpool degraded while smart says disks are OK
Today Zach Malone wrote:

> On Fri, Mar 21, 2014 at 3:50 PM, Richard Elling
> richard.ell...@richardelling.com wrote:
>
>> On Mar 21, 2014, at 9:48 AM, Tobias Oetiker t...@oetiker.ch wrote:
>>
>>> a zpool on one of our boxes has been degraded with several disks
>>> faulted ...
>>>
>>> * the disks are all sas direct attached
>>> * according to smartctl the offending disks have no faults.
>>> * zfs decided to fault the disks after the events below.
>>>
>>> I have now told the pool to clear the errors and it is resilvering
>>> the disks ... (in progress)
>>>
>>> any idea what is happening here ?
>>> ...
>
> Did all the disks fault at the same time, or was it spread out over a
> longer period? I'd suspect your power supply or disk controller. What
> are your zpool errors?

it happened over time as you can see from the timestamps in the log.
The errors from zfs's point of view were 1 read and about 30 write, but
according to smart the disks are without flaw.

cheers
tobi
Re: [OmniOS-discuss] zpool degraded while smart says disks are OK
On Mar 21, 2014, at 3:23 PM, Tobias Oetiker t...@oetiker.ch wrote:

> Today Zach Malone wrote:
>
>> On Fri, Mar 21, 2014 at 3:50 PM, Richard Elling
>> richard.ell...@richardelling.com wrote:
>>
>>> On Mar 21, 2014, at 9:48 AM, Tobias Oetiker t...@oetiker.ch wrote:
>>>
>>>> a zpool on one of our boxes has been degraded with several disks
>>>> faulted ...
>>>>
>>>> * the disks are all sas direct attached
>>>> * according to smartctl the offending disks have no faults.
>>>> * zfs decided to fault the disks after the events below.
>>>>
>>>> I have now told the pool to clear the errors and it is resilvering
>>>> the disks ... (in progress)
>>>>
>>>> any idea what is happening here ?
>>>> ...
>>
>> Did all the disks fault at the same time, or was it spread out over a
>> longer period? I'd suspect your power supply or disk controller. What
>> are your zpool errors?
>
> it happened over time as you can see from the timestamps in the log.
> The errors from zfs's point of view were 1 read and about 30 write,
> but according to smart the disks are without flaw.

Actually, SMART is pretty dumb. In most cases, it only looks for
uncorrectable errors that are related to media or heads. For a clue to
more permanent errors, you will want to look at the read/write error
reports for errors that are corrected with possible delays. You can
also look at the grown defects list. This behaviour is expected for
drives with errors that are not being quickly corrected or have
firmware bugs (horrors!) and where the disk does not do TLER (or its
vendor's equivalent)
-- richard
Re: [OmniOS-discuss] zpool degraded while smart says disks are OK
Yesterday Richard Elling wrote:

> On Mar 21, 2014, at 3:23 PM, Tobias Oetiker t...@oetiker.ch wrote:
>
>> [...] it happened over time as you can see from the timestamps in
>> the log. The errors from zfs's point of view were 1 read and about
>> 30 write, but according to smart the disks are without flaw.
>
> Actually, SMART is pretty dumb. In most cases, it only looks for
> uncorrectable errors that are related to media or heads. For a clue to
> more permanent errors, you will want to look at the read/write error
> reports for errors that are corrected with possible delays. You can
> also look at the grown defects list. This behaviour is expected for
> drives with errors that are not being quickly corrected or have
> firmware bugs (horrors!) and where the disk does not do TLER (or its
> vendor's equivalent)
> -- richard

the error counters look like this:

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:       3494        0         0      3494     449045         30.879           0
write:         0        0         0         0      39111       1793.323           0
verify:        0        0         0         0       8133          0.000           0

the disk vendor is HGST in case anyone has further ideas ... the system
has 20 of these disks and the problems occurred with three of them. The
system has been running fine for two months previously.

Vendor:               HGST
Product:              HUS724030ALS640
Revision:             A152
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
Serial number:        P8J20SNV
Device type:          disk
Transport protocol:   SAS

cheers
tobi