Re: [OmniOS-discuss] zpool degraded while SMART says disks are OK

2014-03-31 Thread Tobias Oetiker
Hi Richard,

On Mar 23, 2014, Richard Elling wrote:


 On Mar 21, 2014, at 10:13 PM, Tobias Oetiker t...@oetiker.ch wrote:

  Yesterday Richard Elling wrote:
 
 
  On Mar 21, 2014, at 3:23 PM, Tobias Oetiker t...@oetiker.ch wrote:
 
  [...]
 
  it happened over time, as you can see from the timestamps in the
  log. The errors from ZFS's point of view were 1 read error and about 30 write errors,
 
  but according to SMART the disks are without flaw
 
  Actually, SMART is pretty dumb. In most cases, it only looks for uncorrectable
  errors that are related to media or heads. For a clue to more permanent errors,
  you will want to look at the read/write error reports for errors that are
  corrected with possible delays. You can also look at the grown defects list.
 
  This behaviour is expected for drives with errors that are not being quickly
  corrected or have firmware bugs (horrors!) and where the disk does not do TLER
  (or its vendor's equivalent).
  -- richard
 
  the error counters look like this:
 
 
  Error counter log:
             Errors Corrected by           Total   Correction     Gigabytes    Total
                 ECC          rereads/    errors   algorithm      processed    uncorrected
             fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
  read:       3494        0         0      3494     449045         30.879           0
  write:         0        0         0         0      39111       1793.323           0
  verify:        0        0         0         0       8133          0.000           0

 Errors corrected without delay look good. The problem lies elsewhere.

 
  the disk vendor is HGST, in case anyone has further ideas ... the system has
  20 of these disks and the problems occurred with three of them. The system
  had been running fine for two months previously.

 ...and yet there are aborted commands, likely due to a reset after a timeout.
 Resets aren't issued without cause.

 There are two different resets issued by the sd driver: LU and bus. If the
 LU reset doesn't work, the resets are escalated to bus. This is, of course,
 tunable, but is rarely tuned. A bus reset for SAS is a questionable practice,
 since SAS is a fabric, not a bus. But the effect of a device in the fabric
 being reset could be seen as aborted commands by more than one target. To
 troubleshoot these cases, you need to look at all of the devices in the data
 path and map the common causes: HBAs, expanders, enclosures, etc. Traverse
 the devices looking for errors, as you did with the disks. Useful tools:
 sasinfo, lsiutil/sas2ircu, smp_utils, sg3_utils, mpathadm, fmtopo.

thanks for the hints ... after detaching/attaching the 'failed'
disks, they got resilvered and a subsequent scrub did not detect
any errors ...

all a bit mysterious ... will keep an eye on the box to see how it
fares in the future ...
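
for the "keeping an eye on it" part, something along these lines should
catch a recurrence early ("tank" is a placeholder pool name; device paths
differ per box):

  # scrub periodically, then confirm that all pools stay healthy
  zpool scrub tank
  zpool status -x
  # per-device soft/hard/transport error counters; watch for movement
  iostat -En | egrep 'Errors|Vendor'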

cheers
tobi


-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
www.oetiker.ch t...@oetiker.ch +41 62 775 9902
*** We are hiring IT staff: www.oetiker.ch/jobs ***


Re: [OmniOS-discuss] zpool degraded while SMART says disks are OK

2014-03-23 Thread Richard Elling

On Mar 21, 2014, at 10:13 PM, Tobias Oetiker t...@oetiker.ch wrote:

 Yesterday Richard Elling wrote:
 
 
 On Mar 21, 2014, at 3:23 PM, Tobias Oetiker t...@oetiker.ch wrote:
 
 [...]
 
 it happened over time, as you can see from the timestamps in the
 log. The errors from ZFS's point of view were 1 read error and about 30 write errors,
 
 but according to SMART the disks are without flaw
 
 Actually, SMART is pretty dumb. In most cases, it only looks for uncorrectable
 errors that are related to media or heads. For a clue to more permanent errors,
 you will want to look at the read/write error reports for errors that are
 corrected with possible delays. You can also look at the grown defects list.
 
 This behaviour is expected for drives with errors that are not being quickly
 corrected or have firmware bugs (horrors!) and where the disk does not do TLER
 (or its vendor's equivalent).
 -- richard
 
 the error counters look like this:
 
 
 Error counter log:
            Errors Corrected by           Total   Correction     Gigabytes    Total
                ECC          rereads/    errors   algorithm      processed    uncorrected
            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
 read:       3494        0         0      3494     449045         30.879           0
 write:         0        0         0         0      39111       1793.323           0
 verify:        0        0         0         0       8133          0.000           0

Errors corrected without delay look good. The problem lies elsewhere.

 
 the disk vendor is HGST, in case anyone has further ideas ... the system has
 20 of these disks and the problems occurred with three of them. The system
 had been running fine for two months previously.

...and yet there are aborted commands, likely due to a reset after a timeout.
Resets aren't issued without cause.

There are two different resets issued by the sd driver: LU and bus. If the
LU reset doesn't work, the resets are escalated to bus. This is, of course,
tunable, but is rarely tuned. A bus reset for SAS is a questionable practice,
since SAS is a fabric, not a bus. But the effect of a device in the fabric
being reset could be seen as aborted commands by more than one target. To
troubleshoot these cases, you need to look at all of the devices in the data
path and map the common causes: HBAs, expanders, enclosures, etc. Traverse
the devices looking for errors, as you did with the disks. Useful tools:
sasinfo, lsiutil/sas2ircu, smp_utils, sg3_utils, mpathadm, fmtopo.
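
A first pass over the fabric might look something like this (device and
expander paths below are placeholders; sg_logs comes from sg3_utils,
smp_rep_phy_err_log from smp_utils, lsiutil/sas2ircu from LSI):

  # HBA, port, and attached-device inventory as the OS sees it
  sasinfo hba -v
  sasinfo target-port -v
  # multipath state, if mpxio is in play
  mpathadm list lu
  # FMA's view of the enclosure/bay topology
  /usr/lib/fm/fmd/fmtopo
  # per-PHY error counters on a disk: protocol-specific port log page (0x18)
  sg_logs --page=0x18 /dev/rdsk/c0t5000CCA01B123456d0s0
  # per-PHY error counters on an expander (SMP target path is a placeholder)
  smp_rep_phy_err_log --phy=4 /dev/smp/expd0

And on the "tunable, but rarely tuned" point: the sd reset/retry behaviour
can be adjusted per drive model in /etc/driver/drv/sd.conf. A sketch only --
the property names are stock sd tunables, but the values are illustrative,
not a recommendation:

  sd-config-list = "HGST    HUS724030ALS640", "reset-lun:true, retries-timeout:3";
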
 -- richard


 
 Vendor:               HGST
 Product:              HUS724030ALS640
 Revision:             A152
 User Capacity:        3,000,592,982,016 bytes [3.00 TB]
 Logical block size:   512 bytes
 Serial number:        P8J20SNV
 Device type:          disk
 Transport protocol:   SAS
 
 cheers
 tobi
 
 
 
 -- 
 Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
 www.oetiker.ch t...@oetiker.ch +41 62 775 9902
 *** We are hiring IT staff: www.oetiker.ch/jobs ***



Re: [OmniOS-discuss] zpool degraded while SMART says disks are OK

2014-03-21 Thread Richard Elling

On Mar 21, 2014, at 9:48 AM, Tobias Oetiker t...@oetiker.ch wrote:

 a zpool on one of our boxes has been degraded with several disks
 faulted ...
 
 * the disks are all sas direct attached
 * according to smartctl the offending disks have no faults.
 * zfs decided to fault the disks after the events below.
 
 I have now told the pool to clear the errors and it is resilvering the disks 
 ... (in progress)
 
 any idea what is happening here?
 
 Mar  2 22:21:51 foo scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3c04@2/pci1000,3020@0 (mpt_sas0):
 Mar  2 22:21:51 foo mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x3117
 Mar  2 22:21:51 foo scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3c04@2/pci1000,3020@0 (mpt_sas0):
 Mar  2 22:21:51 foo mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x3117
 Mar  2 22:21:51 foo scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c04@2/pci1000,3020@0 (mpt_sas0):
 Mar  2 22:21:51 foo Log info 0x3117 received for target 11.
 Mar  2 22:21:51 foo scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
 Mar  2 22:21:51 foo scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c04@2/pci1000,3020@0 (mpt_sas0):
 Mar  2 22:21:51 foo Log info 0x3117 received for target 11.
 Mar  2 22:21:51 foo scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc

These are command aborted reports from the target device. You will see these
every 60 seconds if the disk is not responding and the subsequent reset of the
disk aborts the commands that are not responding.
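
If you want to see what FMA recorded around those aborts, something like the
following is a reasonable first look (the class name in the comment is the
usual suspect, not a guarantee of what you'll find):

  # raw ereport telemetry, full detail; look for ereport.io.scsi.cmd.disk.*
  fmdump -eV | less
  # anything FMA has actually diagnosed into a fault
  fmadm faulty
  # cumulative soft/hard/transport error counters per device
  iostat -En
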
 -- richard

 
 
 Mar  5 02:20:53 foo scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3c06@2,2/pci1000,3020@0 (mpt_sas1):
 Mar  5 02:20:53 foo mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x3117
 Mar  5 02:20:53 foo scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3c06@2,2/pci1000,3020@0 (mpt_sas1):
 Mar  5 02:20:53 foo mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x3117
 Mar  5 02:20:53 foo scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c06@2,2/pci1000,3020@0 (mpt_sas1):
 Mar  5 02:20:53 foo Log info 0x3117 received for target 10.
 Mar  5 02:20:53 foo scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
 Mar  5 02:20:53 foo scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c06@2,2/pci1000,3020@0 (mpt_sas1):
 Mar  5 02:20:53 foo Log info 0x3117 received for target 10.
 Mar  5 02:20:53 foo scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
 
 -- 
 Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
 www.oetiker.ch t...@oetiker.ch +41 62 775 9902
 *** We are hiring IT staff: www.oetiker.ch/jobs ***

--

richard.ell...@richardelling.com
+1-760-896-4422





Re: [OmniOS-discuss] zpool degraded while SMART says disks are OK

2014-03-21 Thread Tobias Oetiker
Today Zach Malone wrote:

 On Fri, Mar 21, 2014 at 3:50 PM, Richard Elling
 richard.ell...@richardelling.com wrote:
 
  On Mar 21, 2014, at 9:48 AM, Tobias Oetiker t...@oetiker.ch wrote:
 
  a zpool on one of our boxes has been degraded with several disks
  faulted ...
 
  * the disks are all sas direct attached
  * according to smartctl the offending disks have no faults.
  * zfs decided to fault the disks after the events below.
 
  I have now told the pool to clear the errors and it is resilvering the disks
  ... (in progress)
 
  any idea what is happening here?

 ...

 Did all the disks fault at the same time, or was it spread out over a
 longer period?  I'd suspect your power supply or disk controller.
 What are your zpool errors?

it happened over time, as you can see from the timestamps in the
log. The errors from ZFS's point of view were 1 read error and about 30 write errors,

but according to SMART the disks are without flaw

cheers
tobi



-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
www.oetiker.ch t...@oetiker.ch +41 62 775 9902
*** We are hiring IT staff: www.oetiker.ch/jobs ***


Re: [OmniOS-discuss] zpool degraded while SMART says disks are OK

2014-03-21 Thread Richard Elling

On Mar 21, 2014, at 3:23 PM, Tobias Oetiker t...@oetiker.ch wrote:

 Today Zach Malone wrote:
 
 On Fri, Mar 21, 2014 at 3:50 PM, Richard Elling
 richard.ell...@richardelling.com wrote:
 
 On Mar 21, 2014, at 9:48 AM, Tobias Oetiker t...@oetiker.ch wrote:
 
 a zpool on one of our boxes has been degraded with several disks
 faulted ...
 
 * the disks are all sas direct attached
 * according to smartctl the offending disks have no faults.
 * zfs decided to fault the disks after the events below.
 
 I have now told the pool to clear the errors and it is resilvering the disks
 ... (in progress)
 
 any idea what is happening here?
 
 ...
 
 Did all the disks fault at the same time, or was it spread out over a
 longer period?  I'd suspect your power supply or disk controller.
 What are your zpool errors?
 
 it happened over time, as you can see from the timestamps in the
 log. The errors from ZFS's point of view were 1 read error and about 30 write errors,
 
 but according to SMART the disks are without flaw

Actually, SMART is pretty dumb. In most cases, it only looks for uncorrectable
errors that are related to media or heads. For a clue to more permanent errors,
you will want to look at the read/write error reports for errors that are 
corrected with possible delays. You can also look at the grown defects list.
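
Concretely, on a SAS drive those counters and the grown list can be pulled
with smartmontools or sg3_utils, along these lines (the device path is a
placeholder):

  # full report: error counter log, grown defect list size, self-test results
  smartctl -a /dev/rdsk/c0t5000CCA01B123456d0s0
  # just the error counter log (the fast/delayed/rewrite columns)
  smartctl -l error /dev/rdsk/c0t5000CCA01B123456d0s0
  # grown defect list element count via sg3_utils
  sg_reassign --grown /dev/rdsk/c0t5000CCA01B123456d0s0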

This behaviour is expected for drives with errors that are not being quickly 
corrected or have firmware bugs (horrors!) and where the disk does not do TLER
(or its vendor's equivalent).
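
On SAS disks the TLER analogue is the recovery time limit in the read-write
error recovery mode page; if sdparm (a sibling of sg3_utils) is installed, it
can be inspected along these lines (again, the path is a placeholder):

  # decode the Read-Write Error Recovery mode page; long form shows all fields
  sdparm --page=rw --long /dev/rdsk/c0t5000CCA01B123456d0s0
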
 -- richard



Re: [OmniOS-discuss] zpool degraded while SMART says disks are OK

2014-03-21 Thread Tobias Oetiker
Yesterday Richard Elling wrote:


 On Mar 21, 2014, at 3:23 PM, Tobias Oetiker t...@oetiker.ch wrote:

[...]
 
  it happened over time, as you can see from the timestamps in the
  log. The errors from ZFS's point of view were 1 read error and about 30 write errors,
 
  but according to SMART the disks are without flaw

 Actually, SMART is pretty dumb. In most cases, it only looks for uncorrectable
 errors that are related to media or heads. For a clue to more permanent errors,
 you will want to look at the read/write error reports for errors that are
 corrected with possible delays. You can also look at the grown defects list.
 
 This behaviour is expected for drives with errors that are not being quickly
 corrected or have firmware bugs (horrors!) and where the disk does not do TLER
 (or its vendor's equivalent).
  -- richard

the error counters look like this:


Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:       3494        0         0      3494     449045         30.879           0
write:         0        0         0         0      39111       1793.323           0
verify:        0        0         0         0       8133          0.000           0

the disk vendor is HGST, in case anyone has further ideas ... the system has
20 of these disks and the problems occurred with three of them. The system
had been running fine for two months previously.

Vendor:               HGST
Product:              HUS724030ALS640
Revision:             A152
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
Serial number:        P8J20SNV
Device type:          disk
Transport protocol:   SAS

cheers
tobi



-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
www.oetiker.ch t...@oetiker.ch +41 62 775 9902
*** We are hiring IT staff: www.oetiker.ch/jobs ***
___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss