Hi Richard,

Mar 23 Richard Elling wrote:
>
> On Mar 21, 2014, at 10:13 PM, Tobias Oetiker <t...@oetiker.ch> wrote:
>
> > Yesterday Richard Elling wrote:
> >
> >>
> >> On Mar 21, 2014, at 3:23 PM, Tobias Oetiker <t...@oetiker.ch> wrote:
> >
> > [...]
> >>>
> >>> it happened over time as you can see from the timestamps in the
> >>> log. The errors from zfs's point of view were 1 read and about 30 write
> >>>
> >>> but according to smart the disks are without flaw
> >>
> >> Actually, SMART is pretty dumb. In most cases, it only looks for
> >> uncorrectable errors that are related to media or heads. For a clue to
> >> more permanent errors, you will want to look at the read/write error
> >> reports for errors that are corrected with possible delays. You can also
> >> look at the grown defects list.
> >>
> >> This behaviour is expected for drives with errors that are not being
> >> quickly corrected or have firmware bugs (horrors!) and where the disk
> >> does not do TLER (or its vendor's equivalent)
> >> -- richard
> >
> > the error counters look like this:
> >
> >
> > Error counter log:
> >            Errors Corrected by           Total   Correction     Gigabytes    Total
> >                ECC          rereads/    errors   algorithm      processed    uncorrected
> >            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> > read:       3494        0         0      3494      44904        530.879           0
> > write:         0        0         0         0      39111       1793.323           0
> > verify:        0        0         0         0       8133          0.000           0
>
> Errors corrected without delay looks good. The problem lies elsewhere.
>
> >
> > the disk vendor is HGST in case anyone has further ideas ... the system has
> > 20 of these disks and the problems occurred with
> > three of them. The system has been running fine for two months previously.
>
> ...and yet there are aborted commands, likely due to a reset after a timeout.
> Resets aren't issued without cause.
>
> There are two different resets issued by the sd driver: LU and bus. If the
> LU reset doesn't work, the resets are escalated to bus. This is, of course,
> tunable, but is rarely tuned. A bus reset for SAS is a questionable practice,
> since SAS is a fabric, not a bus. But the effect of a device in the fabric
> being reset could be seen as aborted commands by more than one target. To
> troubleshoot these cases, you need to look at all of the devices in the data
> path and map the common causes: HBAs, expanders, enclosures, etc. Traverse
> the devices looking for errors, as you did with the disks. Useful tools:
> sasinfo, lsiutil/sas2ircu, smp_utils, sg3_utils, mpathadm, fmtopo.

thanks for the hints ... after detaching/attaching the 'failed'
disks, they got resilvered and a subsequent scrub did not detect
any errors ...

all a bit mysterious ...

will keep an eye on the box to see how it fares in the future ...

cheers
tobi

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
www.oetiker.ch t...@oetiker.ch +41 62 775 9902
*** We are hiring IT staff: www.oetiker.ch/jobs ***
_______________________________________________
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss