Re: Help finding a hardware problem (I think)

Lincoln Bryant Wed, 24 Apr 2013 14:03:49 -0700

On Apr 24, 2013, at 2:39 PM, Konstantin Olchanski wrote:

> On Wed, Apr 24, 2013 at 01:27:19PM -0400, Jeff Siddall wrote:
>> On 04/23/2013 07:20 PM, Konstantin Olchanski wrote:
>>>> disk utility show ... SMART [is] fine.
>>>>
>>> SMART "health report" is useless. I had dead disks report "SMART OK" and 
>>> perfectly functional disks report "SMART Failure, replace your disk now".
>> 
>> Agreed.  SMART doesn't diagnose everything.
>> 
> 
> Raw data reported by SMART seems solid enough - hours of use, temperatures, 
> bad sector counts, etc.
> 
> But the "SMART overall-health self-assessment test result" is useless and
> for the purpose of predicting disk failure, all data reported by SMART is 
> useless.
> 
> Maybe one exception: when the number of bad sectors starts incrementing 
> rapidly, the disk often fails soon thereafter.
> 
> But more typically I see this scenario:
> in the morning - reading the email reports:
> smartctl reports increase of bad sectors
> disk is dropped from the raid array
> smartctl reports that the disk does not support SMART (its way of telling us 
> that the disk died)
> cat mdstat shows [U_] we are now running on the spare disk
> 
> In other words:
> - all disks will fail eventually
> - there is no reliable predictor for "your disk will fail in 7 days, rush to 
> newegg now!"
> - to prevent complete data loss, implement rsync to some other disk
> - to ensure uninterrupted operation, RAID all disks.
> 
> This is all in my experience. Your experience may be different, and if you 
> know a source for "this disk will never fail" disks, please let me know.
> 
> -- 
> Konstantin Olchanski
> Data Acquisition Systems: The Bytes Must Flow!
> Email: olchansk-at-triumf-dot-ca
> Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada


There is a well-known paper regarding Google's experience with SMART data: 
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf

They find a number of SMART parameters that are reasonably indicative of 
failure, including "Reallocated Sector", "Current Pending Sector", and "Offline 
Uncorrectable" counts. That said, IIRC, SMART only predicted failures around 
30% of the time.
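
If you want to watch those three counters yourself, here is a rough Python sketch that scans text in the layout printed by `smartctl -A` and reports which of them have a non-zero raw value. The attribute names (Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable) are the ones smartmontools commonly shows for ATA disks; that naming is an assumption, since some vendors label them differently. The function name and the sample rows are made up for illustration and do not come from a real disk.

```python
# smartctl attribute names assumed to correspond to the three counters above
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def nonzero_counters(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes whose raw value is > 0."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                found[fields[1]] = int(raw)
    return found

# Made-up sample rows in the layout of `smartctl -A`; not from a real disk
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8760
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
"""

print(nonzero_counters(SAMPLE))
```

A non-zero count here is only a hint to look closer, not a failure prediction, and a clean result says little given how many disks die without warning.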

--Lincoln
