On Apr 24, 2013, at 2:39 PM, Konstantin Olchanski wrote:

> On Wed, Apr 24, 2013 at 01:27:19PM -0400, Jeff Siddall wrote:
>> On 04/23/2013 07:20 PM, Konstantin Olchanski wrote:
>>>> disk utility show ... SMART [is] fine.
>>>>
>>> SMART "health report" is useless. I had dead disks report "SMART OK" and
>>> perfectly functional disks report "SMART Failure, replace your disk now".
>>
>> Agreed. SMART doesn't diagnose everything.
>>
>
> Raw data reported by SMART seems solid enough - hours of use, temperatures,
> bad sector counts, etc.
>
> But the "SMART overall-health self-assessment test result" is useless and
> for the purpose of predicting disk failure, all data reported by SMART is
> useless.
>
> Maybe one exception: when the number of bad sectors starts incrementing
> rapidly, the disk often fails soon thereafter.
>
> But more typically I see this scenario:
> in the morning - reading the email reports:
> smartctl reports increase of bad sectors
> disk is dropped from the raid array
> smartctl reports that the disk does not support smart (its way of telling us
> that the disk died)
> cat mdstat shows [U_] we are now running on the spare disk
>
> In other words:
> - all disks will fail eventually
> - there is no reliable predictor for "your disk will fail in 7 days, rush to
>   newegg now!",
> - to prevent complete data loss, implement rsync to some other disk
> - to ensure uninterrupted operation, raid all disks.
>
> This is all in my experience. Your experience may be different and if you
> know a source for "this disk will never fail" disks, please let me know.
>
> --
> Konstantin Olchanski
> Data Acquisition Systems: The Bytes Must Flow!
> Email: olchansk-at-triumf-dot-ca
> Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada
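
The scenario quoted above ends with mdstat showing [U_]. As a rough sketch
(not anything posted in this thread), that check could be scripted along the
following lines. The function names are hypothetical, and the parsing assumes
the usual /proc/mdstat layout in which each array's status line ends in a
member map such as [UU] or [U_]:

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map contains '_'.

    Assumes the common /proc/mdstat layout: an 'mdN : ...' line followed
    by a status line carrying a map such as [UU] (healthy) or [U_]
    (one member missing).
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        member_map = re.search(r"\[([U_]+)\]", line)
        if member_map and current and "_" in member_map.group(1):
            degraded.append(current)
    return degraded


def read_mdstat(path="/proc/mdstat"):
    """Read the kernel's md status file (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Calling degraded_arrays(read_mdstat()) from a cron job and mailing any
non-empty result would give roughly the morning report described above.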

There is a well-known paper regarding Google's experience with SMART data:

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf

They find a number of SMART parameters that are reasonably indicative of
failure, including "Reallocated Sector", "Current Pending Sector", and
"Offline Uncorrectable" counts.

That said, IIRC, SMART only predicted failures around 30% of the time.

--Lincoln
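
For anyone who wants to watch the three counters named above, here is a
minimal sketch. It assumes an ATA disk, the attribute table printed by
`smartctl -A`, and the conventional attribute IDs 5, 197 and 198 for these
counters (vendors can differ, so check your own drive's output). The function
names are made up for illustration:

```python
import subprocess

# Conventional ATA attribute IDs for the counters discussed above.
# This mapping is an assumption; some vendors number attributes differently.
WATCHED = {
    5: "Reallocated Sector",
    197: "Current Pending Sector",
    198: "Offline Uncorrectable",
}


def parse_watched_counters(smartctl_output):
    """Extract raw values of the watched attributes from `smartctl -A` text."""
    counters = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have the raw value
        # in the tenth column.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in WATCHED:
            try:
                counters[WATCHED[attr_id]] = int(fields[9])
            except ValueError:
                pass  # raw value is not a plain integer; skip it
    return counters


def increased(previous, current):
    """Return {name: (old, new)} for counters that grew since the last run."""
    return {
        name: (previous.get(name, 0), value)
        for name, value in current.items()
        if value > previous.get(name, 0)
    }


def read_counters(device):
    """Run smartctl on a device such as /dev/sda (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_watched_counters(result.stdout)
```

Storing the result of read_counters() from each run and comparing it with
increased() flags growth in these counters. That is the signal both messages
treat as meaningful, without relying on the overall health verdict.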