Justin Moore wrote:
> As mentioned in an earlier e-mail (I think) there were 4 SMART variables
> whose values were strongly correlated with failure, and another 4-6 that
> were weakly correlated with failure. However, of all the disks that
> failed, less than half (around 45%) had ANY of the "strong" signals and
> another 25% had some of the "weak" signals. This means that over a
> third of disks that failed gave no appreciable warning. Therefore even
> combining the variables would give no better than a 70% chance of
> predicting failure.
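
The 70% ceiling in the quoted figures is simple addition, assuming the "strong" and "weak" groups do not overlap (which "another 25%" suggests). A minimal check, using only the rounded percentages as quoted; with those numbers the remainder comes out near 30%, a little under a third:

```python
# Figures quoted above, as percentages of all failed disks.
strong_pct = 45   # had at least one "strong" signal
weak_pct = 25     # had only "weak" signals ("another 25%")

# Assumption: the two groups are disjoint, so the percentages simply add.
best_case_pct = strong_pct + weak_pct
no_warning_pct = 100 - best_case_pct

print(f"upper bound on predictable failures: {best_case_pct}%")
print(f"failures with no appreciable warning: {no_warning_pct}%")
```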

Now we need to know exactly how you defined "failed". Presumably AFTER you have determined that a disk has failed, various SMART parameters have very high values. As you say, before failure there are SMART indicators but no clear trend. What separates one set of SMART values (indicator) from the other (failed)?

Is it possible that more frequent monitoring of SMART variables could catch the early failure (chest pains, so to speak) before the total failure (fatal heart failure)? This might give a few more seconds or minutes of warning before disk failure, possibly enough time for a node to indicate that it is about to fail and shut down, especially if it can do so without writing much to the disk. Admittedly, this would not be nearly as useful as knowing that a disk will fail in a week!

Disks that just stop spinning or won't spin back up (motor/spindle failure) are another problem, one that presumably cannot be detected by SMART. However, this mode of failure is usually seen only in DOA disks and very old disks. What fraction of the failed disks failed this way?

Were there postmortem analyses of the power supplies in the failed systems? It wouldn't surprise me if low or noisy power lines led to an increased rate of disk failure. SMART wouldn't give this information (at least, not on any of the disks I have), but lm_sensors would.

Thanks,

David Mathog
[EMAIL PROTECTED]
Manager, Sequence Analysis Facility, Biology Division, Caltech
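
The more frequent monitoring suggested above could be sketched roughly as follows. This is only an illustration, assuming smartmontools is installed and that `smartctl -A` prints its usual attribute table. The watch list is hypothetical, since the thread does not name the "strong" variables, and `parse_smart_attributes`, `changed_attributes`, and `watch_disk` are names made up for this sketch. Whether a drive updates its attributes quickly enough for short polling intervals to help is untested here.

```python
import subprocess
import time

# Hypothetical watch list: the thread does not name the "strong" variables,
# so these attribute names are placeholders chosen for illustration only.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header and banner lines do not.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                pass  # skip raw values that are not plain integers
    return attrs


def changed_attributes(prev, curr, watched=WATCHED):
    """Return {name: (old, new)} for watched attributes whose raw value changed."""
    return {
        name: (prev[name], curr[name])
        for name in watched
        if name in prev and name in curr and prev[name] != curr[name]
    }


def watch_disk(device, interval_s=5.0, on_change=print):
    """Poll `smartctl -A` forever and report changes in watched attributes."""
    prev = None
    while True:
        # smartctl uses its exit status as a bit mask, so a nonzero status
        # is not treated as a fatal error here.
        result = subprocess.run(
            ["smartctl", "-A", device], capture_output=True, text=True
        )
        curr = parse_smart_attributes(result.stdout)
        if prev is not None:
            delta = changed_attributes(prev, curr)
            if delta:
                on_change(delta)  # e.g. flag the node and begin shutdown
        prev = curr
        time.sleep(interval_s)
```

Something like `watch_disk("/dev/sda", interval_s=2.0)` would start the loop (usually as root). Keeping the state in memory and reporting through a callback avoids writing to the disk that is being watched.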
