On 21/03/13 12:28, Phil Kennedy wrote:
> On 3/20/2013 9:12 PM, Holger Parplies wrote:
>> I've had that happen (except that I noticed before a drive broke) at least
>> once, and I remember that Les has also. From what I remember of his
>> explanation (please correct me if I'm wrong), two physical disks concurrently
>> positioning their heads can disturb each other (through vibration) in such a
>> way that one of them returns a read or write error and is kicked out of the
>> array without the drive actually being in any way defective. I *would*
>> consider this a shortcoming of Linux software RAID-1.
>>
>> As Adam wrote, you can easily monitor that. It still is a nuisance, though.
> As an aside, I've seen drives in other BackupPC / software RAID
> instances fail for no good reason, to the point that they pass a long
> smartctl test, yet mdadm is still convinced that the drive is bad.
> Perhaps the vibration issue you've described was the culprit then?
This is perhaps getting a little off-topic for this list, but if you are interested in these issues, I would suggest the linux-raid list, which has a lot of very knowledgeable people with a lot to say about these sorts of problems.

As just one possible explanation: you are using "cheap" (desktop-class) drives without properly configuring them. That is, if the drive has a problem reading a sector, it will keep retrying the read, trying really hard for a long time. What usually happens is that the controller or Linux driver times out while waiting, asks the drive to reset, and so on; eventually it concludes the drive is not responding (because it is still busy retrying the problem sector), and the drive is kicked from the array as failed.

There are two ways to resolve this: tell the drive to give up on a failed read much more quickly (usually about 7 seconds or less), or tell Linux not to be so impatient and to wait much longer for the drive to return the failed read (a number of minutes). From memory, the first option works if the drive supports SCT ERC (Error Recovery Control). On "RAID" or "Enterprise" drives, the default is usually to time out a failed read within a few seconds, because the RAID can simply read that data from another drive.

Linux software RAID will notice the read failure and attempt to re-write the failed sector using data from the other drives. The sector will either re-write successfully, or be transparently relocated by the drive. Only if the write also fails is the drive kicked from the array.

Search for keywords like URE (Unrecoverable Read Error) or SCT ERC, or just check the linux-raid mailing list; an email about this issue comes up frequently. I've *never* had drives randomly kicked from an array except where either the above was happening, or there were SATA driver issues. In any case, with proper monitoring, this is almost a non-event.
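As a rough sketch of the two fixes described above (the device name /dev/sda and the timeout values are just examples; adjust for your system, and note that both settings reset on power cycle, so you'd normally reapply them from a boot script or udev rule):

```shell
# Option 1: if the drive supports SCT ERC, cap its internal error
# recovery at 7 seconds (the value is in tenths of a second), so a
# failed read is reported to Linux before the kernel's ~30s timeout.
smartctl -l scterc,70,70 /dev/sda

# Option 2: if the drive does NOT support SCT ERC (typical for
# desktop drives), do the opposite: raise the kernel's SCSI command
# timeout (in seconds) well above the drive's worst-case internal
# retry time, so the drive isn't kicked while it is still retrying.
echo 180 > /sys/block/sda/device/timeout
```

Running `smartctl -l scterc /dev/sda` with no values will report whether the drive supports ERC at all, which tells you which of the two options applies.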
I'm not suggesting this was your issue, nor anybody else's; just suggesting that this appears to be a much more common cause of perfectly good drives being randomly kicked from a RAID array than "vibration" issues.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List: https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/