Re: Bad Blocks in IDE software Raid 1
Hello Russell

On 18 Apr 2003 at 17:26, Russell Coker wrote:

> On Thu, 17 Apr 2003 18:48, I. Forbes wrote:
> > Do you think there would be any benefit gained from burning in a new
> > drive, perhaps by running fsck -c -c, in order to find marginal blocks
> > and get them mapped out before the drive is put onto an array?
>
> Maybe.
>
> > What about doing this on an array drive that has failed before
> > attempting to remount it with raidhotadd.
>
> Generally such a burn-in won't achieve any more benefit than just doing
> a new raidhotadd. Although it has worked once for me and is something
> to keep in mind.

I tried this with a drive that had been faulted out of an array. I ran
fsck -c -c on it before I ran raidhotadd. The drive is one that has
given me trouble in the past.

It took a long time for the fsck to complete (about 24 hours), but the
drive might not have had DMA active at the time.

In this instance it did not help. The drive faulted out again after about
a week's operation. It seems this device is on a slow, inevitable slide
to total failure. I have done a raidhotadd again, but I think I must
organize a new drive.

Regards

Ian

-
Ian Forbes    ZSD    http://www.zsd.co.za
Office: +27 21 683-1388    Fax: +27 21 674-1106
Snail Mail: P.O. Box 46827, Glosderry, 7702, South Africa
-
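For anyone wanting to reproduce the burn-in step discussed above: fsck's -c -c option invokes badblocks(8) to scan every block of the (unmounted) partition. The sketch below only simulates the read pass against an image file, so it can be run safely; /tmp/disk.img is a stand-in for a real partition such as /dev/hdc1, on which you would simply run `fsck -c -c /dev/hdc1`.

```shell
# Simulated surface scan: read every block the way badblocks (called by
# fsck -c -c) reads a device.  /tmp/disk.img stands in for a real
# partition; on real hardware a failed read here would be a bad sector.
dd if=/dev/zero of=/tmp/disk.img bs=4096 count=64 2>/dev/null

errors=0
blocks=$(( $(stat -c %s /tmp/disk.img) / 4096 ))
i=0
while [ "$i" -lt "$blocks" ]; do
    dd if=/tmp/disk.img of=/dev/null bs=4096 skip=$i count=1 2>/dev/null \
        || errors=$((errors + 1))
    i=$((i + 1))
done
echo "$blocks blocks scanned, $errors read errors"
```

On a healthy image this reports zero read errors; on a marginal drive, every error found (and rewritten by the second -c pass) is a chance for the firmware to re-map the sector before the drive goes back into the array.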
Re: Bad Blocks in IDE software Raid 1
On Thu, 17 Apr 2003 18:48, I. Forbes wrote:
> Am I correct in assuming that every time a bad block is discovered and
> remapped on a software raid1 system:
>
> - there is some data loss

I believe that if drive-0 in the array returns a read error then the
data is read from drive-1 and there is no data loss. Of course if the
drive returns bad data and claims it to be good data then you are
stuffed.

> - one of the drives is failed out of the array

Yes.

> I assume there are repeated attempts at reading the bad block, before
> the above actions are triggered.

Yes, this unfortunately causes things to block for a while...

> Hopefully these will trigger remapping at the firmware level before
> the above happens.

My experience is that IBM drives don't do this. It could be done but
would require more advanced drive firmware.

> Do you think there would be any benefit gained from burning in a new
> drive, perhaps by running fsck -c -c, in order to find marginal blocks
> and get them mapped out before the drive is put onto an array?

Maybe.

> What about doing this on an array drive that has failed before
> attempting to remount it with raidhotadd.

Generally such a burn-in won't achieve any more benefit than just doing
a new raidhotadd. Although it has worked once for me and is something to
keep in mind.

--
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page
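On the raidtools setups being discussed, a drive being failed out shows up in /proc/mdstat as an (F) device and a [U_] status marker. The snippet below checks a canned snapshot of that output (the device names and block counts are invented) the way a simple monitoring script might:

```shell
# Hypothetical /proc/mdstat snapshot after hdc1 has been failed out of
# a two-disk RAID-1; "[U_]" means only the first mirror is still up.
mdstat='md0 : active raid1 hdc1[1](F) hda1[0]
      4194240 blocks [2/1] [U_]'

if echo "$mdstat" | grep -q '\[U_\]'; then
    # After checking or replacing the disk, "raidhotadd /dev/md0
    # /dev/hdc1" would start a resync back onto the second mirror.
    status=degraded
else
    status=ok
fi
echo "md0 is $status"
```

A cron job doing this against the real /proc/mdstat is one way to avoid the "nobody was watching the log files" problem mentioned later in the thread.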
Re: Bad Blocks in IDE software Raid 1
Hello Russell

On 15 Apr 2003 at 20:21, Russell Coker wrote:

> If you do a write and something goes wrong then the data will be
> re-mapped. I don't know how many (if any) drives do read after write
> verification. If they don't then it's likely that an error will only be
> discovered some time later when you want to read the data (and this can
> happen even if the data is verified). Then the drive will return a read
> error. If you then write to the bad block the drive will usually
> perform a re-mapping and after that things will be fine. If using
> software RAID then a raidhotadd operation will usually trigger a
> re-mapping on the sector that caused the disk in question to be removed
> from the array.

Am I correct in assuming that every time a bad block is discovered and
remapped on a software raid1 system:

- there is some data loss
- one of the drives is failed out of the array

I assume there are repeated attempts at reading the bad block, before
the above actions are triggered. Hopefully these will trigger remapping
at the firmware level before the above happens.

Do you think there would be any benefit gained from burning in a new
drive, perhaps by running fsck -c -c, in order to find marginal blocks
and get them mapped out before the drive is put onto an array?

What about doing this on an array drive that has failed before
attempting to remount it with raidhotadd.

Thanks

Ian
Re: Bad Blocks in IDE software Raid 1
On Tue, 15 Apr 2003 19:45, I. Forbes wrote:
> As far as I know, with modern IDE drives the formatted drive includes
> spare blocks and the drive firmware will automatically re-map the drive
> to replace bad blocks with ones from the spare space. This all happens
> transparently without any feedback to the system log files.

True. The drive does that wherever possible.

If you do a write and something goes wrong then the data will be
re-mapped. I don't know how many (if any) drives do read after write
verification. If they don't then it's likely that an error will only be
discovered some time later when you want to read the data (and this can
happen even if the data is verified). Then the drive will return a read
error. If you then write to the bad block the drive will usually perform
a re-mapping and after that things will be fine. If using software RAID
then a raidhotadd operation will usually trigger a re-mapping on the
sector that caused the disk in question to be removed from the array.

> This would imply that bad blocks on one drive in an array are mapped
> out by the firmware, until a point is reached where there are no spare
> blocks on that drive. Further bad blocks would result in disk errors
> and the drive would be failed out of the array.

That should not happen for a long time. You can use SMART to determine
how many re-mapping events have occurred. Expect to be able to remap at
least 1000 blocks before running out.

> The ext2 file system also handles mapping out of bad blocks. These can
> be detected during the initial formatting of the drive, or during
> subsequent fsck runs.

True, although I've never detected bad blocks during fsck and I don't
recall the last time I detected them during format (I haven't even done
mkfs -c for years).

> Can ext2 file systems actively map out bad blocks during normal
> operation?

I don't think so, and I don't think it's desirable with modern IDE and
SCSI drives.
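The "write to the bad block to trigger a re-mapping" step above can be done by hand with dd once the kernel log has named the failing sector. The sector number and device in this sketch are hypothetical, and it is demonstrated against an image file because on a real drive the command destroys the 512 bytes at that sector:

```shell
# Overwrite one 512-byte sector, the way you would force drive firmware
# to re-map a known-bad sector.  Sector 12345 is hypothetical; on real
# hardware the target would be the whole device (e.g. /dev/hdc), with
# the sector number taken from the kernel's read-error message.
target=/tmp/fake-disk.img
rm -f "$target"
dd if=/dev/zero of="$target" bs=512 seek=12345 count=1 conv=notrunc 2>/dev/null
# The write lands at byte offset 12345 * 512, so the file now ends at
# (12345 + 1) * 512 = 6321152 bytes.
stat -c %s "$target"
```

conv=notrunc matters on a real device node as well as here: it keeps dd from truncating anything, so only the one targeted sector is written.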
> Finally, if an ext2 filesystem is mounted on a Linux software raid1
> device, and a file system error occurs, will a portion of that device
> be mapped out as a bad block, or will one of the drives be failed out
> of the array?

One of the drives will be removed from the array and the file system
drivers won't know the difference.

> If ext2 maps out a bad block, I assume the same block on both the good
> and bad drives gets mapped out.

True.

> If one of the drives is failed it would explain why the failure rate on
> raid drives seems higher than that in single drive machines. ie Raid
> fails the drive, while in a single drive machine ext2 carries on,
> hiding the problem from the end user who is not watching the log files.

It won't be hidden. It may even result in a kernel panic. But you are
correct that there are situations where software RAID will make errors
more obvious, which is a good thing IMHO.
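The SMART check mentioned above can be done with smartmontools' smartctl; its -A output lists the drive's attribute table, and the raw value of Reallocated_Sector_Ct is the number of sectors the firmware has re-mapped so far. The sketch parses one canned row of that table (the figures are invented) rather than talking to a real drive:

```shell
# One hypothetical row of `smartctl -A /dev/hda` output; the last
# column is the raw value, i.e. how many sectors have been re-mapped.
row='  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       12'
remapped=$(echo "$row" | awk '$2 == "Reallocated_Sector_Ct" { print $10 }')
echo "sectors re-mapped so far: $remapped"
```

Watching this number over time gives early warning: a drive whose reallocated count climbs steadily is on the same "slow slide to total failure" described elsewhere in this thread, even if it has not yet been faulted out of the array.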
Re: Bad Blocks in IDE software Raid 1
On Tuesday 15 April 2003 11:45, I. Forbes wrote:
> Hello All
>
> I have had a number of cases with disks reporting as failed on systems
> with IDE drives in software RAID 1 configuration.
>
> I suppose the good news is you can change the drive with minimal
> downtime and no loss of data. But some of my customers are querying the
> apparent high failure rate.

Could it be that all the failures are with a certain series of IBM
disks? We had a failure rate of 3 out of 10 disks within two or three
months, all of them IBM and of the same series. (No, I can't remember
which models; I don't work at that place anymore.) No RAID setups, just
normal use in workstations.

It was discussed quite a bit in many places; I think there is a
reference on the linux-kernel list saying it was actually a firmware
problem of those disks.

cheers
-- vbi

--
random link of the day: http://fortytwo.ch/sienapei/zafeigah