Re: Bad Blocks in IDE software Raid 1

2003-04-30 Thread I. Forbes
Hello Russell 

On 18 Apr 2003 at 17:26, Russell Coker wrote:

 On Thu, 17 Apr 2003 18:48, I. Forbes wrote:

  Do you think there would be any benefit gained from burning in a
  new drive, perhaps by running fsck -c -c, in order to find marginal
  blocks and get them mapped out before the drive is put onto an array?
 
 Maybe.
 
  What about doing this on an array drive that has failed, before
  attempting to re-add it with raidhotadd?
 
 Generally such a burn-in won't provide any more benefit than just doing a 
 fresh raidhotadd, although it has worked once for me and is something to keep 
 in mind.

I tried this with a drive that had been faulted out of an array. I ran fsck 
-c -c on it before I ran raidhotadd. The drive is one that has given me 
trouble in the past.

It took a long time for the fsck to complete (about 24 hours), but the 
drive might not have had DMA active at the time.
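
For reference, DMA can be checked and switched on with hdparm before 
running a long test like this (the device name below is just a 
placeholder for the drive in question):

    # show whether DMA is currently enabled for the drive
    hdparm -d /dev/hdc
    # switch DMA on
    hdparm -d1 /dev/hdc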

In this instance it did not help. The drive has faulted out again after 
about a week's operation. It seems this device is on a slow, inevitable 
slide to total failure. I have done a raidhotadd again, but I think I 
must organize a new drive.
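
When the new drive arrives, I expect the swap will be roughly the 
following (md device and partition names are just placeholders for our 
setup, using the old raidtools commands):

    # copy the partition table from the surviving drive to the new one
    sfdisk -d /dev/hda | sfdisk /dev/hdc
    # drop the failed member (already marked faulty) and add the new one
    raidhotremove /dev/md0 /dev/hdc1
    raidhotadd /dev/md0 /dev/hdc1
    # watch the rebuild progress
    cat /proc/mdstat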

Regards

Ian
-
Ian Forbes ZSD
http://www.zsd.co.za
Office: +27 21 683-1388  Fax: +27 21 674-1106
Snail Mail: P.O. Box 46827, Glosderry, 7702, South Africa
-





Re: Bad Blocks in IDE software Raid 1

2003-04-18 Thread Russell Coker
On Thu, 17 Apr 2003 18:48, I. Forbes wrote:
 Am I correct in assuming that every time a bad block is discovered
 and remapped on a software raid1 system:

 - there is some data loss

I believe that if drive-0 in the array returns a read error then the data is 
read from drive-1 and there is no data loss.  Of course if the drive returns 
bad data and claims it to be good data then you are stuffed.

 - one of the drives is failed out of the array

Yes.
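
You can see which member was kicked out by looking at /proc/mdstat, for 
example:

    # a degraded RAID-1 shows something like [2/1] [U_], and the
    # failed member is listed with an (F) flag
    cat /proc/mdstat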

 I assume there are repeated attempts at reading the bad block, before
 the above actions are triggered.

Yes, this unfortunately causes things to block for a while...

 Hopefully these will trigger remapping
 at the firmware level before the above happens.

My experience is that IBM drives don't do this.  It could be done but would 
require more advanced drive firmware.

 Do you think there would be any benefit gained from burning in a
 new drive, perhaps by running fsck -c -c, in order to find marginal
 blocks and get them mapped out before the drive is put onto an array?

Maybe.

 What about doing this on an array drive that has failed, before
 attempting to re-add it with raidhotadd?

Generally such a burn-in won't provide any more benefit than just doing a 
fresh raidhotadd, although it has worked once for me and is something to keep 
in mind.
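
A burn-in pass can also be run with badblocks directly, something like the 
following (the partition name is only an example, and the -w test destroys 
all data on it):

    # non-destructive read-write test (what fsck -c -c runs underneath)
    badblocks -n -s -v /dev/hdc1
    # destructive write test, only for a disk with no data you care about
    badblocks -w -s -v /dev/hdc1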

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page




Re: Bad Blocks in IDE software Raid 1

2003-04-17 Thread I. Forbes
Hello Russell 

On 15 Apr 2003 at 20:21, Russell Coker wrote:

 If you do a write and something goes wrong then the data will be re-mapped.  I 
 don't know how many (if any) drives do read after write verification.  If 
 they don't then it's likely that an error will only be discovered some time 
 later when you want to read the data (and this can happen even if the data is 
 verified).
 
 Then the drive will return a read error.  If you then write to the bad block 
 the drive will usually perform a re-mapping and after that things will be 
 fine.
 
 If using software RAID then a raidhotadd operation will usually trigger a 
 re-mapping on the sector that caused the disk in question to be removed from 
 the array.

Am I correct in assuming that every time a bad block is discovered 
and remapped on a software raid1 system:

- there is some data loss

- one of the drives is failed out of the array

I assume there are repeated attempts at reading the bad block, before 
the above actions are triggered. Hopefully these will trigger remapping 
at the firmware level before the above happens.

Do you think there would be any benefit gained from burning in a 
new drive, perhaps by running fsck -c -c, in order to find marginal 
blocks and get them mapped out before the drive is put onto an array?

What about doing this on an array drive that has failed, before 
attempting to re-add it with raidhotadd?

Thanks

Ian
-
Ian Forbes ZSD
http://www.zsd.co.za
Office: +27 21 683-1388  Fax: +27 21 674-1106
Snail Mail: P.O. Box 46827, Glosderry, 7702, South Africa
-





Re: Bad Blocks in IDE software Raid 1

2003-04-15 Thread Russell Coker
On Tue, 15 Apr 2003 19:45, I. Forbes wrote:
 As far as I know, with modern IDE drives the formatted drive includes
 spare blocks, and the drive firmware will automatically re-map bad blocks,
 replacing them with blocks from the spare space. This all happens
 transparently, without any feedback to the system log files.

True.  The drive does that wherever possible.

If you do a write and something goes wrong then the data will be re-mapped.  I 
don't know how many (if any) drives do read after write verification.  If 
they don't then it's likely that an error will only be discovered some time 
later when you want to read the data (and this can happen even if the data is 
verified).

Then the drive will return a read error.  If you then write to the bad block 
the drive will usually perform a re-mapping and after that things will be 
fine.

If using software RAID then a raidhotadd operation will usually trigger a 
re-mapping on the sector that caused the disk in question to be removed from 
the array.
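
If you know the offending sector from the kernel log you can also force the 
re-mapping by hand by writing over just that sector, at the cost of whatever 
was stored there (the sector number and device here are made up):

    # overwrite one 512-byte sector so the drive firmware re-maps it
    dd if=/dev/zero of=/dev/hdc bs=512 seek=1234567 count=1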

 This would imply that bad blocks on one drive in an array are mapped
 out by the firmware, until a point is reached where there are no spare
 blocks on that drive. Further bad blocks would result in disk errors and
 the drive would be failed out of the array.

That should not happen for a long time.  You can use SMART to determine how 
many re-mapping events have occurred.  Expect to be able to remap at least 
1000 blocks before running out.
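
With the smartmontools package that is something like this (the device name 
is just an example):

    # dump all SMART attributes; Reallocated_Sector_Ct and
    # Reallocated_Event_Count show how much of the spare space is used
    smartctl -a /dev/hda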

 The ext2 file system also handles mapping out of bad blocks. These
 can be detected during the initial formatting of the drive, or during
 subsequent fsck runs.

True, although I've never detected bad blocks during fsck and I don't recall 
the last time I detected them during format (I haven't even done mkfs -c for 
years).
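
For reference, the checks in question are along these lines (the partition 
name is only an example, and the filesystem must be unmounted):

    # scan for bad blocks while creating the filesystem (read-only test)
    mke2fs -c /dev/hdc1
    # read-only bad block scan on an existing filesystem
    e2fsck -c /dev/hdc1
    # non-destructive read-write scan
    e2fsck -c -c /dev/hdc1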

 Can ext2 file systems actively map out bad blocks during normal
 operation?

I don't think so, and I don't think it's desirable with modern IDE and SCSI 
drives.

 Finally, if an ext2 filesystem is mounted on a Linux software raid1
 device, and a file system error occurs, will a portion of that device be
 mapped out as a bad block, or will one of the drives be failed out of
 the array?

One of the drives will be removed from the array and the file system drivers 
won't know the difference.

 If ext2 maps out a bad block, I assume the same block on both the
 good and bad drives gets mapped out.

True.
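
The filesystem's own bad block list can be inspected with dumpe2fs, e.g. 
(the md device name is a placeholder):

    # list the blocks ext2 has marked bad in this filesystem
    dumpe2fs -b /dev/md0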

 If one of the drives is failed it would explain why the failure rate on
 RAID drives seems higher than that in single-drive machines, i.e. RAID
 fails the drive, while in a single-drive machine ext2 carries on, hiding
 the problem from the end user, who is not watching the log files.

It won't be hidden.  It may even result in a kernel panic.  But you are 
correct that there are situations where software RAID will make errors more 
obvious; this is a good thing IMHO.

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page




Re: Bad Blocks in IDE software Raid 1

2003-04-15 Thread Adrian 'Dagurashibanipal' von Bidder
On Tuesday 15 April 2003 11:45, I. Forbes wrote:
 Hello All

 I have had a number of cases of disks reporting as failed on
 systems with IDE drives in a software RAID 1 configuration.

 I suppose the good news is you can change the drive with minimal
 downtime and no loss of data. But some of my customers are
 querying the apparent high failure rate.


Could it be that all failures are with a certain series of IBM disks? We've 
had a failure rate of 3 out of 10 disks within two or three months, all of 
them IBM and of the same series. (No, I can't remember which models; I don't 
work at that place anymore.) No RAID setups, just normal use in workstations.

It was discussed quite a bit in many places; I think there is a reference on 
the linux-kernel list to it actually being a firmware problem with those disks.

cheers
-- vbi

-- 
random link of the day: http://fortytwo.ch/sienapei/zafeigah

