beginner error detection

2007-02-23 Thread Tomka Gergely
Hi!

I have a simple SW RAID1 over two SATA disks. One of the disks has started to 
complain (S.M.A.R.T. errors), so I expect to witness a disk failure in the near 
future. But I don't know how this plays out with RAID1, so I have some 
questions. If these questions are answered somewhere (FAQ, manpage, URL), then 
feel free to redirect me to that source.

Can RAID1 detect and handle disk errors? If one block goes wrong, how does 
the RAID1 driver choose which copy holds the correct, original value?

Can SATA systems die gracefully? When a total disk failure happens in a good 
SCSI system, the SCSI layer fails the disk, mdadm removes it from the array, 
and in the morning I see a nice e-mail. When a PATA disk dies, the whole 
system goes down, so I need to call a cab. I don't know how SATA behaves in 
this situation.

The kernel is 2.6.20 (with the skas patch); the controller is:
nVidia Corporation CK804 Serial ATA Controller (rev f3)

Thanks.

-- 
Tomka Gergely, [EMAIL PROTECTED]


Re: Reshaping raid0/10

2007-02-23 Thread Jan Engelhardt

On Feb 22 2007 06:59, Neil Brown wrote:
On Wednesday February 21, [EMAIL PROTECTED] wrote:
 
 are there any plans to support reshaping
 on raid0 and raid10?
 

No concrete plans.  It largely depends on time and motivation.
I expect that the various flavours of raid5/raid6 reshape will come
first.
Then probably converting raid0->raid5.

I really haven't given any thought to how you might reshape a
raid10...

It should not be any different from raid0/raid5 reshaping, should it?



Jan


2.6.20: stripe_cache_size goes boom with 32mb

2007-02-23 Thread Justin Piszcz
Each of these results is averaged over three runs with 6 SATA disks in a SW 
RAID 5 configuration:


(dd if=/dev/zero of=file_1 bs=1M count=2000)

128k_stripe: 69.2MB/s
256k_stripe: 105.3MB/s
512k_stripe: 142.0MB/s
1024k_stripe: 144.6MB/s
2048k_stripe: 208.3MB/s
4096k_stripe: 223.6MB/s
8192k_stripe: 226.0MB/s
16384k_stripe: 215.0MB/s
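
Each size is set through the array's sysfs file, e.g.:

echo 8192 > /sys/block/md4/md/stripe_cache_size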

When I tried a 32768k stripe, this happened:
p34:~# echo 32768 > /sys/block/md4/md/stripe_cache_size
Connection to p34 closed

I was able to Alt-SysRq+B, but I could not access the console, X, etc.; the 
machine appeared to be frozen.


FYI.

Justin.



Re: [PATCH 006 of 6] md: Add support for reshape of a raid6

2007-02-23 Thread Helge Hafting

Andrew Morton wrote:

On Thu, 22 Feb 2007 13:39:56 +1100 Neil Brown [EMAIL PROTECTED] wrote:

  

I must right code that Andrew can read.



That's write.

But more importantly, things that people can immediately see and understand
help reduce the possibility of mistakes.  Now and in the future.

If we did all loops like that, then it'd be the best way to do it in new code,
because people's eyes and brains are locked into that idiom and we just
don't have to think about it when we see it.

I have done lots of loops like that and understood it immediately.
Nice, short, _clear_, and no, a loop that counts down instead of
up is not difficult at all.
Testing i-- instead of i >= 0 is also something I consider trivial,
even though I don't code that much.  If this is among the worst you
see, then the kernel source must be in great shape ;-)

Helge Hafting


Re: 2.6.20: stripe_cache_size goes boom with 32mb

2007-02-23 Thread Justin Piszcz
I have 2GB on this machine.  For me, 8192 seems to be the sweet spot; I 
will probably keep it at 8mb.


On Fri, 23 Feb 2007, Jason Rainforest wrote:


Hi Justin,

I'm not a RAID or kernel developer, but... do you have enough RAM to
support a 32mb stripe_cache_size?! Here on my 7*250Gb SW RAID5 array,
decreasing stripe_cache_size from 8192 to 4096 frees up no less than
120mb of RAM. Using that as a calculation tool, a 32mb stripe_cache_size
would require approximately 960mb of RAM! My RAID box only has 1Gb of
RAM, so I'm not game to test such a thing. Others on these lists would
definitely have a good idea on what's happening :-)

Cheers,
Jason


On Fri, 2007-02-23 at 06:41 -0500, Justin Piszcz wrote:

Each of these are averaged over three runs with 6 SATA disks in a SW RAID
5 configuration:

(dd if=/dev/zero of=file_1 bs=1M count=2000)

128k_stripe: 69.2MB/s
256k_stripe: 105.3MB/s
512k_stripe: 142.0MB/s
1024k_stripe: 144.6MB/s
2048k_stripe: 208.3MB/s
4096k_stripe: 223.6MB/s
8192k_stripe: 226.0MB/s
16384k_stripe: 215.0MB/s

When I tried a 32768k stripe, this happened:
p34:~# echo 32768 > /sys/block/md4/md/stripe_cache_size
Connection to p34 closed

I was able to Alt-SysRQ+b but I could not access the console/X/etc, it
appeared to be frozen.

FYI.

Justin.



Re: 2.6.20: stripe_cache_size goes boom with 32mb

2007-02-23 Thread Jason Rainforest
Hi Justin,

I'm not a RAID or kernel developer, but... do you have enough RAM to
support a 32mb stripe_cache_size?! Here on my 7*250Gb SW RAID5 array,
decreasing stripe_cache_size from 8192 to 4096 frees up no less than
120mb of RAM. Using that as a calculation tool, a 32mb stripe_cache_size
would require approximately 960mb of RAM! My RAID box only has 1Gb of
RAM, so I'm not game to test such a thing. Others on these lists would
definitely have a good idea on what's happening :-)

Cheers,
Jason


On Fri, 2007-02-23 at 06:41 -0500, Justin Piszcz wrote:
 Each of these are averaged over three runs with 6 SATA disks in a SW RAID 
 5 configuration:
 
 (dd if=/dev/zero of=file_1 bs=1M count=2000)
 
 128k_stripe: 69.2MB/s
 256k_stripe: 105.3MB/s
 512k_stripe: 142.0MB/s
 1024k_stripe: 144.6MB/s
 2048k_stripe: 208.3MB/s
 4096k_stripe: 223.6MB/s
 8192k_stripe: 226.0MB/s
 16384k_stripe: 215.0MB/s
 
 When I tried a 32768k stripe, this happened:
 p34:~# echo 32768 > /sys/block/md4/md/stripe_cache_size
 Connection to p34 closed
 
 I was able to Alt-SysRQ+b but I could not access the console/X/etc, it 
 appeared to be frozen.
 
 FYI.
 
 Justin.
 


Re: 2.6.20: stripe_cache_size goes boom with 32mb

2007-02-23 Thread Jan Engelhardt

On Feb 23 2007 06:41, Justin Piszcz wrote:

 I was able to Alt-SysRQ+b but I could not access the console/X/etc, it 
 appeared
 to be frozen.

No sysrq+t? (Ah, unblanking might hang.) Well, netconsole/serial to the rescue,
then ;-)
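
For example, netconsole can be loaded along these lines (the addresses and
MAC here are only placeholders), with something listening for UDP on the
receiving box:

modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/01:23:45:67:89:ab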


Jan


Re: PATA/SATA Disk Reliability paper

2007-02-23 Thread Al Boldi
Stephen C Woods wrote:
   So drives do need to be ventilated, not so much worry about exploding,
 but rather subtle distortion of the case as the atmospheric pressure
 changed.

I have a '94 Caviar without any apparent holes; and as a bonus, the drive 
still works.

In contrast, ever since these holes appeared, drive failures have become the norm.

Does anyone remember that you had to let your drives acclimate to your
 machine room for a day or so before you used them?

The problem is, that's not enough; the room temperature/humidity has to be 
controlled too.  In a desktop environment, that's not really feasible.


Thanks!

--
Al



Linux Software RAID a bit of a weakness?

2007-02-23 Thread Colin Simpson
Hi, 

We had a small server here that was configured with a RAID 1 mirror,
using two IDE disks. 

Last week one of the drives in it failed, so we replaced the drive and
set the array to rebuild. The good disk then found a bad block and the
mirror failed.

Now I presume that the good disk must have had an underlying bad block
in either unallocated space or a file I never access. Since RAID works
at the block level, you only ever see this on an array rebuild, when it's
often catastrophic. Is this a bit of a flaw?

I know there is a definite probability of two drives failing within a
short period of time. But this is a bit different, as it's the
probability of two drives failing over a much larger time scale if
one of the flaws is hidden in unallocated space (maybe a dirt particle
finds its way onto the surface or something). This would make RAID buy
you a lot less in reliability, I'd have thought.

I seem to remember seeing in the log file for a Dell PERC something
about scavenging for bad blocks. Do hardware RAID systems have a
mechanism that searches the disks for bad blocks at times of low
activity, to help guard against this sort of failure (so a disk error is
reported early)?

On software RAID, I was thinking, apart from a three-way mirror (which I
don't think is supported at present): is there any merit in, say, cat'ing
the whole disk devices to /dev/null every so often to check that the
whole surface is readable? (I presume just reading the raw device won't
upset things; don't worry, I don't plan on trying it on a production
system.)
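
For example, something along these lines, run against each member disk
(device name purely illustrative):

dd if=/dev/hda of=/dev/null bs=1M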

Any thoughts? I presume people have thought of this before and I must
be missing something.

Colin






Re: Linux Software RAID a bit of a weakness?

2007-02-23 Thread Steve Cousins

Colin Simpson wrote:
Hi, 


We had a small server here that was configured with a RAID 1 mirror,
using two IDE disks. 


Last week one of the drives failed in this. So we replaced the drive and
set the array to rebuild. The good disk then found a bad block and the
mirror failed.

Now I presume that the good disk must have had an underlying bad block
in either unallocated space or a file I never access. Now as RAID works
at the block level you only ever see this on an array rebuild when it's
often catastrophic. Is this a bit of a flaw? 


I know there is the definite probability of two drives failing within a
short period of time. But this is a bit different as it's the
probability of two drives failing but over a much larger time scale if
one of the flaws is hidden in unallocated space (maybe a dirt particle
finds it's way onto the surface or something). This would make RAID buy
you a lot less in reliability, I'd have thought. 


I seem to remember seeing in the log file for a Dell perc something
about scavenging for bad blocks. Do hardware RAID systems have a
mechanism that at times of low activity search the disks for bad blocks
to help guard against this sort of failure (so a disk error is reported
early)?

On Software RAID, I was thinking apart from a three way mirror, which I
don't think is at present supported. Is there any merit in say, cat'ing
the whole disk devices to /dev/null every so often to check that the
whole surface is readable (I presume just reading the raw device won't
upset thing, don't worry I don't plan on trying it on a production
system). 


Any thoughts? As I presume people have thought of this before and I must
be missing something.


Yes, this is an important thing to keep on top of, both for hardware 
RAID and software RAID.  For md:


echo check > /sys/block/md0/md/sync_action

This should be done regularly. I have cron do it once a week.
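
For example, a root crontab entry along these lines (day and time are
arbitrary, and md0 is just the array used above):

30 4 * * 0 echo check > /sys/block/md0/md/sync_action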

Check out: http://neil.brown.name/blog/20050727141521-002

Good luck,

Steve
--
__
 Steve Cousins, Ocean Modeling GroupEmail: [EMAIL PROTECTED]
 Marine Sciences, 452 Aubert Hall   http://rocky.umeoce.maine.edu
 Univ. of Maine, Orono, ME 04469Phone: (207) 581-4302




Re: Reshaping raid0/10

2007-02-23 Thread Neil Brown
On Friday February 23, [EMAIL PROTECTED] wrote:
 
 On Feb 22 2007 06:59, Neil Brown wrote:
 On Wednesday February 21, [EMAIL PROTECTED] wrote:
  
  are there any plans to support reshaping
  on raid0 and raid10?
  
 
 No concrete plans.  It largely depends on time and motivation.
 I expect that the various flavours of raid5/raid6 reshape will come
 first.
 Then probably converting raid0->raid5.
 
 I really haven't given any thought to how you might reshape a
 raid10...
 
 It should not be any different from raid0/raid5 reshaping, should it?

Depends on what level you look at.

If I wanted to reshape a raid0, I would just morph it into a raid4
with a missing parity drive, then use the raid5 code to restripe it.
Then morph it back to regular raid0.

With raid10 I cannot do that.  I would need to do the restriping
inside the raid10 module.  But raid10 doesn't have a stripe-cache like
raid5 does, and the stripe cache is a very integral part of the
restripe process.

So there would be a substantial amount of design and coding to effect
a raid10 reshape - at least as much as the work to produce the initial
raid5 reshape and probably more.

So conceptually it might be very similar, but at the code level, it is
likely to be very different.

NeilBrown


Re: Linux Software RAID a bit of a weakness?

2007-02-23 Thread Neil Brown
On Friday February 23, [EMAIL PROTECTED] wrote:
 Hi, 
 
 We had a small server here that was configured with a RAID 1 mirror,
 using two IDE disks. 
 
 Last week one of the drives failed in this. So we replaced the drive and
 set the array to rebuild. The good disk then found a bad block and the
 mirror failed.
 
 Now I presume that the good disk must have had an underlying bad block
 in either unallocated space or a file I never access. Now as RAID works
 at the block level you only ever see this on an array rebuild when it's
 often catastrophic. Is this a bit of a flaw? 

Certainly can be unfortunate.

 
 I know there is the definite probability of two drives failing within a
 short period of time. But this is a bit different as it's the
 probability of two drives failing but over a much larger time scale if
 one of the flaws is hidden in unallocated space (maybe a dirt particle
 finds it's way onto the surface or something). This would make RAID buy
 you a lot less in reliability, I'd have thought. 
 
 I seem to remember seeing in the log file for a Dell perc something
 about scavenging for bad blocks. Do hardware RAID systems have a
 mechanism that at times of low activity search the disks for bad blocks
 to help guard against this sort of failure (so a disk error is reported
 early)?
 

As has been mentioned, this can be done with md/raid too.  Some
distros (debian/testing at least) schedule a 'check' of all arrays
once a month.

 On Software RAID, I was thinking apart from a three way mirror, which I
 don't think is at present supported. Is there any merit in say, cat'ing
 the whole disk devices to /dev/null every so often to check that the
 whole surface is readable (I presume just reading the raw device won't
 upset thing, don't worry I don't plan on trying it on a production
 system). 

Three-way mirroring has always been supported.  You can do N way
mirroring if you have N drives.
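
For example (device names purely illustrative):

mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1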

Reading the whole device would not be sufficient as it would only read
one copy of every block rather than all copies.
The 'check' process reads all copies and compares them with one
another.  If there is a difference it is reported.  If you use
'repair' instead of 'check', the difference is arbitrarily corrected.
If a read error is detected during the 'check', md/raid1 will attempt
to write the data from the good drive to the bad drive, then read it
back.  If this works, the drive is assumed to be fixed.  If not, the
bad drive is failed out of the array.
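
In sysfs terms (assuming the array is md0), that means something like:

echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt
echo repair > /sys/block/md0/md/sync_action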

NeilBrown


Re: 2.6.20: stripe_cache_size goes boom with 32mb

2007-02-23 Thread Dan Williams

On 2/23/07, Justin Piszcz [EMAIL PROTECTED] wrote:

I have 2GB On this machine.  For me, 8192 seems to be the sweet spot, I
will probably keep it at 8mb.


Just a note: stripe_cache_size = 8192 = 192MB with six disks.

The calculation is:
stripe_cache_size * num_disks * PAGE_SIZE = stripe_cache_size_bytes
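
For example, with the six disks above and assuming 4K pages:

8192  * 6 * 4096 = 201326592 bytes (~192 MB)
32768 * 6 * 4096 = 805306368 bytes (~768 MB)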

--
Dan


nonzero mismatch_cnt with no earlier error

2007-02-23 Thread Eyal Lebedinsky
I run a 'check' weekly, and yesterday it came up with a non-zero
mismatch count (184). There were no earlier RAID errors logged
and the count was zero after the run a week ago.

Now, the interesting part is that there was one I/O error logged
during the check *last week*; however, the RAID did not see it and
the count was zero at the end. No errors were logged during the
week since, or during the check last night.

fsck (ext3 with logging) found no errors but I may have bad data
somewhere.

Should the raid have noticed the error, checked the offending
stripe and taken appropriate action? The messages from that error
are below.

Naturally, I do not know if the mismatch is related to the failure
last week; it could be due to a number of other reasons (bad memory?
kernel bug?).


system details:
  2.6.20 vanilla
  /dev/sd[ab]: on motherboard
IDE interface: Intel Corp. 82801EB (ICH5) Serial ATA 150 Storage Controller 
(rev 02)
  /dev/sd[cdef]: Promise SATA-II-150-TX4
Unknown mass storage controller: Promise Technology, Inc.: Unknown device 
3d18 (rev 02)
  All 6 disks are WD 320GB SATA of similar models

Tail of dmesg, showing all messages since last week 'check':

*** last week check start:
[927080.617744] md: data-check of RAID array md0
[927080.630783] md: minimum _guaranteed_  speed: 24000 KB/sec/disk.
[927080.648734] md: using maximum available idle IO bandwidth (but not more 
than 20 KB/sec) for data-check.
[927080.678103] md: using 128k window, over a total of 312568576 blocks.
*** last week error:
[937567.332751] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4190002 action 0x2
[937567.354094] ata3.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 
data 512 in
[937567.354096]  res 51/04:83:45:00:00/00:00:00:00:00/a0 Emask 0x10 
(ATA bus error)
[937568.120783] ata3: soft resetting port
[937568.282450] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[937568.306693] ata3.00: configured for UDMA/100
[937568.319733] ata3: EH complete
[937568.361223] SCSI device sdc: 625142448 512-byte hdwr sectors (320073 MB)
[937568.397207] sdc: Write Protect is off
[937568.408620] sdc: Mode Sense: 00 3a 00 00
[937568.453522] SCSI device sdc: write cache: enabled, read cache: enabled, 
doesn't support DPO or FUA
*** last week check end:
[941696.843935] md: md0: data-check done.
[941697.246454] RAID5 conf printout:
[941697.256366]  --- rd:6 wd:6
[941697.264718]  disk 0, o:1, dev:sda1
[941697.275146]  disk 1, o:1, dev:sdb1
[941697.285575]  disk 2, o:1, dev:sdc1
[941697.296003]  disk 3, o:1, dev:sdd1
[941697.306432]  disk 4, o:1, dev:sde1
[941697.316862]  disk 5, o:1, dev:sdf1
*** this week check start:
[1530647.746383] md: data-check of RAID array md0
[1530647.759677] md: minimum _guaranteed_  speed: 24000 KB/sec/disk.
[1530647.778041] md: using maximum available idle IO bandwidth (but not more 
than 20 KB/sec) for data-check.
[1530647.807663] md: using 128k window, over a total of 312568576 blocks.
*** this week check end:
[1545248.680745] md: md0: data-check done.
[1545249.266727] RAID5 conf printout:
[1545249.276930]  --- rd:6 wd:6
[1545249.285542]  disk 0, o:1, dev:sda1
[1545249.296228]  disk 1, o:1, dev:sdb1
[1545249.306923]  disk 2, o:1, dev:sdc1
[1545249.317613]  disk 3, o:1, dev:sdd1
[1545249.328292]  disk 4, o:1, dev:sde1
[1545249.338981]  disk 5, o:1, dev:sdf1

-- 
Eyal Lebedinsky ([EMAIL PROTECTED]) http://samba.org/eyal/
attach .zip as .dat


Re: Linux Software RAID a bit of a weakness?

2007-02-23 Thread Richard Scobie

Neil Brown wrote:


The 'check' process reads all copies and compares them with one
another.  If there is a difference it is reported.  If you use
'repair' instead of 'check', the difference is arbitrarily corrected.
If a read error is detected during the 'check', md/raid1 will attempt
to write the data from the good drive to the bad drive, then read it
back.  If this works, the drive is assumed to be fixed.  If not, the
bad drive is failed out of the array.



One thing to note here is that 'repair' was broken for RAID1 until 
recently - see


http://marc.theaimsgroup.com/?l=linux-raid&m=116951242005315&w=2

As this patch was submitted just prior to the release of 2.6.20, this 
may be the first fixed kernel, but I have not checked.


Regards,

Richard



end to end error recovery musings

2007-02-23 Thread Ric Wheeler
In the IO/FS workshop, one idea we kicked around is the need to provide 
better and more specific error messages between the IO stack and the 
file system layer.


My group has been working to stabilize a relatively up-to-date libata + 
MD based box, so I can try to lay out at least one appliance-like 
typical configuration to help frame the issue. We are working on a 
relatively large appliance, but you can buy similar home appliances (or 
build them) that use Linux to provide a NAS in a box for end users.


The use case that we have is on an ICH6R/AHCI box with 4 large (500+ GB) 
drives, with some of the small system partitions on a 4-way RAID1 
device. The libata version we have is a back-port of 2.6.18 onto SLES10, 
so the error handling at the libata level is a huge improvement over 
what we had before.


Each box has a watchdog timer that can be set to fire after at most 2 
minutes.


(We have a second flavor of this box with an ICH5 and P-ATA drives using 
the non-libata drivers that has a similar use case).


Using the patches that Mark sent around recently for error injection, we 
inject media errors into one or more drives and try to see how smoothly 
error handling runs and, importantly, whether or not the error handling 
will complete before the watchdog fires and reboots the box.  If you 
want to be especially mean, inject errors into the RAID superblocks on 3 
out of the 4 drives.


We still have the following challenges:

   (1) read-ahead often means that we will retry every bad sector at 
least twice from the file system level. The first time, the fs read-ahead 
request triggers a speculative read that includes the bad sector 
(triggering the error handling mechanisms), right before the read from 
the real application does the same thing.  Not sure what the answer is 
here, since read-ahead is obviously a huge win in the normal case.


   (2) the patches that were floating around on how to make sure that 
we effectively handle single sector errors in a large IO request are 
critical. On one hand, we want to combine adjacent IO requests into 
larger IO's whenever possible. On the other hand, when the combined IO 
fails, we need to isolate the error to the correct range, avoid 
reissuing a request that touches that sector again and communicate up 
the stack to file system/MD what really failed.  All of this needs to 
complete in tens of seconds, not multiple minutes.


   (3) The timeout values on the failed IOs need to be tuned well (as 
was discussed in an earlier linux-ide thread). We cannot afford to hang 
for 30 seconds, especially in the MD case, since you might need to fail 
more than one device for a single IO.  Prompt error propagation (say 
that 4 times quickly!) can allow MD to mask the underlying errors as you 
would hope; hanging on too long will almost certainly cause a watchdog 
reboot...


   (4) The newish libata+SCSI stack is pretty good at handling disk 
errors, but adding in MD actually can reduce the reliability of your 
system unless you tune the error handling correctly.


We will follow up with specific issues as they arise, but I wanted to 
lay out a use case that can help frame part of the discussion.  I also 
want to encourage people to inject real disk errors with the Mark 
patches so we can share the pain ;-)


ric





Re: end to end error recovery musings

2007-02-23 Thread H. Peter Anvin

Ric Wheeler wrote:


We still have the following challenges:

   (1) read-ahead often means that we will  retry every bad sector at 
least twice from the file system level. The first time, the fs read 
ahead request triggers a speculative read that includes the bad sector 
(triggering the error handling mechanisms) right before the real 
application triggers a read does the same thing.  Not sure what the 
answer is here since read-ahead is obviously a huge win in the normal case.




Probably the only sane thing to do is to remember the bad sectors and 
avoid attempting to read them; that would mean marking 'automatic' 
versus 'explicitly requested' requests to determine whether or not to 
filter them against a list of discovered bad blocks.


-hpa


Re: end to end error recovery musings

2007-02-23 Thread Andreas Dilger
On Feb 23, 2007  16:03 -0800, H. Peter Anvin wrote:
 Ric Wheeler wrote:
(1) read-ahead often means that we will  retry every bad sector at 
 least twice from the file system level. The first time, the fs read 
 ahead request triggers a speculative read that includes the bad sector 
 (triggering the error handling mechanisms) right before the real 
 application triggers a read does the same thing.  Not sure what the 
 answer is here since read-ahead is obviously a huge win in the normal case.
 
 Probably the only sane thing to do is to remember the bad sectors and 
 avoid attempting reading them; that would mean marking automatic 
 versus explicitly requested requests to determine whether or not to 
 filter them against a list of discovered bad blocks.

And clearing this list when the sector is overwritten, as it will almost
certainly be relocated at the disk level.  For that matter, a huge win
would be to have the MD RAID layer rewrite only the bad sector (in hopes
of the disk relocating it) instead of failing the whole disk.  Otherwise,
a few read errors on different disks in a RAID set can take the whole
system offline.  Apologies if this is already done in recent kernels...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: end to end error recovery musings

2007-02-23 Thread H. Peter Anvin

Andreas Dilger wrote:

And clearing this list when the sector is overwritten, as it will almost
certainly be relocated at the disk level.


Certainly if the overwrite is successful.

-hpa


Re: nonzero mismatch_cnt with no earlier error

2007-02-23 Thread Eyal Lebedinsky
I have since done a resync, which ended up with the same mismatch_cnt of 184.
I noticed that the count *was* reset to zero when the resync started,
but it ended up at 184 (the same as after the check).

I thought that a resync just calculates fresh parity and does not
bother checking whether it is different. So what does this final count mean?

This leads me to ask: why bother doing a check if I will always run
a resync after an error? Wouldn't it be better to run a resync in the first place?

-- 
Eyal Lebedinsky ([EMAIL PROTECTED]) http://samba.org/eyal/
attach .zip as .dat