Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-15 Thread dean gaudet
On Tue, 15 Jan 2008, Andrew Morton wrote:

> On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet <[EMAIL PROTECTED]> 
> wrote:
> 
> > On Mon, 14 Jan 2008, NeilBrown wrote:
> > 
> > > 
> > > raid5's 'make_request' function calls generic_make_request on
> > > underlying devices and if we run out of stripe heads, it could end up
> > > waiting for one of those requests to complete.
> > > This is bad as recursive calls to generic_make_request go on a queue
> > > and are not even attempted until make_request completes.
> > > 
> > > So: don't make any generic_make_request calls in raid5 make_request
> > > until all waiting has been done.  We do this by simply setting
> > > STRIPE_HANDLE instead of calling handle_stripe().
> > > 
> > > If we need more stripe_heads, raid5d will get called to process the
> > > pending stripe_heads which will call generic_make_request from a
> > > different thread where no deadlock will happen.
> > > 
> > > 
> > > This change by itself causes a performance hit.  So add a change so
> > > that raid5_activate_delayed is only called at unplug time, never in
> > > raid5.  This seems to bring back the performance numbers.  Calling it
> > > in raid5d was sometimes too soon...
> > > 
> > > Cc: "Dan Williams" <[EMAIL PROTECTED]>
> > > Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> > 
> > probably doesn't matter, but for the record:
> > 
> > Tested-by: dean gaudet <[EMAIL PROTECTED]>
> > 
> > this time i tested with internal and external bitmaps and it survived 8h 
> > and 14h resp. under the parallel tar workload i used to reproduce the 
> > hang.
> > 
> > btw this should probably be a candidate for 2.6.22 and .23 stable.
> > 
> 
> hm, Neil said
> 
>   The first fixes a bug which could make it a candidate for 24-final. 
>   However it is a deadlock that seems to occur very rarely, and has been in
>   mainline since 2.6.22.  So letting it into one more release shouldn't be
>   a big problem.  While the fix is fairly simple, it could have some
>   unexpected consequences, so I'd rather go for the next cycle.
> 
> food fight!
> 

heheh.

it's really easy to reproduce the hang without the patch -- i could
hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
i'll try with ext3... Dan's experiences suggest it won't happen with ext3
(or is even more rare), which would explain why this is overall a
rare problem.

but it doesn't result in data loss or permanent system hangups as long
as you can become root and raise the size of the stripe cache...
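
for reference, a rough sketch of the workaround (the array name /dev/md2
is just my setup -- adjust to taste; the sysfs value is a number of stripe
cache entries, not bytes):

    # as root, when the array wedges:
    cat /sys/block/md2/md/stripe_cache_active        # how many entries are in use
    echo 1024 > /sys/block/md2/md/stripe_cache_size  # bump it until i/o resumes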

so OK i agree with Neil, let's test more... food fight over! :)

-dean


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-15 Thread dean gaudet
On Mon, 14 Jan 2008, NeilBrown wrote:

> 
> raid5's 'make_request' function calls generic_make_request on
> underlying devices and if we run out of stripe heads, it could end up
> waiting for one of those requests to complete.
> This is bad as recursive calls to generic_make_request go on a queue
> and are not even attempted until make_request completes.
> 
> So: don't make any generic_make_request calls in raid5 make_request
> until all waiting has been done.  We do this by simply setting
> STRIPE_HANDLE instead of calling handle_stripe().
> 
> If we need more stripe_heads, raid5d will get called to process the
> pending stripe_heads which will call generic_make_request from a
> different thread where no deadlock will happen.
> 
> 
> This change by itself causes a performance hit.  So add a change so
> that raid5_activate_delayed is only called at unplug time, never in
> raid5.  This seems to bring back the performance numbers.  Calling it
> in raid5d was sometimes too soon...
> 
> Cc: "Dan Williams" <[EMAIL PROTECTED]>
> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

probably doesn't matter, but for the record:

Tested-by: dean gaudet <[EMAIL PROTECTED]>

this time i tested with internal and external bitmaps and it survived 8h 
and 14h resp. under the parallel tar workload i used to reproduce the 
hang.

btw this should probably be a candidate for 2.6.22 and .23 stable.

thanks
-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread dean gaudet
On Fri, 11 Jan 2008, Neil Brown wrote:

> Thanks.
> But I suspect you didn't test it with a bitmap :-)
> I ran the mdadm test suite and it hit a problem - easy enough to fix.

damn -- i "lost" my bitmap 'cause it was external and i didn't have things 
set up properly to pick it up after a reboot :)

if you send an updated patch i'll give it another spin...

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread dean gaudet
On Thu, 10 Jan 2008, Neil Brown wrote:

> On Wednesday January 9, [EMAIL PROTECTED] wrote:
> > On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
> > > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > > 
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > > 
> > > which was Neil's change in 2.6.22 for deferring generic_make_request 
> > > until there's enough stack space for it.
> > > 
> > 
> > Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
> > by preventing recursive calls to generic_make_request.  However the
> > following conditions can cause raid5 to hang until 'stripe_cache_size' is
> > increased:
> > 
> 
> Thanks for pursuing this guys.  That explanation certainly sounds very
> credible.
> 
> The generic_make_request_immed is a good way to confirm that we have
> found the bug,  but I don't like it as a long term solution, as it
> just reintroduced the problem that we were trying to solve with the
> problematic commit.
> 
> As you say, we could arrange that all request submission happens in
> raid5d and I think this is the right way to proceed.  However we can
> still take some of the work into the thread that is submitting the
> IO by calling "raid5d()" at the end of make_request, like this.
> 
> Can you test it please?  Does it seem reasonable?
> 
> Thanks,
> NeilBrown
> 
> 
> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

it has passed 11h of the untar/diff/rm linux.tar.gz workload... that's 
pretty good evidence it works for me.  thanks!

Tested-by: dean gaudet <[EMAIL PROTECTED]>

> 
> ### Diffstat output
>  ./drivers/md/md.c    |    2 +-
>  ./drivers/md/raid5.c |    4 +++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff .prev/drivers/md/md.c ./drivers/md/md.c
> --- .prev/drivers/md/md.c 2008-01-07 13:32:10.0 +1100
> +++ ./drivers/md/md.c 2008-01-10 11:08:02.0 +1100
> @@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev)
>   if (mddev->ro)
>   return;
>  
> - if (signal_pending(current)) {
> + if (current == mddev->thread->tsk && signal_pending(current)) {
>   if (mddev->pers->sync_request) {
>   printk(KERN_INFO "md: %s in immediate safe mode\n",
>  mdname(mddev));
> 
> diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
> --- .prev/drivers/md/raid5.c  2008-01-07 13:32:10.0 +1100
> +++ ./drivers/md/raid5.c  2008-01-10 11:06:54.0 +1100
> @@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req
>   }
>  }
>  
> +static void raid5d (mddev_t *mddev);
>  
>  static int make_request(struct request_queue *q, struct bio * bi)
>  {
> @@ -3547,7 +3548,7 @@ static int make_request(struct request_q
>   goto retry;
>   }
>   finish_wait(&conf->wait_for_overlap, &w);
> - handle_stripe(sh, NULL);
> + set_bit(STRIPE_HANDLE, &sh->state);
>   release_stripe(sh);
>   } else {
>   /* cannot get stripe for read-ahead, just give-up */
> @@ -3569,6 +3570,7 @@ static int make_request(struct request_q
> test_bit(BIO_UPTODATE, &bi->bi_flags)
>   ? 0 : -EIO);
>   }
> + raid5d(mddev);
>   return 0;
>  }
>  


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread dean gaudet
On Thu, 10 Jan 2008, Neil Brown wrote:

> On Wednesday January 9, [EMAIL PROTECTED] wrote:
> > On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
> > > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > > 
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > > 
> > > which was Neil's change in 2.6.22 for deferring generic_make_request 
> > > until there's enough stack space for it.
> > > 
> > 
> > Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
> > by preventing recursive calls to generic_make_request.  However the
> > following conditions can cause raid5 to hang until 'stripe_cache_size' is
> > increased:
> > 
> 
> Thanks for pursuing this guys.  That explanation certainly sounds very
> credible.
> 
> The generic_make_request_immed is a good way to confirm that we have
> found the bug,  but I don't like it as a long term solution, as it
> just reintroduced the problem that we were trying to solve with the
> problematic commit.
> 
> As you say, we could arrange that all request submission happens in
> raid5d and I think this is the right way to proceed.  However we can
> still take some of the work into the thread that is submitting the
> IO by calling "raid5d()" at the end of make_request, like this.
> 
> Can you test it please?  Does it seem reasonable?


i've got this running now (against 2.6.24-rc6)... it has passed ~25 
minutes of testing so far, which is a good sign.  i'll report back 
tomorrow and hopefully we'll have survived 8h+ of testing.

thanks!

w.r.t. dan's cfq comments -- i really don't know the details, but does 
this mean cfq will misattribute the IO to the wrong user/process?  or is 
it just a concern that CPU time will be spent on someone's IO?  the latter 
is fine to me... the former seems sucky because with today's multicore 
systems CPU time seems cheap compared to IO.

-dean


Re: Raid 1, can't get the second disk added back in.

2008-01-09 Thread dean gaudet
On Tue, 8 Jan 2008, Bill Davidsen wrote:

> Neil Brown wrote:
> > On Monday January 7, [EMAIL PROTECTED] wrote:
> >   
> > > Problem is not raid, or at least not obviously raid related.  The problem
> > > is that the whole disk, /dev/hdb is unavailable. 
> > 
> > Maybe check /sys/block/hdb/holders ?  lsof /dev/hdb ?
> > 
> > good luck :-)
> > 
> >   
> losetup -a may help, lsof doesn't seem to show files used in loop mounts. Yes,
> long shot...

and don't forget "dmsetup ls"... (followed immediately by "apt-get remove 
evms" if you're on an unfortunate version of ubuntu which helpfully 
installed that partition-stealing service for you.)

-dean


Re: [patch] improve stripe_cache_size documentation

2007-12-30 Thread dean gaudet
On Sun, 30 Dec 2007, dean gaudet wrote:

> On Sun, 30 Dec 2007, Thiemo Nagel wrote:
> 
> > >stripe_cache_size  (currently raid5 only)
> > 
> > As far as I have understood, it applies to raid6, too.
> 
> good point... and raid4.
> 
> here's an updated patch.

and once again with a typo fix.  oops.

-dean

Signed-off-by: dean gaudet <[EMAIL PROTECTED]>

Index: linux/Documentation/md.txt
===
--- linux.orig/Documentation/md.txt 2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt  2007-12-30 14:30:40.0 -0800
@@ -435,8 +435,14 @@
 
 These currently include
 
-  stripe_cache_size  (currently raid5 only)
+  stripe_cache_size  (raid4, raid5 and raid6)
   number of entries in the stripe cache.  This is writable, but
   there are upper and lower limits (32768, 16).  Default is 128.
-  strip_cache_active (currently raid5 only)
+
+  The stripe cache memory is locked down and not available for other uses.
+  The total size of the stripe cache is determined by this formula:
+
+PAGE_SIZE * raid_disks * stripe_cache_size
+
+  stripe_cache_active (raid4, raid5 and raid6)
   number of active entries in the stripe cache



Re: [patch] improve stripe_cache_size documentation

2007-12-30 Thread dean gaudet
On Sun, 30 Dec 2007, Thiemo Nagel wrote:

> >stripe_cache_size  (currently raid5 only)
> 
> As far as I have understood, it applies to raid6, too.

good point... and raid4.

here's an updated patch.

-dean

Signed-off-by: dean gaudet <[EMAIL PROTECTED]>

Index: linux/Documentation/md.txt
===
--- linux.orig/Documentation/md.txt 2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt  2007-12-30 10:16:58.0 -0800
@@ -435,8 +435,14 @@
 
 These currently include
 
-  stripe_cache_size  (currently raid5 only)
+  stripe_cache_size  (raid4, raid5 and raid6)
   number of entries in the stripe cache.  This is writable, but
   there are upper and lower limits (32768, 16).  Default is 128.
-  strip_cache_active (currently raid5 only)
+
+  The stripe cache memory is locked down and not available for other uses.
+  The total size of the stripe cache is determined by this formula:
+
+PAGE_SIZE * raid_disks * stripe_cache_size
+
+  strip_cache_active (raid4, raid5 and raid6)
   number of active entries in the stripe cache


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-30 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote:

> On Dec 29, 2007 1:58 PM, dean gaudet <[EMAIL PROTECTED]> wrote:
> > On Sat, 29 Dec 2007, Dan Williams wrote:
> >
> > > On Dec 29, 2007 9:48 AM, dean gaudet <[EMAIL PROTECTED]> wrote:
> > > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) 
> > > > on
> > > > the same 64k chunk array and had raised the stripe_cache_size to 1024...
> > > > and got a hang.  this time i grabbed stripe_cache_active before bumping
> > > > the size again -- it was only 905 active.  as i recall the bug we were
> > > > debugging a year+ ago the active was at the size when it would hang.  so
> > > > this is probably something new.
> > >
> > > I believe I am seeing the same issue and am trying to track down
> > > whether XFS is doing something unexpected, i.e. I have not been able
> > > to reproduce the problem with EXT3.  MD tries to increase throughput
> > > by letting some stripe work build up in batches.  It looks like every
> > > time your system has hung it has been in the 'inactive_blocked' state
> > > i.e. > 3/4 of stripes active.  This state should automatically
> > > clear...
> >
> > cool, glad you can reproduce it :)
> >
> > i have a bit more data... i'm seeing the same problem on debian's
> > 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.
> >
> 
> This is just brainstorming at this point, but it looks like xfs can
> submit more requests in the bi_end_io path such that it can lock
> itself out of the RAID array.  The sequence that concerns me is:
> 
> return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->
> 
> I need verify whether this path is actually triggering, but if we are
> in an inactive_blocked condition this new request will be put on a
> wait queue and we'll never get to the release_stripe() call after
> return_io().  It would be interesting to see if this is new XFS
> behavior in recent kernels.


i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

which was Neil's change in 2.6.22 for deferring generic_make_request
until there's enough stack space for it.

with my git tree sync'd to that commit my test cases fail in under 20
minutes uptime (i rebooted and tested 3x).  sync'd to the commit previous
to it i've got 8h of run-time now without the problem.

this isn't definitive of course since it does seem to be timing
dependent, but since all failures have occurred much earlier than that
for me so far i think this indicates this change is either the cause of
the problem or exacerbates an existing raid5 problem.

given that this problem looks like a very rare problem i saw with 2.6.18
(raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an
existing problem... not that i have evidence either way.

i've attached a new kernel log with a hang at d89d87965d... and the
reduced config file i was using for the bisect.  hopefully the hang
looks the same as what we were seeing at 2.6.24-rc6.  let me know.

-dean

kern.log.d89d87965d.bz2
Description: Binary data


config-2.6.21-b1.bz2
Description: Binary data


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet

On Sat, 29 Dec 2007, dean gaudet wrote:

> On Sat, 29 Dec 2007, Justin Piszcz wrote:
> 
> > Curious btw what kind of filesystem size/raid type (5, but defaults I 
> > assume,
> > nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache
> > size/chunk size(s) are you using/testing with?
> 
> mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
> mkfs.xfs -f /dev/md2
> 
> otherwise defaults

hmm i missed a few things, here's exactly how i created the array:

mdadm --create --level=5 --chunk=64 -n7 -x1 --assume-clean /dev/md2 
/dev/sd[a-h]1

it's reassembled automagically each reboot, but i do this each reboot:

mkfs.xfs -f /dev/md2
mount -o noatime /dev/md2 /mnt/new
./dma_thrasher linux.tar.gz /mnt/new

the --assume-clean and noatime probably make no difference though...

on the bisection front it looks like it's new behaviour between 2.6.21.7 
and 2.6.22.15 (stock kernels now, not debian).

i've got to step out for a while, but i'll go at it again later, probably 
with git bisect unless someone has some cherry picked changes to suggest.

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Justin Piszcz wrote:

> Curious btw what kind of filesystem size/raid type (5, but defaults I assume,
> nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache
> size/chunk size(s) are you using/testing with?

mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
mkfs.xfs -f /dev/md2

otherwise defaults

> The script you sent out earlier, you are able to reproduce it easily with 31
> or so kernel tar decompressions?

not sure, the point of the script is to untar more than there is RAM.  it 
happened with a single rsync running though -- 3.5M inodes from a remote 
box.  it also happens with the single 10GB dd write... although i've been 
using the tar method for testing different kernel revs.

-dean


[patch] improve stripe_cache_size documentation

2007-12-29 Thread dean gaudet
Document the amount of memory used by the stripe cache and the fact that 
it's tied down and unavailable for other purposes (right?).  thanks to Dan 
Williams for the formula.

-dean

Signed-off-by: dean gaudet <[EMAIL PROTECTED]>

Index: linux/Documentation/md.txt
===
--- linux.orig/Documentation/md.txt 2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt  2007-12-29 13:04:17.0 -0800
@@ -438,5 +438,11 @@
   stripe_cache_size  (currently raid5 only)
   number of entries in the stripe cache.  This is writable, but
   there are upper and lower limits (32768, 16).  Default is 128.
+
+  The stripe cache memory is locked down and not available for other uses.
+  The total size of the stripe cache is determined by this formula:
+
+PAGE_SIZE * raid_disks * stripe_cache_size
+
   strip_cache_active (currently raid5 only)
   number of active entries in the stripe cache
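
a quick sketch of that arithmetic, pulling the inputs back out of sysfs
(md2 is just an example array name, and this assumes the usual 4KiB
PAGE_SIZE):

    page=$(getconf PAGE_SIZE)
    disks=$(cat /sys/block/md2/md/raid_disks)
    entries=$(cat /sys/block/md2/md/stripe_cache_size)
    echo $(( page * disks * entries ))   # e.g. 4096 * 8 * 256 = 8388608 bytes (8 MiB)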


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote:

> On Dec 29, 2007 9:48 AM, dean gaudet <[EMAIL PROTECTED]> wrote:
> > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
> > the same 64k chunk array and had raised the stripe_cache_size to 1024...
> > and got a hang.  this time i grabbed stripe_cache_active before bumping
> > the size again -- it was only 905 active.  as i recall the bug we were
> > debugging a year+ ago the active was at the size when it would hang.  so
> > this is probably something new.
> 
> I believe I am seeing the same issue and am trying to track down
> whether XFS is doing something unexpected, i.e. I have not been able
> to reproduce the problem with EXT3.  MD tries to increase throughput
> by letting some stripe work build up in batches.  It looks like every
> time your system has hung it has been in the 'inactive_blocked' state
> i.e. > 3/4 of stripes active.  This state should automatically
> clear...

cool, glad you can reproduce it :)

i have a bit more data... i'm seeing the same problem on debian's 
2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

i'm doing some more isolation but just grabbing kernels i have precompiled 
so far -- a 2.6.19.7 kernel doesn't show the problem, and early 
indications are a 2.6.21.7 kernel also doesn't have the problem but i'm 
giving it longer to show its head.

i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just 
so we get the debian patches out of the way.

i was tempted to blame async api because it's newish :)  but according to 
the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async 
API, and it still hung, so async is probably not to blame.

anyhow the test case i'm using is the dma_thrasher script i attached... it 
takes about an hour to give me confidence there's no problems so this will 
take a while.

-dean


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread dean gaudet
On Tue, 25 Dec 2007, Bill Davidsen wrote:

> The issue I'm thinking about is hardware sector size, which on modern drives
> may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle
> when writing a 512b block.

i'm not sure any shipping SATA disks have larger than 512B sectors yet... 
do you know of any?  (or is this thread about SCSI which i don't pay 
attention to...)

on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:

255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

so sda1 starts at a non-multiple of 4096 into the disk.
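
the arithmetic behind that claim, spelled out:

    echo $(( 63 * 512 )) $(( (63 * 512) % 4096 ))
    # prints "32256 3584" -- the partition starts 3584 bytes past a 4KiB boundary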

i ran some random seek+write experiments using randomio, here are the results using 512 byte
and 4096 byte writes (fsync after each write), 8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
   total |  read:          latency (ms)       |  write:         latency (ms)
    iops |   iops    min    avg    max   sdev |   iops    min    avg    max   sdev
---------+------------------------------------+------------------------------------
   148.5 |    0.0    inf    nan    0.0    nan |  148.5    0.2   53.7   89.3   19.5
   129.2 |    0.0    inf    nan    0.0    nan |  129.2   37.2   61.9   96.7    9.3
   131.2 |    0.0    inf    nan    0.0    nan |  131.2   40.3   61.0   90.4    9.3
   132.0 |    0.0    inf    nan    0.0    nan |  132.0   39.6   60.6   89.3    9.1
   130.7 |    0.0    inf    nan    0.0    nan |  130.7   39.8   61.3   98.1    8.9
   131.4 |    0.0    inf    nan    0.0    nan |  131.4   40.0   60.8  101.0    9.6
# ./randomio /dev/sda1 8 1 1 4096 10 6
   total |  read:          latency (ms)       |  write:         latency (ms)
    iops |   iops    min    avg    max   sdev |   iops    min    avg    max   sdev
---------+------------------------------------+------------------------------------
   141.7 |    0.0    inf    nan    0.0    nan |  141.7    0.3   56.3   99.3   21.1
   132.4 |    0.0    inf    nan    0.0    nan |  132.4   43.3   60.4   91.8    8.5
   131.6 |    0.0    inf    nan    0.0    nan |  131.6   41.4   60.9  111.0    9.6
   131.8 |    0.0    inf    nan    0.0    nan |  131.8   41.4   60.7   85.3    8.6
   130.6 |    0.0    inf    nan    0.0    nan |  130.6   41.7   61.3   95.0    9.4
   131.4 |    0.0    inf    nan    0.0    nan |  131.4   42.2   60.8   90.5    8.4


i think the anomalous results in the first 10s samples are perhaps the drive
coming out of a standby state.

and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
   total |  read:          latency (ms)       |  write:         latency (ms)
    iops |   iops    min    avg    max   sdev |   iops    min    avg    max   sdev
---------+------------------------------------+------------------------------------
   147.3 |    0.0    inf    nan    0.0    nan |  147.3    0.3   54.1   93.7   20.1
   132.4 |    0.0    inf    nan    0.0    nan |  132.4   37.4   60.6   91.8    9.2
   132.5 |    0.0    inf    nan    0.0    nan |  132.5   37.7   60.3   93.7    9.3
   131.8 |    0.0    inf    nan    0.0    nan |  131.8   39.4   60.7   92.7    9.0
   133.9 |    0.0    inf    nan    0.0    nan |  133.9   41.7   59.8   90.7    8.5
   130.2 |    0.0    inf    nan    0.0    nan |  130.2   40.8   61.5   88.6    8.9
# ./randomio /dev/sda 8 1 1 4096 10 6
   total |  read:          latency (ms)       |  write:         latency (ms)
    iops |   iops    min    avg    max   sdev |   iops    min    avg    max   sdev
---------+------------------------------------+------------------------------------
   145.4 |    0.0    inf    nan    0.0    nan |  145.4    0.3   54.9   94.0   20.1
   130.3 |    0.0    inf    nan    0.0    nan |  130.3   36.0   61.4   92.7    9.6
   130.6 |    0.0    inf    nan    0.0    nan |  130.6   38.2   61.2   96.7    9.2
   132.1 |    0.0    inf    nan    0.0    nan |  132.1   39.0   60.5   93.5    9.2
   131.8 |    0.0    inf    nan    0.0    nan |  131.8   43.1   60.8   93.8    9.1
   129.0 |    0.0    inf    nan    0.0    nan |  129.0   40.2   62.0   96.4    8.8

it looks pretty much the same to me...

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on 
the same 64k chunk array and had raised the stripe_cache_size to 1024... 
and got a hang.  this time i grabbed stripe_cache_active before bumping 
the size again -- it was only 905 active.  as i recall the bug we were 
debugging a year+ ago the active was at the size when it would hang.  so 
this is probably something new.

anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to 
hit that limit too if i try harder :)

btw what units are stripe_cache_size/active in?  is the memory consumed 
equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size * 
raid_disks * stripe_cache_active)?

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

> hmm this seems more serious... i just ran into it with chunksize 64KiB and 
> while just untarring a bunch of linux kernels in parallel... increasing 
> stripe_cache_size did the trick again.
> 
> -dean
> 
> On Thu, 27 Dec 2007, dean gaudet wrote:
> 
> > hey neil -- remember that raid5 hang which me and only one or two others 
> > ever experienced and which was hard to reproduce?  we were debugging it 
> > well over a year ago (that box has 400+ day uptime now so at least that 
> > long ago :)  the workaround was to increase stripe_cache_size... i seem to 
> > have a way to reproduce something which looks much the same.
> > 
> > setup:
> > 
> > - 2.6.24-rc6
> > - system has 8GiB RAM but no swap
> > - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
> > - mkfs.xfs default options
> > - mount -o noatime
> > - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
> > 
> > that sequence hangs for me within 10 seconds... and i can unhang / rehang 
> > it by toggling between stripe_cache_size 256 and 1024.  i detect the hang 
> > by watching "iostat -kx /dev/sd? 5".
> > 
> > i've attached the kernel log where i dumped task and timer state while it 
> > was hung... note that you'll see at some point i did an xfs mount with 
> > external journal but it happens with internal journal as well.
> > 
> > looks like it's using the raid456 module and async api.
> > 
> > anyhow let me know if you need more info / have any suggestions.
> > 
> > -dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread dean gaudet
On Thu, 27 Dec 2007, Justin Piszcz wrote:

> With that high of a stripe size the stripe_cache_size needs to be greater than
> the default to handle it.

i'd argue that any deadlock is a bug...

regardless i'm still seeing deadlocks with the default chunk_size of 64k 
and stripe_cache_size of 256... in this case it's with a workload which is 
untarring 34 copies of the linux kernel at the same time.  it's a variant 
of doug ledford's memtest, and i've attached it.

-dean

#!/usr/bin/perl

# Copyright (c) 2007 dean gaudet <[EMAIL PROTECTED]>
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
# OTHER DEALINGS IN THE SOFTWARE.

# this idea shamelessly stolen from doug ledford

use warnings;
use strict;

# ensure stdout is not buffered
select(STDOUT); $| = 1;

my $usage = "usage: $0 linux.tar.gz /path1 [/path2 ...]\n";
defined(my $tarball = shift) or die $usage;
-f $tarball or die "$tarball does not exist or is not a file\n";

my @paths = @ARGV;
$#paths >= 0 or die "$usage";

# determine size of uncompressed tarball
open(GZIP, "-|") || exec "gzip", "--quiet", "--list", $tarball;
my $line = <GZIP>;
my ($tarball_size) = $line =~ m#^\s*\d+\s*(\d+)#;
defined($tarball_size) or die "unexpected result from gzip --quiet --list $tarball\n";
close(GZIP);

# determine amount of memory
open(MEMINFO, "</proc/meminfo") or die "unable to open /proc/meminfo: $!\n";
my $total_mem;
while (<MEMINFO>) {
  if (/^MemTotal:\s*(\d+)\s*kB/) {
$total_mem = $1;
last;
  }
}
defined($total_mem) or die "did not find MemTotal line in /proc/meminfo\n";
close(MEMINFO);
$total_mem *= 1024;

print "total memory: $total_mem\n";
print "uncompressed tarball: $tarball_size\n";
my $nr_simultaneous = int(1.2 * $total_mem / $tarball_size);
print "nr simultaneous processes: $nr_simultaneous\n";

sub system_or_die {
  my @args = @_;
  system(@args);
  if ($? == -1) {
    my $msg = sprintf("%s failed to exec %s: $!\n", scalar(localtime), $args[0]);
    die $msg;
  }
  elsif ($? & 127) {
    my $msg = sprintf("%s %s died with signal %d, %s coredump\n",
      scalar(localtime), $args[0], ($? & 127), ($? & 128) ? "with" : "without");
    die $msg;
  }
  elsif (($? >> 8) != 0) {
    my $msg = sprintf("%s %s exited with non-zero exit code %d\n",
      scalar(localtime), $args[0], $? >> 8);
    die $msg;
  }
}

sub untar($) {
  mkdir($_[0]) or die localtime()." unable to mkdir($_[0]): $!\n";
  system_or_die("tar", "-xzf", $tarball, "-C", $_[0]);
}

print localtime()." untarring golden copy\n";
my $golden = $paths[0]."/dma_tmp.$$.gold";
untar($golden);

my $pass_no = 0;
while (1) {
  print localtime()." pass $pass_no: extracting\n";
  my @outputs;
  foreach my $n (1..$nr_simultaneous) {
# treat paths in a round-robin manner
my $dir = shift(@paths);
push(@paths, $dir);

$dir .= "/dma_tmp.$$.$n";
push(@outputs, $dir);

my $pid = fork;
defined($pid) or die localtime()." unable to fork: $!\n";
if ($pid == 0) {
  untar($dir);
  exit(0);
}
  }

  # wait for the children
  while (wait != -1) {}

  print localtime()." pass $pass_no: diffing\n";
  foreach my $dir (@outputs) {
my $pid = fork;
defined($pid) or die localtime()." unable to fork: $!\n";
if ($pid == 0) {
  system_or_die("diff", "-U", "3", "-rN", $golden, $dir);
  system_or_die("rm", "-fr", $dir);
  exit(0);
}
  }

  # wait for the children
  while (wait != -1) {}

  ++$pass_no;
}


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread dean gaudet
hmm this seems more serious... i just ran into it with chunksize 64KiB and 
while just untarring a bunch of linux kernels in parallel... increasing 
stripe_cache_size did the trick again.

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

> hey neil -- remember that raid5 hang which me and only one or two others 
> ever experienced and which was hard to reproduce?  we were debugging it 
> well over a year ago (that box has 400+ day uptime now so at least that 
> long ago :)  the workaround was to increase stripe_cache_size... i seem to 
> have a way to reproduce something which looks much the same.
> 
> setup:
> 
> - 2.6.24-rc6
> - system has 8GiB RAM but no swap
> - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
> - mkfs.xfs default options
> - mount -o noatime
> - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
> 
> that sequence hangs for me within 10 seconds... and i can unhang / rehang 
> it by toggling between stripe_cache_size 256 and 1024.  i detect the hang 
> by watching "iostat -kx /dev/sd? 5".
> 
> i've attached the kernel log where i dumped task and timer state while it 
> was hung... note that you'll see at some point i did an xfs mount with 
> external journal but it happens with internal journal as well.
> 
> looks like it's using the raid456 module and async api.
> 
> anyhow let me know if you need more info / have any suggestions.
> 
> -dean


Re: external bitmaps.. and more

2007-12-11 Thread dean gaudet
On Thu, 6 Dec 2007, Michael Tokarev wrote:

> I come across a situation where external MD bitmaps
> aren't usable on any "standard" linux distribution
> unless special (non-trivial) actions are taken.
> 
> First is a small buglet in mdadm, or two.
> 
> It's not possible to specify --bitmap= in assemble
> command line - the option seems to be ignored.  But
> it's honored when specified in config file.

i think neil fixed this at some point -- i ran into it / reported 
essentially the same problems here a while ago.


> The thing is that when a external bitmap is being used
> for an array, and that bitmap resides on another filesystem,
> all common distributions fails to start/mount and to
> shutdown/umount arrays/filesystems properly, because
> all starts/stops is done in one script, and all mounts/umounts
> in another, but for bitmaps to work the two should be intermixed
> with each other.

so i've got a debian unstable box which has uptime 402 days (to give you 
an idea how long ago i last tested the reboot sequence).  it has raid1 
root and raid5 /home.  /home has an external bitmap on the root partition.

i have /etc/default/mdadm set with INITRDSTART to start only the root 
raid1 during initrd... this manages to work out later when the external 
bitmap is required.

but it is fragile... and i think it's only possible to get things to work 
with an initrd and the external bitmap on the root fs or by having custom 
initrd and/or rc.d scripts.
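
something like this in /etc/default/mdadm is what i mean (device names
are just from my layout, and the exact variable syntax may differ between
mdadm package versions -- treat it as a sketch):

    # assemble only the raid1 root array from the initramfs; the raid5 with
    # the external bitmap on the root fs gets assembled later by the rc scripts
    INITRDSTART='/dev/md0'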

-dean


Re: Raid array is not automatically detected.

2007-07-17 Thread dean gaudet


On Mon, 16 Jul 2007, David Greaves wrote:

> Bryan Christ wrote:
> > I do have the type set to 0xfd.  Others have said that auto-assemble only
> > works on RAID 0 and 1, but just as Justin mentioned, I too have another box
> > with RAID5 that gets auto assembled by the kernel (also no initrd).  I
> > expected the same behavior when I built this array--again using mdadm
> > instead of raidtools.
> 
> Any md arrays with partition type 0xfd using a 0.9 superblock should be
> auto-assembled by a standard kernel.

no... debian (and probably ubuntu) do not build md into the kernel, they 
build it as a module, and the module does not auto-detect 0xfd.  i don't 
know anything about slackware, but i just felt it worth commenting that "a 
standard kernel" is not really descriptive enough.

-dean


Re: limits on raid

2007-06-17 Thread dean gaudet
On Sun, 17 Jun 2007, Wakko Warner wrote:

> What benefit would I gain by using an external journel and how big would it
> need to be?

i don't know how big the journal needs to be... i'm limited by xfs'
maximum journal size of 128MiB.

i don't have much benchmark data -- but here are some rough notes i took
when i was evaluating a umem NVRAM card.  since the pata disks in the
raid1 have write caching enabled it's somewhat of an unfair comparison,
but the important info is the 88 seconds for internal journal vs. 81
seconds for external journal.

-dean

time sh -c 'tar xf /var/tmp/linux-2.6.20.tar; sync'

xfs journal    raid5 bitmap    times
internal       none            0.18s user 2.14s system 2% cpu 1:27.95 total
internal       internal        0.16s user 2.16s system 1% cpu 2:01.12 total
raid1          none            0.07s user 2.02s system 2% cpu 1:20.62 total
raid1          internal        0.14s user 2.01s system 1% cpu 1:55.18 total
raid1          raid1           0.14s user 2.03s system 2% cpu 1:20.61 total
umem           none            0.13s user 2.07s system 2% cpu 1:20.77 total
umem           internal        0.15s user 2.16s system 2% cpu 1:51.28 total
umem           umem            0.12s user 2.13s system 2% cpu 1:20.50 total


raid5:
- 4x seagate 7200.10 400GB on marvell MV88SX6081
- mdadm --create --level=5 --raid-devices=4 /dev/md4 /dev/sd[abcd]1

raid1:
- 2x maxtor 6Y200P0 on 3ware 7504
- two 128MiB partitions starting at cyl 1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md1 
/dev/sd[fg]1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md2 
/dev/sd[fg]2
- md1 is used for external xfs journal
- md2 has an ext3 filesystem for the external md4 bitmap

xfs:
- mkfs.xfs issued before each run using the defaults (aside from -l 
logdev=/dev/md1)
- mount -o noatime,nodiratime[,logdev=/dev/md1] 

umem:
- 512MiB Micro Memory MM-5415CN
- 2 partitions similar to the raid1 setup


Re: limits on raid

2007-06-17 Thread dean gaudet
On Sun, 17 Jun 2007, Wakko Warner wrote:

> dean gaudet wrote:
> > On Sat, 16 Jun 2007, Wakko Warner wrote:
> > 
> > > When I've had an unclean shutdown on one of my systems (10x 50gb raid5) 
> > > it's
> > > always slowed the system down when booting up.  Quite significantly I must
> > > say.  I wait until I can login and change the rebuild max speed to slow it
> > > down while I'm using it.   But that is another thing.
> > 
> > i use an external write-intent bitmap on a raid1 to avoid this... you 
> > could use internal bitmap but that slows down i/o too much for my tastes.  
> > i also use an external xfs journal for the same reason.  2 disk raid1 for 
> > root/journal/bitmap, N disk raid5 for bulk storage.  no spindles in 
> > common.
> 
> I must remember this if I have to rebuild the array.  Although I'm
> considering moving to a hardware raid solution when I upgrade my storage.

you can do it without a rebuild -- that's in fact how i did it the first 
time.

to add an external bitmap:

mdadm --grow --bitmap /bitmapfile /dev/mdX

plus add "bitmap=/bitmapfile" to mdadm.conf... as in:

ARRAY /dev/md4 bitmap=/bitmap.md4 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc

you can also easily move an ext3 journal to an external journal with 
tune2fs (see man page).
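
for ext3 the move looks roughly like this (device names are assumptions:
mdX is the data filesystem, md1 is the small raid1 you want the journal
on; the block sizes of the two need to match):

    tune2fs -O ^has_journal /dev/mdX          # drop the internal journal
    mke2fs -O journal_dev /dev/md1            # turn md1 into a journal device
    tune2fs -j -J device=/dev/md1 /dev/mdX    # reattach using the external journal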

if you use XFS it's a bit more of a challenge to convert from internal to 
external, but see this thread:

http://marc.theaimsgroup.com/?l=linux-xfs&m=106929781232520&w=2

i found that i had to do "sb 1", "sb 2", ..., "sb N" for all sb rather 
than just the "sb 0" that email instructed me to do.

-dean


Re: limits on raid

2007-06-16 Thread dean gaudet
On Sat, 16 Jun 2007, Wakko Warner wrote:

> When I've had an unclean shutdown on one of my systems (10x 50gb raid5) it's
> always slowed the system down when booting up.  Quite significantly I must
> say.  I wait until I can login and change the rebuild max speed to slow it
> down while I'm using it.   But that is another thing.

i use an external write-intent bitmap on a raid1 to avoid this... you 
could use internal bitmap but that slows down i/o too much for my tastes.  
i also use an external xfs journal for the same reason.  2 disk raid1 for 
root/journal/bitmap, N disk raid5 for bulk storage.  no spindles in 
common.

-dean


Re: limits on raid

2007-06-16 Thread dean gaudet
On Sat, 16 Jun 2007, David Greaves wrote:

> Neil Brown wrote:
> > On Friday June 15, [EMAIL PROTECTED] wrote:
> >  
> > >   As I understand the way
> > > raid works, when you write a block to the array, it will have to read all
> > > the other blocks in the stripe and recalculate the parity and write it
> > > out.
> > 
> > Your understanding is incomplete.
> 
> Does this help?
> [for future reference so you can paste a url and save the typing for code :) ]
> 
> http://linux-raid.osdl.org/index.php/Initial_Array_Creation

i fixed a typo and added one more note which i think is quite fair:

It is also safe to use --assume-clean if you are performing
performance measurements of different raid configurations. Just
be sure to rebuild your array without --assume-clean when you
decide on your final configuration.

-dean


Re: XFS sunit/swidth for raid10

2007-03-25 Thread dean gaudet
On Fri, 23 Mar 2007, Peter Rabbitson wrote:

> dean gaudet wrote:
> > On Thu, 22 Mar 2007, Peter Rabbitson wrote:
> > 
> > > dean gaudet wrote:
> > > > On Thu, 22 Mar 2007, Peter Rabbitson wrote:
> > > > 
> > > > > Hi,
> > > > > How does one determine the XFS sunit and swidth sizes for a software
> > > > > raid10
> > > > > with 3 copies?
> > > > mkfs.xfs uses the GET_ARRAY_INFO ioctl to get the data it needs from
> > > > software raid and select an appropriate sunit/swidth...
> > > > 
> > > > although i'm not sure i agree entirely with its choice for raid10:
> > > So do I, especially as it makes no checks for the amount of copies (3 in
> > > my
> > > case, not 2).
> > > 
> > > > it probably doesn't matter.
> > > This was essentially my question. For an array -pf3 -c1024 I get swidth =
> > > 4 *
> > > sunit = 4MiB. Is it about right and does it matter at all?
> > 
> > how many drives?
> > 
> 
> Sorry. 4 drives, 3 far copies (so any 2 drives can fail), 1M chunk.

my mind continues to be blown by linux raid10.

so that's like raid1 on 4 disks except the copies are offset by 1/4th of 
the disk?

i think swidth = 4*sunit is the right config then -- 'cause a read of 4MiB 
will stride all 4 disks...
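
so if you wanted to set it by hand at mkfs time it would look something
like this (sunit/swidth are given in 512-byte sectors; 1MiB chunk and a
4-disk stride assumed per the above, /dev/md0 is a placeholder):

    mkfs.xfs -d sunit=2048,swidth=8192 /dev/md0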

-dean


Re: XFS sunit/swidth for raid10

2007-03-22 Thread dean gaudet
On Thu, 22 Mar 2007, Peter Rabbitson wrote:

> dean gaudet wrote:
> > On Thu, 22 Mar 2007, Peter Rabbitson wrote:
> > 
> > > Hi,
> > > How does one determine the XFS sunit and swidth sizes for a software
> > > raid10
> > > with 3 copies?
> > 
> > mkfs.xfs uses the GET_ARRAY_INFO ioctl to get the data it needs from
> > software raid and select an appropriate sunit/swidth...
> > 
> > although i'm not sure i agree entirely with its choice for raid10:
> 
> So do I, especially as it makes no checks for the amount of copies (3 in my
> case, not 2).
> 
> > it probably doesn't matter.
> 
> This was essentially my question. For an array -pf3 -c1024 I get swidth = 4 *
> sunit = 4MiB. Is it about right and does it matter at all?

how many drives?

-dean


Re: XFS sunit/swidth for raid10

2007-03-22 Thread dean gaudet
On Thu, 22 Mar 2007, Peter Rabbitson wrote:

> Hi,
> How does one determine the XFS sunit and swidth sizes for a software raid10
> with 3 copies?

mkfs.xfs uses the GET_ARRAY_INFO ioctl to get the data it needs from 
software raid and select an appropriate sunit/swidth...

although i'm not sure i agree entirely with its choice for raid10:

*sunit = md.chunk_size >> 9;
*swidth = *sunit * md.raid_disks;

i'd think it would depend on the layout of the raid10 (near, far, 
offset)... for near2 on 4 disks i'd expect swidth to be only 2*sunit... 
but for far2 on 4 disks i'd expect 4*sunit... but i'm not sure.  it 
probably doesn't matter.

-dean


Re: mdadm: raid1 with ext3 - filesystem size differs?

2007-03-20 Thread dean gaudet
it looks like you created the filesystem on the component device before 
creating the raid.
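
i.e. the order should be (a sketch, reusing the device names from your
transcript):

    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkfs.ext3 /dev/md1    # now the fs is sized to what md exposes (48064 blocks here)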

-dean

On Fri, 16 Mar 2007, Hanno Meyer-Thurow wrote:

> Hi all!
> Please CC me on answers since I am not subscribed to this list, thanks.
> 
> When I try to build a raid1 system with mdadm 2.6.1 the filesystem size
> recorded in superblock differs from physical size of device.
> 
> System:
> ana ~ # uname -a
> Linux ana 2.6.20-gentoo-r2 #4 SMP PREEMPT Sat Mar 10 16:25:46 CET 2007 x86_64 
> Intel(R) Core(TM)2 CPU  6600  @ 2.40GHz GenuineIntel GNU/Linux
> 
> ana ~ # mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
> mdadm: /dev/sda1 appears to contain an ext2fs file system
> size=48152K  mtime=Thu Mar 15 17:27:07 2007
> mdadm: /dev/sda1 appears to be part of a raid array:
> level=raid1 devices=2 ctime=Thu Mar 15 17:25:52 2007
> mdadm: /dev/sdb1 appears to contain an ext2fs file system
> size=48152K  mtime=Thu Mar 15 17:27:07 2007
> mdadm: /dev/sdb1 appears to be part of a raid array:
> level=raid1 devices=2 ctime=Thu Mar 15 17:25:52 2007
> Continue creating array? y
> 
> mdadm: array /dev/md1 started.
> ana ~ # cat /proc/mdstat
> md1 : active raid1 sdb1[1] sda1[0]
>   48064 blocks [2/2] [UU]
> 
> ana ~ # mdadm --misc --detail /dev/md1
> /dev/md1:
> Version : 00.90.03
>   Creation Time : Thu Mar 15 17:37:35 2007
>  Raid Level : raid1
>  Array Size : 48064 (46.95 MiB 49.22 MB)
>   Used Dev Size : 48064 (46.95 MiB 49.22 MB)
>Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 1
> Persistence : Superblock is persistent
> 
> Update Time : Thu Mar 15 17:38:27 2007
>   State : clean
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
> 
>UUID : cf0478ee:7e60a40e:20a5e204:cc7bc2c9
>  Events : 0.4
> 
> Number   Major   Minor   RaidDevice State
>0   810  active sync   /dev/sda1
>1   8   171  active sync   /dev/sdb1
> 
> ana ~ # LC_ALL=C fsck.ext3 /dev/md1
> e2fsck 1.39 (29-May-2006)
> The filesystem size (according to the superblock) is 48152 blocks
> The physical size of the device is 48064 blocks
> Either the superblock or the partition table is likely to be corrupt!
> Abort? yes
> 
> 
> 
> Any ideas what could be wrong? Thank you in advance for help!
> 
> 
> Regards,
> Hanno


Re: Replace drive in RAID5 without losing redundancy?

2007-03-05 Thread dean gaudet


On Tue, 6 Mar 2007, Neil Brown wrote:

> On Monday March 5, [EMAIL PROTECTED] wrote:
> > 
> > Is it possible to mark a disk as "to be replaced by an existing spare",
> > then migrate to the spare disk and kick the old disk _after_ migration
> > has been done? Or not even kick - but mark as new spare.
> 
> No, this is not possible yet.
> You can get nearly all the way there by:
> 
>   - add an internal bitmap.
>   - fail one drive
>   - --build a raid1 with that drive (and the other missing)
>   - re-add the raid1 into the raid5
>   - add the new drive to the raid1
>   - wait for resync

i have an example at 
... plus 
discussion as to why this isn't the best solution.

-dean


Re: Linux Software RAID Bitmap Question

2007-02-28 Thread dean gaudet
On Mon, 26 Feb 2007, Neil Brown wrote:

> On Sunday February 25, [EMAIL PROTECTED] wrote:
> > I believe Neil stated that using bitmaps does incur a 10% performance 
> > penalty.  If one's box never (or rarely) crashes, is a bitmap needed?
> 
> I think I said it "can" incur such a penalty.  The actual cost is very
> dependant on work-load.

i did a crude benchmark recently... to get some data for a common setup
i use (external journals and bitmaps on raid1, xfs fs on raid5).

emphasis on "crude":

time sh -c 'tar xf /var/tmp/linux-2.6.20.tar; sync'

xfs journal    raid5 bitmap    times
internal       none            0.18s user 2.14s system 2% cpu 1:27.95 total
internal       internal        0.16s user 2.16s system 1% cpu 2:01.12 total
raid1          none            0.07s user 2.02s system 2% cpu 1:20.62 total
raid1          internal        0.14s user 2.01s system 1% cpu 1:55.18 total
raid1          raid1           0.14s user 2.03s system 2% cpu 1:20.61 total


raid5:
- 4x seagate 7200.10 400GB on marvell MV88SX6081
- mdadm --create --level=5 --raid-devices=4 /dev/md4 /dev/sd[abcd]1

raid1:
- 2x maxtor 6Y200P0 on 3ware 7504
- two 128MiB partitions starting at cyl 1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md1 
/dev/sd[fg]1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md2 
/dev/sd[fg]2
- md1 is used for external xfs journal
- md2 has an ext3 filesystem for the external md4 bitmap

xfs:
- mkfs.xfs issued before each run using the defaults (aside from -l 
logdev=/dev/md1)
- mount -o noatime,nodiratime[,logdev=/dev/md1] 

system:
- dual opteron 848 (2.2ghz), 8GiB ddr 266
- tyan s2882
- 2.6.20

-dean


Re: Reshaping raid0/10

2007-02-21 Thread dean gaudet
On Thu, 22 Feb 2007, Neil Brown wrote:

> On Wednesday February 21, [EMAIL PROTECTED] wrote:
> > Hello,
> > 
> > 
> > 
> > are there any plans to support reshaping
> > on raid0 and raid10?
> > 
> 
> No concrete plans.  It largely depends on time and motivation.
> I expect that the various flavours of raid5/raid6 reshape will come
> first.
> Then probably converting raid0->raid5.
> 
> I really haven't given any thought to how you might reshape a
> raid10...

i've got a 4x250 near2 i want to turn into a 4x750 near2.  i was 
considering doing straight dd from each of the 250 to the respective 750 
then doing an mdadm --create on the 750s (in the same ordering as the 
original array)... so i'd end up with a new array with more stripes.  it 
seems like this should work.

the same thing should work for all nearN with a multiple of N disks... and 
offsetN should work as well right?  but farN sounds like a nightmare.

if we had a generic "proactive disk replacement" method it could handle 
the 4x250->4x750 step.  (i haven't decided yet if i want to try my hacky 
bitmap method of doing proactive replacement... i'm not sure what'll 
happen if i add a 750GB disk back into an array with 250s... i suppose 
it'll work... i'll have to experiment.)

-dean


Re: md autodetect only detects one disk in raid1

2007-01-27 Thread dean gaudet
take a look at your mdadm.conf ... both on your root fs and in your 
initrd... look for a DEVICES line and make sure it says "DEVICES 
partitions"... anything else is likely to cause problems like below.

also make sure each array is specified by UUID rather than device.
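
a minimal sketch of the sort of mdadm.conf i mean (the UUID below is only
an example value -- use whatever mdadm --detail reports for your array):

    DEVICE partitions
    ARRAY /dev/md0 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc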

and then rebuild your initrd.  (dpkg-reconfigure linux-image-`uname -r` on 
debuntu).

that "something else in the system claim use of the device" problem makes 
me guess you're on ubuntu pre-edgy... where for whatever reason they 
included evms in the default install and for whatever inane reason evms 
steals every damn device in the system when it starts up.  
uninstall/deactivate evms if you're not using it.

-dean

On Sat, 27 Jan 2007, kenneth johansson wrote:

> I run raid1 on my root partition /dev/md0. Now I had a bad disk so I had
> to replace it but did not notice until I got home that I got a SATA
> instead of a PATA. Since I had a free sata interface I just put in in
> that. I had no problem adding the disk to the raid1 device that is until
> I rebooted the computer. 
> 
> both the PATA disk and the SATA disk are detected before md start up the
> raid but only the PATA disk is activated. So the raid device is always
> booting in degraded mode. since this is the root disk I use the
> autodetect feature with partition type fd.
> 
> Also Something else in the system claim use of the device since I can
> not add the SATA disk after the system has done a complete boot. I guess
> it has something to do with device mapper and LVM that I also run on the
> data disks but I'm not sure. any tip on what it can be??
> 
> If I add the SATA disk to md0 early enough in the boot it works but why
> is it not autodetected ?
> 
> 
> 


Re: bad performance on RAID 5

2007-01-18 Thread dean gaudet
On Wed, 17 Jan 2007, Sevrin Robstad wrote:

> I'm suffering from bad performance on my RAID5.
> 
> a "echo check >/sys/block/md0/md/sync_action"
> 
> gives a speed at only about 5000K/sec , and HIGH load average :
> 
> # uptime
> 20:03:55 up 8 days, 19:55,  1 user,  load average: 11.70, 4.04, 1.52

iostat -kx /dev/sd? 10  ... and sum up the total IO... 

also try increasing sync_speed_min/max
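
for example (numbers invented -- the values are KB/sec per device):

   echo 50000  > /sys/block/md0/md/sync_speed_min
   echo 200000 > /sys/block/md0/md/sync_speed_max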

and a loadavg jump like that suggests to me you have other things 
competing for the disk at the same time as the "check".

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-15 Thread dean gaudet
On Mon, 15 Jan 2007, Mr. James W. Laferriere wrote:

>   Hello Dean ,
> 
> On Mon, 15 Jan 2007, dean gaudet wrote:
> ...snip...
> > it should just be:
> > 
> > echo check >/sys/block/mdX/md/sync_action
> > 
> > if you don't have a /sys/block/mdX/md/sync_action file then your kernel is
> > too old... or you don't have /sys mounted... (or you didn't replace X with
> > the raid number :)
> > 
> > iirc there were kernel versions which had the sync_action file but didn't
> > yet support the "check" action (i think possibly even as recent as 2.6.17
> > had a small bug initiating one of the sync_actions but i forget which
> > one).  if you can upgrade to 2.6.18.x it should work.
> > 
> > debian unstable (and i presume etch) will do this for all your arrays
> > automatically once a month.
> > 
> > -dean
> 
>   Being able to run a 'check' is a good thing (tm) .  But without a
> method to acquire statii & data back from the check ,  Seems rather bland .
> Is there a tool/file to poll/... where data & statii can be acquired ?

i'm not 100% certain what you mean, but i generally just monitor dmesg for 
the md read error message (mind you the message pre-2.6.19 or .20 isn't 
very informative but it's obvious enough).

there is also a file mismatch_cnt in the same directory as sync_action ... 
the Documentation/md.txt (in 2.6.18) refers to it incorrectly as 
mismatch_count... but anyhow why don't i just repaste the relevant portion 
of md.txt.
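
e.g. after a "check" finishes you can just do:

   cat /sys/block/mdX/md/mismatch_cnt

and a non-zero count means some stripes didn't agree.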

-dean

...

Active md devices for levels that support data redundancy (1,4,5,6)
also have

   sync_action
     a text file that can be used to monitor and control the rebuild
     process.  It contains one word which can be one of:
       resync        - redundancy is being recalculated after unclean
                       shutdown or creation
       recover       - a hot spare is being built to replace a
                       failed/missing device
       idle          - nothing is happening
       check         - A full check of redundancy was requested and is
                       happening.  This reads all blocks and checks
                       them.  A repair may also happen for some raid
                       levels.
       repair        - A full check and repair is happening.  This is
                       similar to 'resync', but was requested by the
                       user, and the write-intent bitmap is NOT used to
                       optimise the process.

      This file is writable, and each of the strings that could be
      read are meaningful for writing.

       'idle' will stop an active resync/recovery etc.  There is no
       guarantee that another resync/recovery may not be automatically
       started again, though some event will be needed to trigger
       this.
       'resync' or 'recovery' can be used to restart the
       corresponding operation if it was stopped with 'idle'.
       'check' and 'repair' will start the appropriate process
       providing the current state is 'idle'.

   mismatch_count
      When performing 'check' and 'repair', and possibly when
      performing 'resync', md will count the number of errors that are
      found.  The count in 'mismatch_cnt' is the number of sectors
      that were re-written, or (for 'check') would have been
      re-written.  As most raid levels work in units of pages rather
      than sectors, this may be larger than the number of actual errors
      by a factor of the number of sectors in a page.



Re: raid5 software vs hardware: parity calculations?

2007-01-15 Thread dean gaudet
On Mon, 15 Jan 2007, berk walker wrote:

> dean gaudet wrote:
> > echo check >/sys/block/mdX/md/sync_action
> > 
> > it'll read the entire array (parity included) and correct read errors as
> > they're discovered.

> 
> Could I get a pointer as to how I can do this "check" in my FC5 [BLAG] system?
> I can find no appropriate "check", nor "md" available to me.  It would be a
> "good thing" if I were able to find potentially weak spots, rewrite them to
> good, and know that it might be time for a new drive.
> 
> All of my arrays have drives of approx the same mfg date, so the possibility
> of more than one showing bad at the same time can not be ignored.

it should just be:

echo check >/sys/block/mdX/md/sync_action

if you don't have a /sys/block/mdX/md/sync_action file then your kernel is 
too old... or you don't have /sys mounted... (or you didn't replace X with 
the raid number :)

iirc there were kernel versions which had the sync_action file but didn't 
yet support the "check" action (i think possibly even as recent as 2.6.17 
had a small bug initiating one of the sync_actions but i forget which 
one).  if you can upgrade to 2.6.18.x it should work.

debian unstable (and i presume etch) will do this for all your arrays 
automatically once a month.

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-15 Thread dean gaudet
On Mon, 15 Jan 2007, Robin Bowes wrote:

> I'm running RAID6 instead of RAID5+1 - I've had a couple of instances
> where a drive has failed in a RAID5+1 array and a second has failed
> during the rebuild after the hot-spare had kicked in.

if the failures were read errors without losing the entire disk (the 
typical case) then new kernels are much better -- on read error md will 
reconstruct the sectors from the other disks and attempt to write it back.

you can also run monthly "checks"...

echo check >/sys/block/mdX/md/sync_action

it'll read the entire array (parity included) and correct read errors as 
they're discovered.

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-13 Thread dean gaudet
On Sat, 13 Jan 2007, Robin Bowes wrote:

> Bill Davidsen wrote:
> >
> > There have been several recent threads on the list regarding software
> > RAID-5 performance. The reference might be updated to reflect the poor
> > write performance of RAID-5 until/unless significant tuning is done.
> > Read that as tuning obscure parameters and throwing a lot of memory into
> > stripe cache. The reasons for hardware RAID should include "performance
> > of RAID-5 writes is usually much better than software RAID-5 with
> > default tuning.
> 
> Could you point me at a source of documentation describing how to
> perform such tuning?
> 
> Specifically, I have 8x500GB WD SATA drives on a Supermicro PCI-X 8-port
> SATA card configured as a single RAID6 array (~3TB available space)

linux sw raid6 small write performance is bad because it reads the entire 
stripe, merges the small write, and writes back the changed disks.  
unlike raid5 where a small write can get away with a partial stripe read 
(i.e. the smallest raid5 write will read the target disk, read the parity, 
write the target, and write the updated parity)... afaik this optimization 
hasn't been implemented in raid6 yet.

depending on your use model you might want to go with raid5+spare.  
benchmark if you're not sure.

for raid5/6 i always recommend experimenting with moving your fs journal 
to a raid1 device instead (on separate spindles -- such as your root 
disks).

if this is for a database or fs requiring lots of small writes then 
raid5/6 are generally a mistake... raid10 is the only way to get 
performance.  (hw raid5/6 with nvram support can help a bit in this area, 
but you just can't beat raid10 if you need lots of writes/s.)

beyond those config choices you'll want to become friendly with /sys/block 
and all the myriad of subdirectories and options under there.

in particular:

/sys/block/*/queue/scheduler
/sys/block/*/queue/read_ahead_kb
/sys/block/*/queue/nr_requests
/sys/block/mdX/md/stripe_cache_size

for * = any of the component disks or the mdX itself...

some systems have an /etc/sysfs.conf you can place these settings in to 
have them take effect on reboot.  (sysfsutils package on debuntu)
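
if i remember the sysfs.conf syntax right, the entries look something like 
this (values are just illustrations, not recommendations -- fill in the 
real device names for X):

   block/mdX/md/stripe_cache_size = 4096
   block/sdX/queue/scheduler = deadline
   block/sdX/queue/read_ahead_kb = 512
   block/sdX/queue/nr_requests = 512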

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-12 Thread dean gaudet
On Thu, 11 Jan 2007, James Ralston wrote:

> I'm having a discussion with a coworker concerning the cost of md's
> raid5 implementation versus hardware raid5 implementations.
> 
> Specifically, he states:
> 
> > The performance [of raid5 in hardware] is so much better with the
> > write-back caching on the card and the offload of the parity, it
> > seems to me that the minor increase in work of having to upgrade the
> > firmware if there's a buggy one is a highly acceptable trade-off to
> > the increased performance.  The md driver still commits you to
> > longer run queues since IO calls to disk, parity calculator and the
> > subsequent kflushd operations are non-interruptible in the CPU.  A
> > RAID card with write-back cache releases the IO operation virtually
> > instantaneously.
> 
> It would seem that his comments have merit, as there appears to be
> work underway to move stripe operations outside of the spinlock:
> 
> http://lwn.net/Articles/184102/
> 
> What I'm curious about is this: for real-world situations, how much
> does this matter?  In other words, how hard do you have to push md
> raid5 before doing dedicated hardware raid5 becomes a real win?

hardware with battery backed write cache is going to beat the software at 
small write traffic latency essentially all the time but it's got nothing 
to do with the parity computation.

-dean


Re: Shrinking a RAID1--superblock problems

2006-12-12 Thread dean gaudet
On Tue, 12 Dec 2006, Jonathan Terhorst wrote:

> I need to shrink a RAID1 array and am having trouble with the
> persistent superblock; namely, mdadm --grow doesn't seem to relocate
> it. If I downsize the array and then shrink the corresponding
> partitions, the array fails since the superblock (which is normally
> located near the end of the device) now lays outside of the
> partitions. Is there any easier way to deal with this than digging
> into the mdadm source, manually calculating the superblock offset and
> dd'ing it to the right spot?

i'd think it'd be easier to recreate the array using --assume-clean after 
the shrink.  for raid1 it's extra easy because you don't need to get the 
disk ordering correct.

in fact with raid1 you don't even need to use mdadm --grow... you could do 
something like the following (assuming you've already shrunk the 
filesystem):

mdadm --stop /dev/md0
mdadm --zero-superblock /dev/sda1
mdadm --zero-superblock /dev/sdb1
fdisk /dev/sda  ... shrink partition
fdisk /dev/sdb  ... shrink partition
mdadm --create --assume-clean --level=1 -n2 /dev/md0 /dev/sd[ab]1

heck that same technique works for raid0/4/5/6 and raid10 "near" and 
"offset" layouts as well, doesn't it?  raid10 "far" layout definitely 
needs blocks rearranged to shrink.  in these other modes you'd need to be 
careful about recreating the array with the correct ordering of disks.

the zero-superblock step is an important defense against future problems 
with "assemble every array i find"-types of initrds that are unfortunately 
becoming common (i.e. debian and ubuntu).

-dean


Re: Observations of a failing disk

2006-11-27 Thread dean gaudet
On Tue, 28 Nov 2006, Richard Scobie wrote:

> Anyway, my biggest concern is why
> 
> echo repair > /sys/block/md5/md/sync_action
> 
> appeared to have no effect at all, when I understand that it should re-write
> unreadable sectors?

i've had the same thing happen on a seagate 7200.8 pata 400GB... and went 
through the same sequence of operations you described, and the dd fixed 
it.

one theory was that i lucked out and the pending sectors were in the unused 
part of the disk near the md superblock... but since that's in general only about 90KB 
of disk i was kind of skeptical.  it's certainly possible, but seems 
unlikely.

another theory is that a pending sector doesn't always result in a read 
error -- i.e. depending on temperature?  but the question is, why wouldn't 
the disk try rewriting it if it does get a successful read.

i wish hard drives were a little less voodoo.

-dean


Re: Raid 1 (non) performance

2006-11-19 Thread dean gaudet
On Wed, 15 Nov 2006, Magnus Naeslund(k) wrote:

> # cat /proc/mdstat
> Personalities : [raid1]
> md2 : active raid1 sda3[0] sdb3[1]
>   236725696 blocks [2/2] [UU]
> 
> md1 : active raid1 sda2[0] sdb2[1]
>   4192896 blocks [2/2] [UU]
> 
> md0 : active raid1 sda1[0] sdb1[1]
>   4192832 blocks [2/2] [UU]

i see you have split /var and / on the same spindle... if your /home is on 
/ then you're causing extra seek action by having two active filesystems 
on the same spindles.  another option to consider is to make / small and 
mostly read-only and move /home to /var/home (and use a symlink or mount 
--bind to place it at /home).

or just put everything in one big / filesystem.

hopefully your swap isn't being used much anyhow.

try "iostat -kx /dev/sd* 5" and see if the split is causing you troubles 
-- i/o activity on more than one partition at once.


> I've tried to modify the queuing by doing this, to disable the write cache 
> and enable CFQ. The CFQ choice is rather random.
> 
> for disk in sda sdb; do
>   blktool /dev/$disk wcache off
>   hdparm -q -W 0 /dev/$disk

turning off write caching is a recipe for disastrous performance on most 
ata disks... unfortunately.  better to buy a UPS and set up nut or apcupsd 
or something to handle shutdown.  or just take your chances.

-dean


Re: safest way to swap in a new physical disk

2006-11-18 Thread dean gaudet
On Tue, 14 Nov 2006, Will Sheffler wrote:

> Hi.
> 
> What is the safest way to switch out a disk in a software raid array created
> with mdadm? I'm not talking about replacing a failed disk, I want to take a
> healthy disk in the array and swap it for another physical disk. Specifically,
> I have an array made up of 10 250gb software-raid partitions on 8 300gb disks
> and 2 250gb disks, plus a hot spare. I want to switch the 250s to new 300gb
> disks so everything matches. Is there a way to do this without risking a
> rebuild? I can't back everything up, so I want to be as risk-free as possible.
> 
> I guess what I want is to do something like this:
> 
> (1) Unmount the array
> (2) Un-create the array
> (3) Somehow exactly duplicate partition X to a partition Y on a new disk
> (4) Re-create array with X gone and Y in it's place
> (5) Check if the array is OK without changing/activating it
> (6) If there is a problem, switch from Y back to X and have it as though
> nothing changed
> 
> The part I'm worried about is (3), as I've tried duplicating partition images
> before and it never works right. Is there a way to do this with mdadm?

if you have a recent enough kernel (2.6.15 i think) and recent enough 
mdadm (2.2.x i think) you can do this all online without losing redundancy 
for more than a few seconds... i placed a copy of instructions and further 
discussions of what types of problems this method has here:

http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

it's actually perfect for your situation.

-dean


Re: raid5 hang on get_active_stripe

2006-11-15 Thread dean gaudet
and i haven't seen it either... neil do you think your latest patch was 
hiding the bug?  'cause there was an iteration of an earlier patch which 
didn't produce much spam in dmesg but the bug was still there, then there 
is the version below which spams dmesg a fair amount but i didn't see the 
bug in ~30 days.

btw i've upgraded that box to 2.6.18.2 without the patch (it had some 
conflicts)... haven't seen the bug yet though (~10 days so far).

hmm i wonder if i could reproduce it more rapidly if i lowered 
/sys/block/mdX/md/stripe_cache_size.  i'll give that a go.

-dean


On Tue, 14 Nov 2006, Chris Allen wrote:

> You probably guessed that no matter what I did, I never, ever saw the problem
> when your
> trace was installed. I'd guess at some obscure timing-related problem. I can
> still trigger it
> consistently with a vanilla 2.6.17_SMP though, but again only when bitmaps are
> turned on.
> 
> 
> 
> Neil Brown wrote:
> > On Tuesday October 10, [EMAIL PROTECTED] wrote:
> >   
> > > Very happy to. Let me know what you'd like me to do.
> > > 
> > 
> > Cool thanks.
> > 
> > At the end is a patch against 2.6.17.11, though it should apply against
> > any later 2.6.17 kernel.
> > Apply this and reboot.
> > 
> > Then run
> > 
> >while true
> >do cat /sys/block/mdX/md/stripe_cache_active
> >   sleep 10
> >done > /dev/null
> > 
> > (maybe write a little script or whatever).  Leave this running. It
> > effects the check for "has raid5 hung".  Make sure to change "mdX" to
> > whatever is appropriate.
> > 
> > Occasionally look in the kernel logs for
> >plug problem:
> > 
> > if you find that, send me the surrounding text - there should be about
> > a dozen lines following this one.
> > 
> > Hopefully this will let me know which is last thing to happen: a plug
> > or an unplug.
> > If the last is a "plug", then the timer really should still be
> > pending, but isn't (this is impossible).  So I'll look more closely at
> > that option.
> > If the last is an "unplug", then the 'Plugged' flag should really be
> > clear but it isn't (this is impossible).  So I'll look more closely at
> > that option.
> > 
> > Dean is running this, but he only gets the hang every couple of
> > weeks.  If you get it more often, that would help me a lot.
> > 
> > Thanks,
> > NeilBrown
> > 
> > 
> > diff ./.patches/orig/block/ll_rw_blk.c ./block/ll_rw_blk.c
> > --- ./.patches/orig/block/ll_rw_blk.c   2006-08-21 09:52:46.0 
> > +1000
> > +++ ./block/ll_rw_blk.c 2006-10-05 11:33:32.0 +1000
> > @@ -1546,6 +1546,7 @@ static int ll_merge_requests_fn(request_
> >   * This is called with interrupts off and no requests on the queue and
> >   * with the queue lock held.
> >   */
> > +static atomic_t seq = ATOMIC_INIT(0);
> >  void blk_plug_device(request_queue_t *q)
> >  {
> > WARN_ON(!irqs_disabled());
> > @@ -1558,9 +1559,16 @@ void blk_plug_device(request_queue_t *q)
> > return;
> > if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
> > +   q->last_plug = jiffies;
> > +   q->plug_seq = atomic_read(&seq);
> > +   atomic_inc(&seq);
> > mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
> > blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
> > -   }
> > +   } else
> > +   q->last_plug_skip = jiffies;
> > +   if (!timer_pending(&q->unplug_timer) &&
> > +   !q->unplug_work.pending)
> > +   printk("Neither Timer or work are pending\n");
> >  }
> >   EXPORT_SYMBOL(blk_plug_device);
> > @@ -1573,10 +1581,17 @@ int blk_remove_plug(request_queue_t *q)
> >  {
> > WARN_ON(!irqs_disabled());
> >  -  if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
> > +   if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
> > +   q->last_unplug_skip = jiffies;
> > return 0;
> > +   }
> > del_timer(&q->unplug_timer);
> > +   q->last_unplug = jiffies;
> > +   q->unplug_seq = atomic_read(&seq);
> > +   atomic_inc(&seq);
> > +   if (test_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
> > +   printk("queue still (or again) plugged\n");
> > return 1;
> >  }
> >  @@ -1635,7 +1650,7 @@ static void blk_backing_dev_unplug(struc
> >  static void blk_unplug_work(void *data)
> >  {
> > request_queue_t *q = data;
> > -
> > +   q->last_unplug_work = jiffies;
> > blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
> > q->rq.count[READ] + q->rq.count[WRITE]);
> >  @@ -1649,6 +1664,7 @@ static void blk_unplug_timeout(unsigned
> > blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
> > q->rq.count[READ] + q->rq.count[WRITE]);
> >  +  q->last_unplug_timeout = jiffies;
> > kblockd_schedule_work(&q->unplug_work);
> >  }
> >  
> > diff ./.patches/orig/drivers/md/raid1.c ./drivers/md/raid1.c
> > --- ./.patches/orig/drivers/md/raid1.c  2006-08-10 17:28:01.0
> > +1000
> > +++ ./drivers/md/raid1.c

Re: RAID5 array showing as degraded after motherboard replacement

2006-11-07 Thread dean gaudet
On Wed, 8 Nov 2006, James Lee wrote:

> > However I'm still seeing the error messages in my dmesg (the ones I
> > posted earlier), and they suggest that there is some kind of hardware
> > fault (based on a quick Google of the error codes).  So I'm a little
> > confused.

the fact that the error is in a geometry command really makes me wonder...

did you compare the number of blocks on the device vs. what seems to be 
available when it's on the weird raid card?

-dean


Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.

2006-11-06 Thread dean gaudet
On Mon, 6 Nov 2006, Neil Brown wrote:

> This creates a deep disconnect between udev and md.
> udev expects a device to appear first, then it created the
> device-special-file in /dev.
> md expect the device-special-file to exist first, and then created the
> device on the first open.

could you create a special /dev/mdx device which is used to 
assemble/create arrays only?  i mean literally "mdx" not "mdX" where X is 
a number.  mdx would always be there if md module is loaded... so udev 
would see the driver appear and then create the /dev/mdx.  then mdadm 
would use /dev/mdx to do assemble/creates/whatever and cause other devices 
to appear/disappear in a manner which udev is happy with.

(much like how /dev/ptmx is used to create /dev/pts/N entries.)

doesn't help legacy mdadm binaries... but seems like it fits the New World 
Order.

or hm i suppose the New World Order is to eschew binary interfaces and 
suggest a /sys/class/md/ hierarchy with a bunch of files you have to splat 
ascii data into to cause an array to be created/assembled.

-dean


Re: RAID5 array showing as degraded after motherboard replacement

2006-11-06 Thread dean gaudet


On Mon, 6 Nov 2006, James Lee wrote:

> Thanks for the reply Dean.  I looked through dmesg output from the
> boot up, to check whether this was just an ordering issue during the
> system start up (since both evms and mdadm attempt to activate the
> array, which could cause things to go wrong...).
> 
> Looking through the dmesg output though, it looks like the 'missing'
> disk is being detected before the array is assembled, but that the
> disk is throwing up errors.  I've attached the full output of dmesg;
> grepping it for "hde" gives the following:
> 
> [17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings:
> hde:DMA, hdf:DMA
> [17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive
> [17179575.312000] hde: max request size: 512KiB
> [17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA
> [17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady
> SeekComplete Error }
> [17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError }
> [17179575.312000] hde: cache flushes supported

is it possible that the "NetCell SyncRAID" implementation is stealing some 
of the sectors (even though it's marked JBOD)?  anyhow it could be the 
disk is bad, but i'd still be tempted to see if the problem stays with the 
controller if you swap the disk with another in the array.

-dean


Re: mdadm 2.5.5 external bitmap assemble problem

2006-11-06 Thread dean gaudet
On Mon, 6 Nov 2006, Neil Brown wrote:

> > hey i have another related question... external bitmaps seem to pose a bit 
> > of a chicken-and-egg problem.  all of my filesystems are md devices. with 
> > an external bitmap i need at least one of the arrays to start, then have 
> > filesystems mounted, then have more arrays start... it just happens to 
> > work OK if i let debian unstale initramfs try to start all my arrays, 
> > it'll fail for the ones needing bitmap.  then later /etc/init.d/mdadm-raid 
> > should start the array.  (well it would if the bitmap= in mdadm.conf 
> > worked :)
> > 
> > is it possible to put bitmaps on devices instead of files?  mdadm seems to 
> > want a --force for that (because the device node exists already) and i 
> > haven't tried forcing it.  although i suppose a 200KB partition would be 
> > kind of tiny but i could place the bitmap right beside the external 
> > transaction log for the filesystem on the raid5.
> 
> Create the root filesystem with --bitmap=internal, and store all the
> other bitmaps on that filesystem maybe?

yeah i only have the one external bitmap (it's for a large raid5)... so 
things will work fine once i apply your patch.  thanks.

> I don't know if it would work to have a bitmap on a device, but you
> can always mkfs the device, mount it, and put a bitmap on a file
> there??

yeah this was the first thing i tried after i found mdadm -b /dev/foo 
wasn't accepted...

without modifying startup scripts there's no way to use any filesystem 
other than root... it's just due to ordering of init scripts:

# ls /etc/rcS.d | grep -i 'mount\|raid'
S02mountkernfs.sh
S04mountdevsubfs.sh
S25mdadm-raid
S35mountall.sh
S36mountall-bootclean.sh
S45mountnfs.sh
S46mountnfs-bootclean.sh

i'd need to run another mdadm-raid after the S35mountall, and then another 
mountall.

anyhow, i don't think you need to change anything (except maybe a note in 
the docs somewhere), i'm just bringing it up as part of the experience of 
trying external bitmap.  i suspect that in the wild and crazy direction 
debian and ubuntu are heading (ditching sysvinit for event-driven systems) 
it'll be "easy" to express the boot dependencies.

-dean


Re: Is my RAID broken?

2006-11-06 Thread dean gaudet
On Mon, 6 Nov 2006, Mikael Abrahamsson wrote:

> On Mon, 6 Nov 2006, Neil Brown wrote:
> 
> > So it looks like you machine recently crashed (power failure?) and it is
> > restarting.
> 
> Or upgrade some part of the OS and now it'll do resync every week or so (I
> think this is debian default nowadays, don't know the interval though).

it should be only once a month... and it's just a "check" -- it reads 
everything and corrects errors.

i think it's a great thing actually... way more useful than smart long 
self-tests because md can reconstruct read errors immediately -- before 
you lose redundancy in that stripe.

-dean

% cat /etc/cron.d/mdadm
#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft <[EMAIL PROTECTED]>
# distributed under the terms of the Artistic Licence 2.0
#
# $Id: mdadm.cron.d 147 2006-08-30 09:26:11Z madduck $
#

# By default, run at 01:06 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
6 1 * * 0 root [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ] && 
/usr/share/mdadm/checkarray --cron --all --quiet

Re: RAID5 array showing as degraded after motherboard replacement

2006-11-05 Thread dean gaudet
On Sun, 5 Nov 2006, James Lee wrote:

> Hi there,
> 
> I'm running a 5-drive software RAID5 array across two controllers.
> The motherboard in that PC recently died - I sent the board back for
> RMA.  When I refitted the motherboard, connected up all the drives,
> and booted up I found that the array was being reported as degraded
> (though all the data on it is intact).  I have 4 drives on the on
> board controller and 1 drive on an XFX Revo 64 SATA controller card.
> The drive which is being reported as not being in the array is the one
> connected to the XFX controller.
> 
> The OS can see that drive fine, and "mdadm --examine" on that drive
> shows that it is part of the array and that there are 5 active devices
> in the array.  Doing "mdadm --examine" on one of the other four drives
> shows that the array has 4 active drives and one failed.  "mdadm
> --detail" for the array also shows 4 active and one failed.

that means the array was assembled without the 5th disk and is currently 
degraded.


> Now I haven't lost any data here and I know I can just force a resync
> of the array which is fine.  However I'm concerned about how this has
> happened.  One worry is that the XFX SATA controller is doing
> something funny to the drive.  I've noticed that it's BIOS has
> defaulted to RAID0 mode (even though there's only one drive on it) - I
> can't see how this would cause any particular problems here though.  I
> guess it's possible that some data on the drive got corrupted when the
> motherboard failed...

no it's more likely the devices were renamed or the 5th device didn't come 
up before the array was assembled.

it's possible that a different bios setting led to the device using a 
different driver than is in your initrd... but i'm just guessing.

> Any ideas what could cause mdadm to report as I've described above
> (I've attached the output of these three commands)?  I'm running
> Ubuntu Edgy, which is a 2.6.17.x kernel, and mdadm 2.4.1.  In case it's
> relevant here, I created the array using EVMS...

i've never created an array with evms... but my guess is that it may have 
used "mapped" device names instead of the normal device names.  take a 
look at /proc/mdstat and see what devices are in the array and use those 
as a template to find the name of the missing device.  below i'll use 
/dev/sde1 as the example missing device and /dev/md0 as the example array.

first thing i'd try is something like this:

mdadm /dev/md0 -a /dev/sde1

which hotadds the device into the array... which will start a resync.

when the resync is done (cat /proc/mdstat) do this.

mdadm -Gb internal /dev/md0

which will add write-intent bitmaps to your device... which will avoid 
another long wait for a resync after the next reboot if the fix below 
doesn't help.

then do this:

dpkg-reconfigure linux-image-`uname -r`

which will rebuild the initrd for your kernel ... and if it was a driver 
change this should include the new driver into the initrd.

then reboot and see if it comes up fine.  if it doesn't, you can repeat 
the "-a /dev/sde1" command above... the resync will be quick this time due 
to the bitmap... and we'll have to investigate further.

-dean


Re: Checking individual drive state

2006-11-05 Thread dean gaudet
On Sun, 5 Nov 2006, Bradshaw wrote:

> I've recently built a smallish RAID5 box as a storage area for my home
> network, using mdadm. However, one of the drives will not remain in the array
> for longer that around two days before it is removed. Readding it to the array
> does not throw any errors, leading me to believe that it's probably a problem
> with the controller, which is an add-in SATA card, as well as the other drive
> connected to it failing once.
> 
> I don't know how to scan the one disk for bad sectors, stopping the array and
> doing an fsck or similar throws errors, so I need help in determining whether
> the disc itself is faulty.

try swapping the cable first.  after that swap ports with another disk and 
see if the problem follows the port or the disk.

you can see if smartctl -a (from smartmontools) tells you anything 
interesting.  (it can be quite difficult, to impossible, to understand 
smartctl -a output though.  but if you've got errors in the SMART error 
log that's a good place to start.)
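
e.g. (with sdX being the suspect disk):

   smartctl -a /dev/sdX
   smartctl -t long /dev/sdX       # kick off a long self-test
   smartctl -l selftest /dev/sdX   # look at the result a few hours later
   smartctl -l error /dev/sdX      # the SMART error log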


> If the controller is to be replaced, how would I go about migrating the two
> discs to the new controller whilst maintaining the array?

it depends on which method you're using to assemble the array at boot 
time.  in most cases if these aren't your root disks then a swap of two 
disks won't result in any troubles reassembling the array.  other device 
renames may cause problems depending on your distribution though -- but 
generally when two devices swap names within an array you should be fine.

you'll want to do the disk swap with the array offline (either shutdown 
the box or mdadm --stop the array).

-dean


Re: RAID5/10 chunk size and ext2/3 stride parameter

2006-11-04 Thread dean gaudet
On Sat, 4 Nov 2006, martin f krafft wrote:

> also sprach dean gaudet <[EMAIL PROTECTED]> [2006.11.03.2019 +0100]:
> > > I cannot find authoritative information about the relation between
> > > the RAID chunk size and the correct stride parameter to use when
> > > creating an ext2/3 filesystem.
> > 
> > you know, it's interesting -- mkfs.xfs somehow gets the right sunit/swidth 
> > automatically from the underlying md device.
> 
> i don't know enough about xfs to be able to agree or disagree with
> you on that.
> 
> > # mdadm --create --level=5 --raid-devices=4 --assume-clean --auto=yes 
> > /dev/md0 /dev/sd[abcd]1
> > mdadm: array /dev/md0 started.
> 
> with 64k chunks i assume...

yup.


> > # mkfs.xfs /dev/md0
> > meta-data=/dev/md0   isize=256agcount=32, agsize=9157232 
> > blks
> >  =   sectsz=4096  attr=0
> > data =   bsize=4096   blocks=293031424, imaxpct=25
> >  =   sunit=16 swidth=48 blks, unwritten=1
> 
> sunit seems like the stride width i determined (64k chunks / 4k
> bsize), but what is swidth? Is it 64 * 3/4 because of the four
> device RAID5?

yup.

and for a raid6 mkfs.xfs correctly gets sunit=16 swidth=32.


> > # mdadm --create --level=10 --layout=f2 --raid-devices=4 --assume-clean 
> > --auto=yes /dev/md0 /dev/sd[abcd]1
> > mdadm: array /dev/md0 started.
> > # mkfs.xfs -f /dev/md0
> > meta-data=/dev/md0   isize=256agcount=32, agsize=6104816 
> > blks
> >  =   sectsz=512   attr=0
> > data =   bsize=4096   blocks=195354112, imaxpct=25
> >  =   sunit=16 swidth=64 blks, unwritten=1
> 
> okay, so as before, 16 stride size and 64 stripe width, because
> we're now dealing with mirrors.
> 
> > # mdadm --create --level=10 --layout=n2 --raid-devices=4 --assume-clean 
> > --auto=yes /dev/md0 /dev/sd[abcd]1
> > mdadm: array /dev/md0 started.
> > # mkfs.xfs -f /dev/md0
> > meta-data=/dev/md0   isize=256agcount=32, agsize=6104816 
> > blks
> >  =   sectsz=512   attr=0
> > data =   bsize=4096   blocks=195354112, imaxpct=25
> >  =   sunit=16 swidth=64 blks, unwritten=1
> 
> why not? in this case, -n2 and -f2 aren't any different, are they?

they're different in that with f2 you get essentially 4 disk raid0 read 
performance because the copies of each byte are half a disk away... so it 
looks like a raid0 on the first half of the disks, and another raid0 on 
the second half.

in n2 the two copies are at the same offset... so it looks more like a 2 
disk raid0 for reading and writing.

i'm not 100% certain what xfs uses them for -- you can actually change the 
values at mount time.  so it probably uses them for either read scheduling 
or write layout or both.

-dean


mdadm 2.5.5 external bitmap assemble problem

2006-11-04 Thread dean gaudet
i think i've got my mdadm.conf set properly for an external bitmap -- but 
it doesn't seem to work.  i can assemble from the command-line fine 
though:

# grep md4 /etc/mdadm/mdadm.conf
ARRAY /dev/md4 bitmap=/bitmap.md4 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc

# mdadm -A /dev/md4
mdadm: Could not open bitmap file

# mdadm -A --uuid=dbc3be0b:b5853930:a02e038c:13ba8cdc --bitmap=/bitmap.md4 
/dev/md4
mdadm: /dev/md4 has been started with 5 drives and 1 spare.

# mdadm --version
mdadm - v2.5.5 - 23 October 2006

(this is on debian unstale)

btw -- mdadm seems to create the bitmap file with world readable perms.
i doubt it matters, but 600 would seem like a better mode.

hey i have another related question... external bitmaps seem to pose a bit 
of a chicken-and-egg problem.  all of my filesystems are md devices. with 
an external bitmap i need at least one of the arrays to start, then have 
filesystems mounted, then have more arrays start... it just happens to 
work OK if i let debian unstale initramfs try to start all my arrays, 
it'll fail for the ones needing bitmap.  then later /etc/init.d/mdadm-raid 
should start the array.  (well it would if the bitmap= in mdadm.conf 
worked :)

is it possible to put bitmaps on devices instead of files?  mdadm seems to 
want a --force for that (because the device node exists already) and i 
haven't tried forcing it.  although i suppose a 200KB partition would be 
kind of tiny but i could place the bitmap right beside the external 
transaction log for the filesystem on the raid5.

-dean


Re: RAID5/10 chunk size and ext2/3 stride parameter

2006-11-03 Thread dean gaudet
On Tue, 24 Oct 2006, martin f krafft wrote:

> Hi,
> 
> I cannot find authoritative information about the relation between
> the RAID chunk size and the correct stride parameter to use when
> creating an ext2/3 filesystem.

you know, it's interesting -- mkfs.xfs somehow gets the right sunit/swidth 
automatically from the underlying md device.

for example, on a box i'm testing:

# mdadm --create --level=5 --raid-devices=4 --assume-clean --auto=yes /dev/md0 
/dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs /dev/md0
meta-data=/dev/md0   isize=256agcount=32, agsize=9157232 
blks
 =   sectsz=4096  attr=0
data =   bsize=4096   blocks=293031424, imaxpct=25
 =   sunit=16 swidth=48 blks, unwritten=1
naming   =version 2  bsize=4096
log  =internal log   bsize=4096   blocks=32768, version=2
 =   sectsz=4096  sunit=1 blks
realtime =none   extsz=196608 blocks=0, rtextents=0

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --zero-superblock /dev/sd[abcd]1
# mdadm --create --level=10 --layout=f2 --raid-devices=4 --assume-clean 
--auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs -f /dev/md0
meta-data=/dev/md0   isize=256agcount=32, agsize=6104816 blks
 =   sectsz=512   attr=0
data =   bsize=4096   blocks=195354112, imaxpct=25
 =   sunit=16 swidth=64 blks, unwritten=1
naming   =version 2  bsize=4096
log  =internal log   bsize=4096   blocks=32768, version=1
 =   sectsz=512   sunit=0 blks
realtime =none   extsz=262144 blocks=0, rtextents=0


i wonder if the code could be copied into mkfs.ext3?

although hmm, i don't think it gets raid10 "n2" correct:

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --zero-superblock /dev/sd[abcd]1
# mdadm --create --level=10 --layout=n2 --raid-devices=4 --assume-clean 
--auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs -f /dev/md0
meta-data=/dev/md0   isize=256agcount=32, agsize=6104816 blks
 =   sectsz=512   attr=0
data =   bsize=4096   blocks=195354112, imaxpct=25
 =   sunit=16 swidth=64 blks, unwritten=1
naming   =version 2  bsize=4096
log  =internal log   bsize=4096   blocks=32768, version=1
 =   sectsz=512   sunit=0 blks
realtime =none   extsz=262144 blocks=0, rtextents=0


in a "near 2" layout i would expect sunit=16, swidth=32 ...  but swidth=64
probably doesn't hurt.


> My understanding is that (block * stride) == (chunk). So if I create
> a default RAID5/10 with 64k chunks, and create a filesystem with 4k
> blocks on it, I should choose stride 64k/4k = 16.

that's how i think it works -- i don't think ext[23] have a concept of "stripe
width" like xfs does.  they just want to know how to avoid putting all the
critical data on one disk (which needs only the chunk size).  but you should
probably ask on the linux-ext4 mailing list.
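
so for 64k chunks and 4k blocks i'd expect something like this to be right 
(hedging a bit since i mostly use xfs):

   mke2fs -j -b 4096 -E stride=16 /dev/md0

(older e2fsprogs spell it "-R stride=16".)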

> Is the chunk size of an array equal to the stripe size? Or is it
> (n-1)*chunk size for RAID5 and (n/2)*chunk size for a plain near=2
> RAID10?

> Also, I understand that it makes no sense to use stride for RAID1 as
> there are no stripes in that sense. But for RAID10 it makes sense,
> right?

yep.

-dean


Re: md array numbering is messed up

2006-10-30 Thread dean gaudet
On Mon, 30 Oct 2006, Brad Campbell wrote:

> Michael Tokarev wrote:
> > My guess is that it's using mdrun shell script - the same as on Debian.
> > It's a long story, the thing is quite ugly and messy and does messy things
> too, but they say it's compatibility stuff and continue shipping it.
...
> 
> I'd suggest you are probably correct. By default on Ubuntu 6.06
> 
> [EMAIL PROTECTED]:~$ cat /etc/init.d/mdadm-raid
> #!/bin/sh
> #
> # Start any arrays which are described in /etc/mdadm/mdadm.conf and which are
> # not running already.
> #
> # Copyright (c) 2001-2004 Mario Jou/3en <[EMAIL PROTECTED]>
> # Distributable under the terms of the GNU GPL version 2.
> 
> MDADM=/sbin/mdadm
> MDRUN=/sbin/mdrun

fwiw mdrun is finally on its way out.  the debian "unstable" mdadm package 
is full of new goodness (initramfs goodness, 2.5.x mdadm featurefulness, 
monthly full array check goodness).  ubuntu folks should copy it again 
before they finalize edgy.

-dean


Re: Raid 0 breakage (maybe)

2006-10-30 Thread dean gaudet
On Mon, 30 Oct 2006, Neil Brown wrote:

> > [EMAIL PROTECTED]:~# mdadm --assemble /dev/md0 /dev/hde /dev/hdi
> > mdadm: cannot open device /dev/hde: Device or resource busy
> 
> This is telling you that /dev/hde - or one of it's partitions - is
> "Busy".  This means more than just 'open'.  It means mounted or
> included in an md or dm array, or used for swap.
> You need to find out what is keeping it busy.
> Most likely dm or md.

it's probably evms+ubuntu+new kernel stupidity.

as far as i've determined, older (dapper) ubuntu had evms as part of the 
base package requirements.  evms was set up in some promiscuous manner 
where it goes and sets up dm linear maps for every damn partition and 
device.  nobody cared before because older kernels don't seem to stop 
people from using the underlying /dev/foo for md or mount or whatever. but 
newer kernels don't allow that any more.

it looks like they dropped the evms requirement in edgy -- probably 
because it has the newer kernel... but people who upgraded from dapper 
will run into this because they'll still have evms from before.


> I don't know much about dm - is there some 'pvlist' command or similar
> that will show all the phys volumes it is holding on to..

"dmsetup ls" does that...

for example on an ubuntu box i haven't fixed yet:

# dmsetup ls --tree
hda5 (253:1)
 `- (3:0)
hda1 (253:0)
 `- (3:0)

you can "dmsetup remove hda1" or whatever to get back the real device (but 
it's better to just remove or disable evms and reboot).

-dean



Re: why partition arrays?

2006-10-24 Thread dean gaudet
On Tue, 24 Oct 2006, Bill Davidsen wrote:

> My read on LVM is that (a) it's one more thing for the admin to learn, (b)
> because it's seldom used the admin will be working from documentation if it
> has a problem, and (c) there is no bug-free software, therefore the use of LVM
> on top of RAID will be less reliable than a RAID-only solution. I can't
> quantify that, the net effect may be too small to measure. However, the cost
> and chance of a finger check from (a) and (b) are significant.

this is essentially why i gave up on LVM as well.

add in the following tidbits:

- snapshots stopped working in 2.6.  may be fixed by now, but i gave up 
hope and this was the biggest feature i desired from LVM.

- it's way better for performance to have only one active filesystem on a 
group of spindles

- you can emulate pvmove with md superblockless raid1 sufficiently well 
for most purposes (although as we've discussed here it would be nice if md 
directly supported "proactive replacement")

and more i'm forgetting.

-dean


Re: Setting write-intent bitmap during array resync/create?

2006-10-10 Thread dean gaudet
On Tue, 10 Oct 2006, Eli Stair wrote:

> I gather this isn't currently possible, but I wonder if it's feasible to make
> it so?  This works fine once the array is marked 'clean', and I imagine it's
> simpler to just disallow the bitmap creation until it's in that state.  Would
> it be possible to allow creation of the bitmap by queueing the action until
> the array is done rebuilding?

why don't you add "-b internal" to the mdadm --create command line?  

-dean


Re: USB and raid... Device names change

2006-09-18 Thread dean gaudet
On Tue, 19 Sep 2006, Eduardo Jacob wrote:

> DEVICE /dev/raid111 /dev/raid121
> ARRAY /dev/md0 level=raid1 num-devices=2 
> UUID=1369e13f:eb4fa45c:6d4b9c2a:8196aa1b

try using "DEVICE partitions"... then "mdadm -As /dev/md0" will scan all 
available partitions for raid components with 
UUID=1369e13f:eb4fa45c:6d4b9c2a:8196aa1b.  so it won't matter which sdX 
they are.

-dean


Re: access *existing* array from knoppix

2006-09-12 Thread dean gaudet
On Tue, 12 Sep 2006, Dexter Filmore wrote:

> Am Dienstag, 12. September 2006 16:08 schrieb Justin Piszcz:
> > /dev/MAKEDEV /dev/md0
> >
> > also make sure the SW raid modules etc are loaded if necessary.
> 
> Won't work, MAKEDEV doesn't know how to create [/dev/]md0.

echo 'DEVICE partitions' >/tmp/mdadm.conf
mdadm --detail --scan --config=/tmp/mdadm.conf >>/tmp/mdadm.conf

take a look in /tmp/mdadm.conf ... your root array should be listed.

mdadm --assemble --config=/tmp/mdadm.conf --auto=yes /dev/md0

-dean


Re: Care and feeding of RAID?

2006-09-09 Thread dean gaudet
On Tue, 5 Sep 2006, Paul Waldo wrote:

> What about bitmaps?  Nobody has mentioned them.  It is my understanding that
> you just turn them on with "mdadm /dev/mdX -b internal".  Any caveats for
> this?

bitmaps have been working great for me on a raid5 and raid1.  it makes it 
that much more tolerable when i accidentally crash the box and don't have 
to wait forever for a resync.

i don't notice the extra write traffic all that much... under heavy 
traffic i see about 3 writes/s to the spare disk in the raid5 -- i assume 
those are all due to the bitmap in the superblock on the spare.

i've considered using an external bitmap, i forget why i didn't do that 
initially.  the filesystem on the raid5 already has an external journal on 
raid1.

-dean


Re: UUID's

2006-09-09 Thread dean gaudet
On Sat, 9 Sep 2006, Richard Scobie wrote:

> To remove all doubt about what is assembled where, I though going to:
> 
> DEVICE partitions
> MAILADDR root
> ARRAY /dev/md3 UUID=xyz etc.
> 
> would be more secure.
> 
> Is this correct thinking on my part?

yup.

mdadm can generate it all for you... there's an example on the man page.  
basically you just want to paste the output of "mdadm --detail --scan 
--config=partitions" into your mdadm.conf.

-dean


Re: UUID's

2006-09-08 Thread dean gaudet


On Sat, 9 Sep 2006, Richard Scobie wrote:

> If I have specified an array in mdadm.conf using UUID's:
> 
> ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371
> 
> and I replace a failed drive in the array, will the new drive be given the
> previous UUID, or do I need to upate the mdadm.conf entry?

once you do the "mdadm /dev/mdX -a /dev/newdrive" the new drive will have 
the UUID.  no need to update the mdadm.conf for the UUID...

however if you're using "DEVICE foo" where foo is not "partitions" then 
you should make sure foo includes the new drive.  ("DEVICE partitions" is 
recommended.)

-dean


Re: proactive-raid-disk-replacement

2006-09-08 Thread dean gaudet
On Fri, 8 Sep 2006, Michael Tokarev wrote:

> dean gaudet wrote:
> > On Fri, 8 Sep 2006, Michael Tokarev wrote:
> > 
> >> Recently Dean Gaudet, in thread titled 'Feature
> >> Request/Suggestion - "Drive Linking"', mentioned his
> >> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt
> >>
> >> I've read it, and have some umm.. concerns.  Here's why:
> >>
> >> 
> >>> mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
> 
> By the way, don't specify bitmap-chunk for internal bitmap.
> It's needed for file-based (external) bitmap.  With internal
> bitmap, we have fixed size in superblock for it, so bitmap-chunk
> is determined by dividing that size by size of the array.

yeah sorry that was with an older version of mdadm which didn't calculate 
the chunksize correctly for an internal bitmap on a large enough array... i 
should have mentioned that in the post.  it's fixed in newer mdadm.


> > my practice is to run regular SMART long self tests, which tend to find 
> > Current_Pending_Sectors (which are generally read errors waiting to 
> > happen) and then launch a "repair" sync action... that generally drops the 
> > Current_Pending_Sector back to zero.  either through a realloc or just 
> > simply rewriting the block.  if it's a realloc then i consider if there's 
> > enough of them to warrant replacing the disk...
> > 
> > so for me the chances of a read error while doing the raid1 thing aren't 
> > as high as they could be...
> 
> So the whole thing goes this way:
>   0) do a SMART selftest ;)
>   1) do repair for the whole array
>   2) copy data from failing to new drive
> (using temporary superblock-less array)
>   2a) if step 2 failed still, probably due to new bad sectors,
>   go the "old way", removing the failing drive and adding
>   new one.
> 
> That's 2x or 3x (or 4x counting the selftest, but that should be
> done regardless) more work than just going the "old way" from the
> beginning, but still some chances to have it completed flawlessly
> in 2 steps, without losing redundancy.

well it's more "work" but i don't actually manually launch the SMART 
tests, smartd does that.  i just notice when i get mail indicating 
Current_Pending_Sectors has gone up.

but i'm starting to lean towards SMART short tests (in case they test 
something i can't test with a full surface read) and regular crontabbed 
rate-limited repair or check actions.
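
by "rate-limited" i just mean something like this from a cron job (the 
number is invented -- it's KB/sec per device):

   echo 20000 > /sys/block/md4/md/sync_speed_max
   echo check > /sys/block/md4/md/sync_action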


> 2)  The same, but not offlining the array.  Hot-remove a drive, make copy
>of it to new drive, flip necessary bitmap bits, and re-add the new drive,
>and let raid code to resync changed (during copy, while the array was
>still active, something might has changed) and missing blocks.
> 
> This variant still loses redundancy, but not much of it, provided the bitmap
> code works correctly.


i like this method.  it yields the minimal disk copy time because there's
no competition with the live traffic... and you can recover if another
disk has errors while you're doing the copy.


> 3)  The same as your way, with the difference that we tell md to *skip* and
>   ignore possible errors during resync (which is also not possible currently).

maybe we could hand it a bitmap to record the errors in... so we could
merge it with the raid5 bitmap later.

still not really the best solution though, is it?

we really want a solution similar to raid10...

-dean


Re: proactive-raid-disk-replacement

2006-09-08 Thread dean gaudet
On Fri, 8 Sep 2006, Michael Tokarev wrote:

> Recently Dean Gaudet, in thread titled 'Feature
> Request/Suggestion - "Drive Linking"', mentioned his
> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt
> 
> I've read it, and have some umm.. concerns.  Here's why:
> 
> 
> > mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
> > mdadm /dev/md4 -r /dev/sdh1
> > mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
> > mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
> > mdadm /dev/md4 --re-add /dev/md5
> > mdadm /dev/md5 -a /dev/sdh1
> >
> > ... wait a few hours for md5 resync...
> 
> And here's the problem.  While new disk, sdh1, are resynced from
> old, probably failing disk sde1, chances are high that there will
> be an unreadable block on sde1.  And this means the whole thing
> will not work -- md5 initially contained one working drive (sde1)
> and one spare (sdh1) which is being converted (resynced) to working
> disk.  But after read error on sde1, md5 will contain one failed
> drive and one spare -- for raid1 it's fatal combination.
> 
> While at the same time, it's perfectly easy to reconstruct this
> failing block from other component devices of md4.

this statement is an argument for native support for this type of activity 
in md itself.

> That to say: this way of replacing disk in a software raid array
> isn't much better than just removing old drive and adding new one.

hmm... i'm not sure i agree.  in your proposal you're guaranteed to have 
no redundancy while you wait for the new disk to sync in the raid5.

in my proposal the probability that you'll retain redundancy through the 
entire process is non-zero.  we can debate how non-zero it is, but 
non-zero is greater than zero.

i'll admit it depends a heck of a lot on how long you wait to replace your 
disks, but i prefer to replace mine well before they get to the point 
where just reading the entire disk is guaranteed to result in problems.


> And if the drive you're replacing is failing (according to SMART
> for example), this method is more likely to fail.

my practice is to run regular SMART long self tests, which tend to find 
Current_Pending_Sectors (which are generally read errors waiting to 
happen) and then launch a "repair" sync action... that generally drops the 
Current_Pending_Sector back to zero.  either through a realloc or just 
simply rewriting the block.  if it's a realloc then i consider if there's 
enough of them to warrant replacing the disk...

so for me the chances of a read error while doing the raid1 thing aren't 
as high as they could be...

but yeah you've convinced me this solution isn't good enough.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Feature Request/Suggestion - "Drive Linking"

2006-09-04 Thread dean gaudet
On Mon, 4 Sep 2006, Bill Davidsen wrote:

> But I think most of the logic exists, the hardest part would be deciding what
> to do. The existing code looks as if it could be hooked to do this far more
> easily than writing new. In fact, several suggested recovery schemes involve
> stopping the RAID5, replacing the failing drive with a created RAID1, etc. So
> the method is valid, it would just be nice to have it happen without human
> intervention.

you don't actually have to stop the raid5 if you're using bitmaps... you 
can just remove the disk, create a (superblockless) raid1 and put the 
raid1 back in place.

the whole process could be handled a lot like mdadm handles spare groups 
already... there isn't a lot more kernel support required.

the largest problem is if a power failure occurs before the process 
finishes.  i'm 95% certain that even during a reconstruction, raid1 writes 
go to all copies even if the write is beyond the current sync position[1] 
-- so the raid5 superblock would definitely have been written to the 
partial disk... so that means on a reboot there'll be two disks which look 
like they're both the same (valid) component of the raid5, and one of them 
definitely isn't.

maybe there's some trick to handle this situation -- aside from ensuring 
the array won't come up automatically on reboot until after the process 
has finished.

one way to handle it would be to have an option for raid1 resync which 
suppresses writes which are beyond the resync position... then you could 
zero the new disk superblock to start with, and then start up the resync 
-- then it won't have a valid superblock until the entire disk is copied.

-dean

[1] there's normally a really good reason for raid1 to mirror all writes 
even if they're beyond the resync point... consider the case where you 
have a system crash and have 2 essentially identical mirrors which then 
need a resync... and the source disk dies during the resync.

if all writes have been mirrored then the other disk is already useable 
(in fact it's essentially arbitrary which of the mirrors was used for the 
resync source after the crash -- they're all equally (un)likely to have 
the most current data)... without bitmaps this sort of thing is a common 
scenario and certainly saved my data more than once.
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID-5 recovery

2006-09-03 Thread dean gaudet
On Sun, 3 Sep 2006, Clive Messer wrote:

> This leads me to a question. I understand from reading the linux-raid 
> archives 
> that the current behaviour when rebuilding with a single badblock on another 
> disk is for that disk to also be kicked from the array.

that's not quite the current behaviour.  since 2.6.14 or .15 or so md will 
reconstruct bad blocks from other disks and try writing them.  it's only 
when this fails repeatedly that it knocks the disk out of the array.

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Resize on dirty array?

2006-08-30 Thread dean gaudet
On Sun, 13 Aug 2006, dean gaudet wrote:

> On Fri, 11 Aug 2006, David Rees wrote:
> 
> > On 8/11/06, dean gaudet <[EMAIL PROTECTED]> wrote:
> > > On Fri, 11 Aug 2006, David Rees wrote:
> > > 
> > > > On 8/10/06, dean gaudet <[EMAIL PROTECTED]> wrote:
> > > > > - set up smartd to run long self tests once a month.   (stagger it 
> > > > > every
> > > > >   few days so that your disks aren't doing self-tests at the same 
> > > > > time)
> > > >
> > > > I personally prefer to do a long self-test once a week, a month seems
> > > > like a lot of time for something to go wrong.
> > > 
> > > unfortunately i found some drives (seagate 400 pata) had a rather negative
> > > effect on performance while doing self-test.
> > 
> > Interesting that you noted negative performance, but I typically
> > schedule the tests for off-hours anyway where performance isn't
> > critical.
> > 
> > How much of a performance hit did you notice?
> 
> i never benchmarked it explicitly.  iirc the problem was generally 
> metadata performance... and became less of an issue when i moved the 
> filesystem log off the raid5 onto a raid1.  unfortunately there aren't 
> really any "off hours" for this system.

the problem reappeared... so i can provide some data.  one of the 400GB 
seagates has been stuck at 20% of a SMART long self test for over 2 days 
now, and the self-test itself has been going for about 4.5 days total.

a typical "iostat -x /dev/sd[cdfgh] 30" sample looks like this:

Device:    rrqm/s   wrqm/s    r/s    w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdc         90.94   137.52  14.70  25.76   841.32  1360.35     54.43      0.94   23.30  10.30  41.68
sdd         93.67   140.52  14.96  22.06   863.98  1354.75     59.93      0.91   24.50  12.17  45.05
sdf         92.84   136.85  15.36  26.39   857.85  1360.35     53.13      0.88   21.04  10.59  44.21
sdg         87.74   137.82  14.23  24.86   807.73  1355.55     55.35      0.85   21.86  11.25  43.99
sdh         87.20   134.56  14.96  28.29   810.13  1356.88     50.10      1.90   43.72  20.02  86.60

those 5 are in a raid5, so their io should be relatively even... notice 
the await, svctm and %util of sdh compared to the other 4.  sdh is the one 
with the exceptionally slow going SMART long self-test.  i assume it's 
still making progress because the effect is measurable in iostat.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Feature Request/Suggestion - "Drive Linking"

2006-08-29 Thread dean gaudet
On Wed, 30 Aug 2006, Neil Bortnak wrote:

> Hi Everybody,
> 
> I had this major recovery last week after a hardware failure monkeyed
> things up pretty badly. About half way though I had a couple of ideas
> and I thought I'd suggest/ask them.
> 
> 1) "Drive Linking": So let's say I have a 6 disk RAID5 array and I have
> reason to believe one of the drives will fail (funny noises, SMART
> warnings or it's *really* slow compared to the other drives, etc). It
> would be nice to put in a new drive, link it to the failing disk so that
> it copies all of the data to the new one and mirrors new writes as they
> happen.

http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

works for any raid level actually.


> 2) This sort of brings up a subject I'm getting increasingly paranoid
> about. It seems to me that if disk 1 develops a unrecoverable error at
> block 500 and disk 4 develops one at 55,000 I'm going to get a double
> disk failure as soon as one of the bad blocks is read (or some other
> system problem ->makes it look like<- some random block is
> unrecoverable). Such an error should not bring the whole thing to a
> crashing halt. I know I can recover from that sort of error manually,
> but yuk.

Neil made some improvements in this area as of 2.6.15... when md gets a 
read error it won't knock the entire drive out immediately -- it first 
attempts to reconstruct the sectors from the other drives and write them 
back.  this covers a lot of the failure cases because the drive will 
either successfully complete the write in-place, or use its reallocation 
pool.  the kernel logs when it makes such a correction (but the log wasn't 
very informative until 2.6.18ish i think).

if you watch SMART data (either through smartd logging changes for you, or 
if you diff the output regularly) you can see this activity happen as 
well.

you can also use the check/repair sync_actions to force this to happen 
when you know a disk has a Current_Pending_Sector (i.e. pending read 
error).
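
something along these lines, for example (sketch only -- md4 and sdh are placeholders):

# does the drive have pending (unreadable) sectors?
smartctl -A /dev/sdh | grep Current_Pending_Sector

# if so, make md read every block and rewrite anything it can't read
# (reconstructed from the other members):
echo repair > /sys/block/md4/md/sync_action
cat /proc/mdstat        # watch the repair progress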

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is mdadm --create safe for existing arrays ?

2006-08-16 Thread dean gaudet
On Wed, 16 Aug 2006, Peter Greis wrote:

> So, how do I change / and /boot to make the super
> blocks persistent ? Is it safe to run "mdadm --create
> /dev/md0 --raid-devices=2 --level=1 /dev/sda1
> /dev/sdb1" without losing any data?

boot a rescue disk

shrink the filesystems by a few MB to accommodate the superblock

mdadm --create /dev/md0 --raid-devices=2 --level=1 /dev/sda1 missing
mdadm /dev/md0 -a /dev/sdb1

grow the filesystem

you could probably get away with an --assume-clean and no resync if you 
know the array is clean... just don't forget to shrink/grow the 
filesystem.
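
the whole dance is roughly this (untested sketch -- assumes ext3 and that a
few MB is enough for the 0.90 superblock; the "180G" size is just a
placeholder, pick something slightly below your partition size):

e2fsck -f /dev/sda1           # resize2fs wants a freshly checked fs
resize2fs /dev/sda1 180G      # shrink a few MB below the partition size
mdadm --create /dev/md0 --raid-devices=2 --level=1 /dev/sda1 missing
mdadm /dev/md0 -a /dev/sdb1
resize2fs /dev/md0            # grow back to fill the md device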

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Resize on dirty array?

2006-08-13 Thread dean gaudet
On Fri, 11 Aug 2006, David Rees wrote:

> On 8/11/06, dean gaudet <[EMAIL PROTECTED]> wrote:
> > On Fri, 11 Aug 2006, David Rees wrote:
> > 
> > > On 8/10/06, dean gaudet <[EMAIL PROTECTED]> wrote:
> > > > - set up smartd to run long self tests once a month.   (stagger it every
> > > >   few days so that your disks aren't doing self-tests at the same time)
> > >
> > > I personally prefer to do a long self-test once a week, a month seems
> > > like a lot of time for something to go wrong.
> > 
> > unfortunately i found some drives (seagate 400 pata) had a rather negative
> > effect on performance while doing self-test.
> 
> Interesting that you noted negative performance, but I typically
> schedule the tests for off-hours anyway where performance isn't
> critical.
> 
> How much of a performance hit did you notice?

i never benchmarked it explicitly.  iirc the problem was generally 
metadata performance... and became less of an issue when i moved the 
filesystem log off the raid5 onto a raid1.  unfortunately there aren't 
really any "off hours" for this system.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Resize on dirty array?

2006-08-10 Thread dean gaudet
suggestions:

- set up smartd to run long self tests once a month.   (stagger it every 
  few days so that your disks aren't doing self-tests at the same time)

- run 2.6.15 or later so md supports repairing read errors from the other 
  drives...

- run 2.6.16 or later so you get the check and repair sync_actions in
  /sys/block/mdX/md/sync_action (i think 2.6.16.x still has a bug where
  you have to echo a random word other than repair to sync_action to get
  a repair to start... wrong sense on a strcmp, fixed in 2.6.17).

- run nightly diffs of smartctl -a output on all your drives so you see 
  when one of them reports problems in the smart self test or otherwise
  has a Current_Pending_Sectors or Realloc event... then launch a
  repair sync_action.  (a rough sketch of such a diff script is below.)

- proactively replace your disks every couple years (i prefer to replace 
  busy disks before 3 years).
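
something like this for the nightly diff (untested sketch -- the device
list, state directory and mail recipient are all placeholders, and it
assumes a working local MTA):

#!/bin/sh
dir=/var/lib/smart-diff
mkdir -p $dir
for d in /dev/sd[a-h]; do
    name=$(basename $d)
    smartctl -a $d > $dir/$name.new
    # mail the diff only when something actually changed since last night
    if [ -f $dir/$name.old ] && ! diff -u $dir/$name.old $dir/$name.new > $dir/$name.diff; then
        mail -s "SMART changes on $d" root < $dir/$name.diff
    fi
    mv $dir/$name.new $dir/$name.old
done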

-dean

On Wed, 9 Aug 2006, James Peverill wrote:

> 
> In this case the raid WAS the backup... however it seems it turned out to be
> less reliable than the single disks it was supporting.  In the future I think
> I'll make sure my disks have varying ages so they don't fail all at once.
> 
> James
> 
> > > RAID is no excuse for backups.
> PS: 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Converting Ext3 to Ext3 under RAID 1

2006-08-02 Thread dean gaudet
On Wed, 2 Aug 2006, Dan Graham wrote:

> Hello;
>   I have an existing, active ext3 filesystem which I would like to convert to
> a RAID 1 ext3 filesystem with minimal down time.  After casting about the web
> and experimenting some on a test system, I believe that I can accomplish this
> in the following manner.
> 
>   - Dismount the filesystem.
>   - Shrink the filesystem to leave room for the RAID superblock at the end
> while leaving the partition size untouched (shrinking by 16 blocks seems
> to work)
>   - Create a degraded array with only the partition carrying the shrunk ext3
> system
>   - start the array and mount the array.
>   - hot add the mirroring partitions.
> 
> The questions I have for those who know Linux-Raid better than I.
> 
>Is this scheme even half-way sane?

yes

>Is 16 blocks a large enough area?

i always err on the side of caution and take a few meg off then resize it 
back up to full size after creating the degraded raid1.  (hmm maybe mdadm 
has some way to tell you how large the resulting partitions would be... 
i've never looked.)

you pretty much have to do this all using a recovery or live CD...

don't forget to rebuild your initrds... all of them including older 
kernels... otherwise there could be one of them still mounting the 
filesystem without using the md device name (and destroying integrity).

don't forget to set the second disk boot partition active and install grub 
so that you can boot from it when the first fails... (after you've 
mirrored the boot or root partition).
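
on a debian box that boils down to roughly this (untested sketch -- assumes
initramfs-tools; with yaird you'd dpkg-reconfigure the kernel packages
instead):

# rebuild the initrds for every installed kernel so they assemble md0 at boot
update-initramfs -u -k all

# mark the second disk's boot partition active (fdisk's 'a' command), then
# put grub on its mbr so the box still boots when the first disk dies
grub-install /dev/sdb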

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Still can't get md arrays that were started from an initrd to shutdown

2006-07-17 Thread dean gaudet
On Mon, 17 Jul 2006, Christian Pernegger wrote:

> The problem seems to affect only arrays that are started via an
> initrd, even if they do not have the root filesystem on them.
> That's all arrays if they're either managed by EVMS or the
> ramdisk-creator is initramfs-tools. For yaird-generated initrds only
> the array with root on it is affected.

with lvm you have to stop lvm before you can stop the arrays... i wouldn't 
be surprised if evms has the same issue... of course this *should* happen 
cleanly on shutdown assuming evms is also being shutdown... but maybe that 
gives you something to look for.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: proactive raid5 disk replacement success (using bitmap + raid1)

2006-06-22 Thread dean gaudet
well that part is optional... i wasn't replacing the disk right away 
anyhow -- it had just exhibited its first surface error during SMART and i 
thought i'd try moving the data elsewhere just for the experience of it.

-dean

On Thu, 22 Jun 2006, Ming Zhang wrote:

> Hi Dean
> 
> Thanks a lot for sharing this.
> 
> I don't quite understand these 2 commands. Why do we want to add a
> pre-failing disk back to md4?
> 
> mdadm --zero-superblock /dev/sde1
> mdadm /dev/md4 -a /dev/sde1
> 
> Ming
> 
> 
> On Sun, 2006-04-23 at 18:40 -0700, dean gaudet wrote:
> > i had a disk in a raid5 which i wanted to clone onto the hot spare... 
> > without going offline and without long periods without redundancy.  a few 
> > folks have discussed using bitmaps and temporary (superblockless) raid1 
> > mappings to do this... i'm not sure anyone has tried / reported success 
> > though.  this is my success report.
> > 
> > setup info:
> > 
> > - kernel version 2.6.16.9 (as packaged by debian)
> > - mdadm version 2.4.1
> > - /dev/md4 is the raid5
> > - /dev/sde1 is the disk in md4 i want to clone from
> > - /dev/sdh1 is the hot spare from md4, and is the clone target
> > - /dev/md5 is an unused md device name
> > 
> > here are the exact commands i issued:
> > 
> > mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
> > mdadm /dev/md4 -r /dev/sdh1
> > mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
> > mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
> > mdadm /dev/md4 --re-add /dev/md5
> > mdadm /dev/md5 -a /dev/sdh1
> > 
> > ... wait a few hours for md5 resync...
> > 
> > mdadm /dev/md4 -f /dev/md5 -r /dev/md5
> > mdadm --stop /dev/md5
> > mdadm /dev/md4 --re-add /dev/sdh1
> > mdadm --zero-superblock /dev/sde1
> > mdadm /dev/md4 -a /dev/sde1
> > 
> > this sort of thing shouldn't be hard to script :)
> > 
> > the only times i was without full redundancy was briefly between the "-r" 
> > and "--re-add" commands... and with bitmap support the raid5 resync for 
> > each of those --re-adds was essentially zero.
> > 
> > thanks Neil (and others)!
> > 
> > -dean
> > 
> > p.s. it's absolutely necessary to use "--build" for the temporary raid1 
> > ... if you use --create mdadm will rightfully tell you it's already a raid 
> > component and if you --force it then you'll trash the raid5 superblock and 
> > it won't fit into the raid5 any more...
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to [EMAIL PROTECTED]
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-06-13 Thread dean gaudet
On Tue, 13 Jun 2006, Bill Davidsen wrote:

> Neil Brown wrote:
> 
> > On Friday June 2, [EMAIL PROTECTED] wrote:
> >  
> > > On Thu, 1 Jun 2006, Neil Brown wrote:
> > > 
> > >
> > > > I've got one more long-shot I would like to try first.  If you could
> > > > backout that change to ll_rw_block, and apply this patch instead.
> > > > Then when it hangs, just cat the stripe_cache_active file and see if
> > > > that unplugs things or not (cat it a few times).
> > > >  
> > > nope that didn't unstick it... i had to raise stripe_cache_size (from 256
> > > to 768... 512 wasn't enough)...
> > > 
> > > -dean
> > >
> > 
> > Ok, thanks.
> > I still don't know what is really going on, but I'm 99.9863% sure this
> > will fix it, and is a reasonable thing to do.
> > (Yes, I lose a ';'.  That is deliberate).
> > 
> > Please let me know what this proves, and thanks again for your
> > patience.
> > 
> > NeilBrown
> > 
> [...snip...]
> 
> Will that fix be in 2.6.17?

also -- is it appropriate enough for 2.6.16.x?

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-06-10 Thread dean gaudet
On Fri, 2 Jun 2006, Neil Brown wrote:

> On Friday June 2, [EMAIL PROTECTED] wrote:
> > On Thu, 1 Jun 2006, Neil Brown wrote:
> > 
> > > I've got one more long-shot I would like to try first.  If you could
> > > backout that change to ll_rw_block, and apply this patch instead.
> > > Then when it hangs, just cat the stripe_cache_active file and see if
> > > that unplugs things or not (cat it a few times).
> > 
> > nope that didn't unstick it... i had to raise stripe_cache_size (from 256 
> > to 768... 512 wasn't enough)...
> > 
> > -dean
> 
> Ok, thanks.
> I still don't know what is really going on, but I'm 99.9863% sure this
> will fix it, and is a reasonable thing to do.
> (Yes, I lose a ';'.  That is deliberate).

it's been running for a week now... and the freeze hasn't occurred... it's 
possible the circumstances for reproducing it haven't occurred again 
either, but i haven't really changed my disk usage behaviour so it's 
probably fixed.

let me know if you come up with some other solution you'd like tested.

thanks
-dean


> 
> Please let me know what this proves, and thanks again for your
> patience.
> 
> NeilBrown
> 
> 
> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> 
> ### Diffstat output
>  ./drivers/md/raid5.c |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
> --- ./drivers/md/raid5.c~current~ 2006-05-28 21:56:56.0 +1000
> +++ ./drivers/md/raid5.c  2006-06-02 17:24:07.0 +1000
> @@ -285,7 +285,7 @@ static struct stripe_head *get_active_st
>                                      < (conf->max_nr_stripes *3/4)
>                                      || !conf->inactive_blocked),
>                                     conf->device_lock,
> -                                   unplug_slaves(conf->mddev);
> +                                   raid5_unplug_device(conf->mddev->queue)
>                                     );
>                   conf->inactive_blocked = 0;
>           } else
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-06-02 Thread dean gaudet
On Thu, 1 Jun 2006, Neil Brown wrote:

> I've got one more long-shot I would like to try first.  If you could
> backout that change to ll_rw_block, and apply this patch instead.
> Then when it hangs, just cat the stripe_cache_active file and see if
> that unplugs things or not (cat it a few times).

nope that didn't unstick it... i had to raise stripe_cache_size (from 256 
to 768... 512 wasn't enough)...

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-30 Thread dean gaudet
On Wed, 31 May 2006, Neil Brown wrote:

> On Tuesday May 30, [EMAIL PROTECTED] wrote:
> > 
> > actually i think the rate is higher... i'm not sure why, but klogd doesn't 
> > seem to keep up with it:
> > 
> > [EMAIL PROTECTED]:~# grep -c kblockd_schedule_work /var/log/messages
> > 31
> > [EMAIL PROTECTED]:~# dmesg | grep -c kblockd_schedule_work
> > 8192
> 
> # grep 'last message repeated' /var/log/messages
> ??

um hi, of course :)  the paste below is approximately correct.

-dean

[EMAIL PROTECTED]:~# egrep 'kblockd_schedule_work|last message repeated' 
/var/log/messages
May 30 17:05:09 localhost kernel: kblockd_schedule_work failed
May 30 17:05:59 localhost kernel: kblockd_schedule_work failed
May 30 17:08:16 localhost kernel: kblockd_schedule_work failed
May 30 17:10:51 localhost kernel: kblockd_schedule_work failed
May 30 17:11:51 localhost kernel: kblockd_schedule_work failed
May 30 17:12:46 localhost kernel: kblockd_schedule_work failed
May 30 17:12:56 localhost last message repeated 22 times
May 30 17:14:14 localhost kernel: kblockd_schedule_work failed
May 30 17:16:57 localhost kernel: kblockd_schedule_work failed
May 30 17:17:00 localhost last message repeated 83 times
May 30 17:17:02 localhost kernel: kblockd_schedule_work failed
May 30 17:17:33 localhost last message repeated 950 times
May 30 17:18:34 localhost last message repeated 2218 times
May 30 17:19:35 localhost last message repeated 1581 times
May 30 17:20:01 localhost last message repeated 579 times
May 30 17:20:02 localhost kernel: kblockd_schedule_work failed
May 30 17:20:02 localhost kernel: kblockd_schedule_work failed
May 30 17:20:02 localhost kernel: kblockd_schedule_work failed
May 30 17:20:02 localhost last message repeated 23 times
May 30 17:20:03 localhost kernel: kblockd_schedule_work failed
May 30 17:20:34 localhost last message repeated 1058 times
May 30 17:21:35 localhost last message repeated 2171 times
May 30 17:22:36 localhost last message repeated 2305 times
May 30 17:23:37 localhost last message repeated 2311 times
May 30 17:24:38 localhost last message repeated 1993 times
May 30 17:25:01 localhost last message repeated 702 times
May 30 17:25:02 localhost kernel: kblockd_schedule_work failed
May 30 17:25:02 localhost last message repeated 15 times
May 30 17:25:02 localhost kernel: kblockd_schedule_work failed
May 30 17:25:02 localhost last message repeated 12 times
May 30 17:25:03 localhost kernel: kblockd_schedule_work failed
May 30 17:25:34 localhost last message repeated 1061 times
May 30 17:26:35 localhost last message repeated 2009 times
May 30 17:27:36 localhost last message repeated 1941 times
May 30 17:28:37 localhost last message repeated 2345 times
May 30 17:29:38 localhost last message repeated 2367 times
May 30 17:30:01 localhost last message repeated 870 times
May 30 17:30:01 localhost kernel: kblockd_schedule_work failed
May 30 17:30:01 localhost last message repeated 45 times
May 30 17:30:02 localhost kernel: kblockd_schedule_work failed
May 30 17:30:33 localhost last message repeated 1180 times
May 30 17:31:34 localhost last message repeated 2062 times
May 30 17:32:34 localhost last message repeated 2277 times
May 30 17:32:36 localhost kernel: kblockd_schedule_work failed
May 30 17:33:07 localhost last message repeated 1114 times
May 30 17:34:08 localhost last message repeated 2308 times
May 30 17:35:01 localhost last message repeated 1941 times
May 30 17:35:01 localhost kernel: kblockd_schedule_work failed
May 30 17:35:02 localhost last message repeated 20 times
May 30 17:35:02 localhost kernel: kblockd_schedule_work failed
May 30 17:35:33 localhost last message repeated 1051 times
May 30 17:36:34 localhost last message repeated 2002 times
May 30 17:37:35 localhost last message repeated 1644 times
May 30 17:38:36 localhost last message repeated 1731 times
May 30 17:39:37 localhost last message repeated 1844 times
May 30 17:40:01 localhost last message repeated 817 times
May 30 17:40:02 localhost kernel: kblockd_schedule_work failed
May 30 17:40:02 localhost last message repeated 39 times
May 30 17:40:02 localhost kernel: kblockd_schedule_work failed
May 30 17:40:02 localhost last message repeated 12 times
May 30 17:40:03 localhost kernel: kblockd_schedule_work failed
May 30 17:40:34 localhost last message repeated 1051 times
May 30 17:41:35 localhost last message repeated 1576 times
May 30 17:42:36 localhost last message repeated 2000 times
May 30 17:43:37 localhost last message repeated 2058 times
May 30 17:44:15 localhost last message repeated 1337 times
May 30 17:44:15 localhost kernel: kblockd_schedule_work failed
May 30 17:44:46 localhost last message repeated 1016 times
May 30 17:45:01 localhost last message repeated 432 times
May 30 17:45:02 localhost kernel: kblockd_schedule_work failed
May 30 17:45:02 localhost kernel: kblockd_schedule_work failed
May 30 17:45:33 localhost last message repeated 1229 times
May 30 17:46:34 localhost last message repeated 2552 times
May 30 17:47:36 localhost la

Re: raid5 hang on get_active_stripe

2006-05-30 Thread dean gaudet
On Wed, 31 May 2006, Neil Brown wrote:

> On Tuesday May 30, [EMAIL PROTECTED] wrote:
> > On Tue, 30 May 2006, Neil Brown wrote:
> > 
> > > Could you try this patch please?  On top of the rest.
> > > And if it doesn't fail in a couple of days, tell me how regularly the
> > > message 
> > >kblockd_schedule_work failed
> > > gets printed.
> > 
> > i'm running this patch now ... and just after reboot, no freeze yet, i've 
> > already seen a handful of these:
> > 
> > May 30 17:05:09 localhost kernel: kblockd_schedule_work failed
> > May 30 17:05:59 localhost kernel: kblockd_schedule_work failed
> > May 30 17:08:16 localhost kernel: kblockd_schedule_work failed
> > May 30 17:10:51 localhost kernel: kblockd_schedule_work failed
> > May 30 17:11:51 localhost kernel: kblockd_schedule_work failed
> > May 30 17:12:46 localhost kernel: kblockd_schedule_work failed
> > May 30 17:14:14 localhost kernel: kblockd_schedule_work failed
> 
> 1 every minute or so.  That's probably more than I would have
> expected, but strongly lends evidence to the theory that this is the
> problem.

actually i think the rate is higher... i'm not sure why, but klogd doesn't 
seem to keep up with it:

[EMAIL PROTECTED]:~# grep -c kblockd_schedule_work /var/log/messages
31
[EMAIL PROTECTED]:~# dmesg | grep -c kblockd_schedule_work
8192

i don't have CONFIG_PRINTK_TIME=y ... so i can't read timestamps from 
dmesg.

but cool!  if the dmesg spam seems to be a problem i can just comment it 
out of the patch...

i'll let you know if it freezes again.

thanks
-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-30 Thread dean gaudet
On Tue, 30 May 2006, Neil Brown wrote:

> Could you try this patch please?  On top of the rest.
> And if it doesn't fail in a couple of days, tell me how regularly the
> message 
>kblockd_schedule_work failed
> gets printed.

i'm running this patch now ... and just after reboot, no freeze yet, i've 
already seen a handful of these:

May 30 17:05:09 localhost kernel: kblockd_schedule_work failed
May 30 17:05:59 localhost kernel: kblockd_schedule_work failed
May 30 17:08:16 localhost kernel: kblockd_schedule_work failed
May 30 17:10:51 localhost kernel: kblockd_schedule_work failed
May 30 17:11:51 localhost kernel: kblockd_schedule_work failed
May 30 17:12:46 localhost kernel: kblockd_schedule_work failed
May 30 17:14:14 localhost kernel: kblockd_schedule_work failed

-dean

> 
> Thanks,
> NeilBrown
> 
> 
> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> 
> ### Diffstat output
>  ./block/ll_rw_blk.c |6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff ./block/ll_rw_blk.c~current~ ./block/ll_rw_blk.c
> --- ./block/ll_rw_blk.c~current~  2006-05-30 09:48:02.0 +1000
> +++ ./block/ll_rw_blk.c   2006-05-30 09:48:48.0 +1000
> @@ -1636,7 +1636,11 @@ static void blk_unplug_timeout(unsigned 
>  {
>          request_queue_t *q = (request_queue_t *)data;
>  
> -        kblockd_schedule_work(&q->unplug_work);
> +        if (!kblockd_schedule_work(&q->unplug_work)) {
> +                /* failed to schedule the work, try again later */
> +                printk("kblockd_schedule_work failed\n");
> +                mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
> +        }
>  }
>  
>  /**
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-29 Thread dean gaudet
On Sun, 28 May 2006, Neil Brown wrote:

> The following patch adds some more tracing to raid5, and might fix a
> subtle bug in ll_rw_blk, though it is an incredible long shot that
> this could be affecting raid5 (if it is, I'll have to assume there is
> another bug somewhere).   It certainly doesn't break ll_rw_blk.
> Whether it actually fixes something I'm not sure.
> 
> If you could try with these on top of the previous patches I'd really
> appreciate it.
> 
> When you read from /stripe_cache_active, it should trigger a
> (cryptic) kernel message within the next 15 seconds.  If I could get
> the contents of that file and the kernel messages, that should help.

got the hang again... attached is the dmesg with the cryptic messages.  i 
didn't think to grab the task dump this time though.

hope there's a clue in this one :)  but send me another patch if you need 
more data.

-dean

neemlark:/sys/block/md4/md# cat stripe_cache_size 
256
neemlark:/sys/block/md4/md# cat stripe_cache_active 
251
0 preread
plugged
bitlist=0 delaylist=251
neemlark:/sys/block/md4/md# cat stripe_cache_active 
251
0 preread
plugged
bitlist=0 delaylist=251
neemlark:/sys/block/md4/md# echo 512 >stripe_cache_size 
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
292 preread
not plugged
bitlist=0 delaylist=32
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
292 preread
not plugged
bitlist=0 delaylist=32
neemlark:/sys/block/md4/md# cat stripe_cache_active
445
0 preread
not plugged
bitlist=0 delaylist=73
neemlark:/sys/block/md4/md# cat stripe_cache_active
480
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
413
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
13
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
493
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
487
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
405
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
1 preread
not plugged
bitlist=0 delaylist=28
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
84 preread
not plugged
bitlist=0 delaylist=69
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
69 preread
not plugged
bitlist=0 delaylist=56
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
41 preread
not plugged
bitlist=0 delaylist=38
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
10 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
453
3 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
480
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
14 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
477
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
476
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
486
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
480
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
384
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
387
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
462
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
480
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
448
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
501
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
476
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
416
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
386
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
434
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
406
0 preread
not plugged
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
447
0 preread
not plugged

Re: [PATCH] mdadm 2.5 (Was: ANNOUNCE: mdadm 2.5 - A tool for managing Soft RAID under Linux)

2006-05-28 Thread dean gaudet
On Sun, 28 May 2006, Luca Berra wrote:

> dietlibc rand() and random() are the same function.
> but random will throw a warning saying it is deprecated.

that's terribly obnoxious... it's never going to be deprecated, there are 
only approximately a bazillion programs using random().

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mdadm 2.5 (Was: ANNOUNCE: mdadm 2.5 - A tool for managing Soft RAID under Linux)

2006-05-28 Thread dean gaudet
On Sun, 28 May 2006, Luca Berra wrote:

> - mdadm-2.5-rand.patch
> Posix dictates rand() versus bsd random() function, and dietlibc
> deprecated random(), so switch to srand()/rand() and make everybody
> happy.

fwiw... lots of rand()s tend to suck... and RAND_MAX may not be large 
enough for this use.  glibc rand() is the same as random().  do you know 
if dietlibc's rand() is good enough?

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-27 Thread dean gaudet
On Sat, 27 May 2006, Neil Brown wrote:

> Thanks.  This narrows it down quite a bit... too much infact:  I can
> now say for sure that this cannot possible happen :-)
> 
>   2/ The message.gz you sent earlier with the
>   echo t > /proc/sysrq-trigger
>  trace in it didn't contain information about md4_raid5 - the 

got another hang again this morning... full dmesg output attached.

-dean

neemlark:/sys/block/md4/md# cat stripe_cache_active
248
0 preread
bitlist=0 delaylist=248
neemlark:/sys/block/md4/md# cat stripe_cache_active
248
0 preread
bitlist=0 delaylist=248
neemlark:/sys/block/md4/md# cat stripe_cache_active
248
0 preread
bitlist=0 delaylist=248
neemlark:/sys/block/md4/md# cat stripe_cache_size
256
neemlark:/sys/block/md4/md# echo 512 >!$
echo 512 >stripe_cache_size
neemlark:/sys/block/md4/md# cat stripe_cache_active
511
254 preread
bitlist=0 delaylist=199
neemlark:/sys/block/md4/md# cat stripe_cache_active
511
148 preread
bitlist=0 delaylist=199
neemlark:/sys/block/md4/md# cat stripe_cache_active
435
95 preread
bitlist=0 delaylist=199
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
11 preread
bitlist=0 delaylist=327
neemlark:/sys/block/md4/md# cat stripe_cache_active
511
11 preread
bitlist=0 delaylist=327
neemlark:/sys/block/md4/md# cat stripe_cache_active
494
359 preread
bitlist=0 delaylist=127
neemlark:/sys/block/md4/md# cat stripe_cache_active
191
67 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
272 preread
bitlist=0 delaylist=175
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
32 preread
bitlist=0 delaylist=317
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
398 preread
bitlist=0 delaylist=114
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
398 preread
bitlist=0 delaylist=114
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
398 preread
bitlist=0 delaylist=114
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
17 preread
bitlist=0 delaylist=265
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
17 preread
bitlist=0 delaylist=265
neemlark:/sys/block/md4/md# cat stripe_cache_active
442
124 preread
bitlist=0 delaylist=3
neemlark:/sys/block/md4/md# cat stripe_cache_active
127
0 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
154 preread
bitlist=0 delaylist=235
neemlark:/sys/block/md4/md# cat stripe_cache_active
389
321 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
162 preread
bitlist=0 delaylist=133
neemlark:/sys/block/md4/md# cat stripe_cache_active
385
24 preread
bitlist=0 delaylist=142
neemlark:/sys/block/md4/md# cat stripe_cache_active
109
3 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
0
0 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
0
0 preread
bitlist=0 delaylist=0

dmesg.gz
Description: Binary data


Re: raid5 hang on get_active_stripe

2006-05-26 Thread dean gaudet
On Sat, 27 May 2006, Neil Brown wrote:

> On Friday May 26, [EMAIL PROTECTED] wrote:
> > On Tue, 23 May 2006, Neil Brown wrote:
> > 
> > i applied them against 2.6.16.18 and two days later i got my first hang... 
> > below is the stripe_cache foo.
> > 
> > thanks
> > -dean
> > 
> > neemlark:~# cd /sys/block/md4/md/
> > neemlark:/sys/block/md4/md# cat stripe_cache_active 
> > 255
> > 0 preread
> > bitlist=0 delaylist=255
> > neemlark:/sys/block/md4/md# cat stripe_cache_active 
> > 255
> > 0 preread
> > bitlist=0 delaylist=255
> > neemlark:/sys/block/md4/md# cat stripe_cache_active 
> > 255
> > 0 preread
> > bitlist=0 delaylist=255
> 
> Thanks.  This narrows it down quite a bit... too much infact:  I can
> now say for sure that this cannot possible happen :-)

heheh.  fwiw the box has traditionally been rock solid.. it's ancient 
though... dual p3 750 w/440bx chipset and pc100 ecc memory... 3ware 7508 
w/seagate 400GB disks... i really don't suspect the hardware all that much 
because the freeze seems to be rather consistent as to time of day 
(overnight while i've got 3x rdiff-backup, plus bittorrent, plus updatedb 
going).  unfortunately it doesn't happen every time... but every time i've 
unstuck the box i've noticed those processes going.

other tidbits... md4 is a lvm2 PV ... there are two LVs, one with ext3
and one with xfs.


> Two things that might be helpful:
>   1/ Do you have any other patches on 2.6.16.18 other than the 3 I
> sent you?  If you do I'd like to see them, just in case.

it was just 2.6.16.18 plus the 3 you sent... i attached the .config
(it's rather full -- based off debian kernel .config).

maybe there's a compiler bug:

gcc version 4.0.4 20060507 (prerelease) (Debian 4.0.3-3)


>   2/ The message.gz you sent earlier with the
>   echo t > /proc/sysrq-trigger
>  trace in it didn't contain information about md4_raid5 - the 
>  controlling thread for that array.  It must have missed out
>  due to a buffer overflowing.  Next time it happens, could you
>  to get this trace again and see if you can find out what
>  what md4_raid5 is going.  Maybe do the 'echo t' several times.
>  I think that you need a kernel recompile to make the dmesg
>  buffer larger.

ok i'll set CONFIG_LOG_BUF_SHIFT=18 and rebuild ...

note that i'm going to include two more patches in this next kernel:

http://lkml.org/lkml/2006/5/23/42
http://arctic.org/~dean/patches/linux-2.6.16.5-no-treason.patch

the first was the Jens Axboe patch you mentioned here recently (for
accounting with i/o barriers)... and the second gets rid of the tcp
treason uncloaked messages.


> Thanks for your patience - this must be very frustrating for you.

fortunately i'm the primary user of this box... and the bug doesn't
corrupt anything... and i can unstick it easily :)  so it's not all that
frustrating actually.

-dean

config.gz
Description: Binary data


Re: raid5 hang on get_active_stripe

2006-05-26 Thread dean gaudet
On Tue, 23 May 2006, Neil Brown wrote:

> I've spent all morning looking at this and while I cannot see what is
> happening I did find a couple of small bugs, so that is good...
> 
> I've attached three patches.  The first fix two small bugs (I think).
> The last adds some extra information to
>   /sys/block/mdX/md/stripe_cache_active
> 
> They are against 2.6.16.11.
> 
> If you could apply them and if the problem recurs, report the content
> of stripe_cache_active several times before and after changing it,
> just like you did last time, that might help throw some light on the
> situation.

i applied them against 2.6.16.18 and two days later i got my first hang... 
below is the stripe_cache foo.

thanks
-dean

neemlark:~# cd /sys/block/md4/md/
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active 
255
0 preread
bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_size 
256
neemlark:/sys/block/md4/md# echo 512 >stripe_cache_size
neemlark:/sys/block/md4/md# cat stripe_cache_active
474
187 preread
bitlist=0 delaylist=222
neemlark:/sys/block/md4/md# cat stripe_cache_active
438
222 preread
bitlist=0 delaylist=72
neemlark:/sys/block/md4/md# cat stripe_cache_active
438
222 preread
bitlist=0 delaylist=72
neemlark:/sys/block/md4/md# cat stripe_cache_active
469
222 preread
bitlist=0 delaylist=72
neemlark:/sys/block/md4/md# cat stripe_cache_active
512
72 preread
bitlist=160 delaylist=103
neemlark:/sys/block/md4/md# cat stripe_cache_active
1
0 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
2
0 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
0
0 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
2
0 preread
bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# 

md4 : active raid5 sdd1[0] sde1[5](S) sdh1[4] sdg1[3] sdf1[2] sdc1[1]
  1562834944 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU]
  bitmap: 10/187 pages [40KB], 1024KB chunk
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-17 Thread dean gaudet
On Thu, 11 May 2006, dean gaudet wrote:

> On Tue, 14 Mar 2006, Neil Brown wrote:
> 
> > On Monday March 13, [EMAIL PROTECTED] wrote:
> > > I just experienced some kind of lockup accessing my 8-drive raid5
> > > (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but
> > > now processes that try to read the md device hang. ps tells me they are
> > > all sleeping in get_active_stripe. There is nothing in the syslog, and I
> > > can read from the individual drives fine with dd. mdadm says the state
> > > is "active".
...
> 
> i seem to be running into this as well... it has happened several times 
> in the past three weeks.  i attached the kernel log output...

it happened again...  same system as before...


> > You could try increasing the size of the stripe cache
> >   echo 512 > /sys/block/mdX/md/stripe_cache_size
> > (choose and appropriate 'X').
> 
> yeah that got things going again -- it took a minute or so maybe, i
> wasn't paying attention as to how fast things cleared up.

i tried 768 this time and it wasn't enough... 1024 did it again...

> 
> > Maybe check the content of
> >  /sys/block/mdX/md/stripe_cache_active
> > as well.
> 
> next time i'll check this before i increase stripe_cache_size... it's
> 0 now, but the raid5 is working again...

here's a sequence of things i did... not sure if it helps:

# cat /sys/block/md4/md/stripe_cache_active
435
# cat /sys/block/md4/md/stripe_cache_size
512
# echo 768 >/sys/block/md4/md/stripe_cache_size
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# echo 1024 >/sys/block/md4/md/stripe_cache_size
# cat /sys/block/md4/md/stripe_cache_active
927
# cat /sys/block/md4/md/stripe_cache_active
151
# cat /sys/block/md4/md/stripe_cache_active
66
# cat /sys/block/md4/md/stripe_cache_active
2
# cat /sys/block/md4/md/stripe_cache_active
1
# cat /sys/block/md4/md/stripe_cache_active
0
# cat /sys/block/md4/md/stripe_cache_active
3

and it's OK again... except i'm going to lower the stripe_cache_size to
256 again because i'm not sure i want to keep having to double it each
freeze :)

let me know if you want the task dump output from this one too.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-11 Thread dean gaudet
On Tue, 14 Mar 2006, Neil Brown wrote:

> On Monday March 13, [EMAIL PROTECTED] wrote:
> > Hi all,
> > 
> > I just experienced some kind of lockup accessing my 8-drive raid5
> > (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but
> > now processes that try to read the md device hang. ps tells me they are
> > all sleeping in get_active_stripe. There is nothing in the syslog, and I
> > can read from the individual drives fine with dd. mdadm says the state
> > is "active".
> 
> Hmmm... That's sad. That's going to be very hard to track down.
> 
> If you could
>   echo t > /proc/sysrq-trigger
> 
> and send me the dump that appears in the kernel log, I would
> appreciate it.  I doubt it will be very helpful, but it is the best
> bet I can come up with.

i seem to be running into this as well... it has happened several times 
in the past three weeks.  i attached the kernel log output...

it's a debian 2.6.16 kernel, which is based mostly on 2.6.16.10.

md4 : active raid5 sdd1[0] sde1[5](S) sdh1[4] sdg1[3] sdf1[2] sdc1[1]
  1562834944 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU]
  bitmap: 3/187 pages [12KB], 1024KB chunk

those drives are on 3w- (7508 controller).  i'm using lvm2 and
xfs as the filesystem (although i'm pretty sure an ext3 fs on another lv
is hanging too -- but i forgot to check before i unwedged it).

let me know if anything else is useful and i can try to catch it next
time.


> You could try increasing the size of the stripe cache
>   echo 512 > /sys/block/mdX/md/stripe_cache_size
> (choose and appropriate 'X').

yeah that got things going again -- it took a minute or so maybe, i
wasn't paying attention as to how fast things cleared up.


> Maybe check the content of
>  /sys/block/mdX/md/stripe_cache_active
> as well.

next time i'll check this before i increase stripe_cache_size... it's
0 now, but the raid5 is working again...

-dean

messages.gz
Description: Binary data


proactive raid5 disk replacement success (using bitmap + raid1)

2006-04-23 Thread dean gaudet
i had a disk in a raid5 which i wanted to clone onto the hot spare... 
without going offline and without long periods without redundancy.  a few 
folks have discussed using bitmaps and temporary (superblockless) raid1 
mappings to do this... i'm not sure anyone has tried / reported success 
though.  this is my success report.

setup info:

- kernel version 2.6.16.9 (as packaged by debian)
- mdadm version 2.4.1
- /dev/md4 is the raid5
- /dev/sde1 is the disk in md4 i want to clone from
- /dev/sdh1 is the hot spare from md4, and is the clone target
- /dev/md5 is an unused md device name

here are the exact commands i issued:

mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
mdadm /dev/md4 -r /dev/sdh1
mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
mdadm /dev/md4 --re-add /dev/md5
mdadm /dev/md5 -a /dev/sdh1

... wait a few hours for md5 resync...

mdadm /dev/md4 -f /dev/md5 -r /dev/md5
mdadm --stop /dev/md5
mdadm /dev/md4 --re-add /dev/sdh1
mdadm --zero-superblock /dev/sde1
mdadm /dev/md4 -a /dev/sde1

this sort of thing shouldn't be hard to script :)
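
the only mildly fiddly bit for a script is waiting for the raid1 resync to
finish; a crude way to do it (sketch only):

# block until md5 has finished rebuilding onto the new disk
while grep -A 2 '^md5 ' /proc/mdstat | grep -q recovery ; do
    sleep 60
done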

the only times i was without full redundancy was briefly between the "-r" 
and "--re-add" commands... and with bitmap support the raid5 resync for 
each of those --re-adds was essentially zero.

thanks Neil (and others)!

-dean

p.s. it's absolutely necessary to use "--build" for the temporary raid1 
... if you use --create mdadm will rightfully tell you it's already a raid 
component and if you --force it then you'll trash the raid5 superblock and 
it won't fit into the raid5 any more...
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


forcing a read on a known bad block

2006-04-11 Thread dean gaudet
hey Neil...

i've been wanting to test out the reconstruct-on-read-error code... and 
i've had two chances to do so, but haven't be able to force md to read the 
appropriate block to trigger the code.

i had two disks with SMART Current_Pending_Sector > 0 (which indicates 
pending read error) and i did SMART long self-tests to find out where the 
bad block was (it should show the LBA in the SMART error log)...

one disk was in a raid1 -- and so it was kind of random which of the two 
disks would be read from if i tried to seek to that LBA and read... in 
theory with O_DIRECT i should have been able to randomly get the right 
disk, but that seems a bit clunky.  unfortunately i didn't think of the 
O_DIRECT trick until after i'd given up and decided to just resync the 
whole disk proactively.

the other disk was in a raid5 ... 5 disk raid5, so 20% chance of the bad 
block being in parity.  i copied the kernel code to be sure, and sure 
enough the bad block was in parity... just bad luck :)  so i can't force a 
read there any way that i know of...

anyhow this made me wonder if there's some other existing trick to force 
such reads/reconstructions to occur... or perhaps this might be a useful 
future feature.

on the raid5 disk i actually tried reading the LBA directly from the 
component device and it didn't trigger the read error, so now i'm a bit 
skeptical of the SMART log and/or my computation of the seek offset in the 
partition... but the above question is still interesting.
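
for the record, the sort of read i'm talking about would look something
like this (sketch -- the LBA and partition start are placeholders; the
real values come from the SMART error log and fdisk -lu):

BAD_LBA=123456789       # from the SMART error log (placeholder)
PART_START=63           # start sector of the partition, from fdisk -lu (placeholder)

# read just that sector from the component device, bypassing the page cache
dd if=/dev/sdf1 of=/dev/null bs=512 count=1 iflag=direct \
   skip=$((BAD_LBA - PART_START))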

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11

2006-04-11 Thread dean gaudet


On Mon, 10 Apr 2006, Marc L. de Bruin wrote:

> dean gaudet wrote:
> > On Mon, 10 Apr 2006, Marc L. de Bruin wrote:
> > 
> > > However, all "preferred minor"s are correct, meaning that the output is in
> > > sync with what I expected it to be from /etc/mdadm/mdadm.conf.
> > > 
> > > Any other ideas? Just adding /etc/mdadm/mdadm.conf to the initrd does not
> > > seem
> > > to work, since mdrun seems to ignore it?!
> 
> > it seems to me "mdrun /dev" is about the worst thing possible to use in an
> > initrd.
> 
> :-)
> 
> I guess I'll have to change to yaird asap then. I can't think of any other
> solid solution...

yeah i've been using yaird... it's not perfect -- take a look at 
<http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=351183> for a patch i 
use to improve the ability of a yaird initrd booting when you've moved 
devices or a device has failed.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11

2006-04-10 Thread dean gaudet
On Mon, 10 Apr 2006, Marc L. de Bruin wrote:

> However, all "preferred minor"s are correct, meaning that the output is in
> sync with what I expected it to be from /etc/mdadm/mdadm.conf.
> 
> Any other ideas? Just adding /etc/mdadm/mdadm.conf to the initrd does not seem
> to work, since mdrun seems to ignore it?!

yeah it looks like "mdrun /dev" just seems to assign things in the order 
they're discovered without consulting the preferred minor.

it seems to me "mdrun /dev" is about the worst thing possible to use in an 
initrd.

i opened a bug yesterday 
 ... it really seems they should stop using mdrun entirely... when i get a
chance i'll try updating the bug (or go ahead and add your own experiences
to it).

oh hey take a look at this bug for the debian mdadm package 
 ... he intends 
to deprecate mdrun.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11

2006-04-10 Thread dean gaudet
On Mon, 10 Apr 2006, Marc L. de Bruin wrote:

> dean gaudet wrote:
> 
> > initramfs-tools generates an "mdrun /dev" which starts all the raids it can
> > find... but does not include the mdadm.conf in the initrd so i'm not sure it
> > will necessarily start them in the right minor devices.  try doing an "mdadm
> > --examine /dev/xxx" on some of your partitions to see if the "preferred
> > minor" is what you expect it to be...
> >  
> [EMAIL PROTECTED]:~# sudo mdadm --examine /dev/md[01234]

try running it on /dev/sda1 or whatever the component devices are for your 
array... not on the array devices.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11

2006-04-09 Thread dean gaudet
On Sun, 9 Apr 2006, Marc L. de Bruin wrote:

...
> Okay, just pressing Control-D continues the boot process and AFAIK the root
> filesystemen actually isn't corrupt. Running e2fsck returns no errors and
> booting 2.6.11 works just fine, but I have no clue why it picked the wrong
> partitions to build md[01234].
> 
> What could have happened here?

i didn't know sarge had 2.6.11 or 2.6.15 packages... but i'm going to 
assume you've installed one of initramfs-tools or yaird in order to use 
the unstable 2.6.11 or 2.6.15 packages... so my comments might not apply.

initramfs-tools generates an "mdrun /dev" which starts all the raids it 
can find... but does not include the mdadm.conf in the initrd so i'm not 
sure it will necessarily start them in the right minor devices.  try doing 
an "mdadm --examine /dev/xxx" on some of your partitions to see if the 
"preferred minor" is what you expect it to be...

if the preferred minors are wrong there's some mdadm incantation to update 
them... see the man page.

or switch to yaird (you'll have to install yaird and purge
initramfs-tools) and dpkg-reconfigure your kernel packages to cause the
initrds to be rebuilt.  yaird starts only the raid required for the root
filesystem, and specifies the correct minor for it.  then later, after the
initrd, /etc/init.d/mdadm-raid will start the rest of your raids using
your mdadm.conf.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 high cpu usage during reads - oprofile results

2006-04-01 Thread dean gaudet


On Sat, 1 Apr 2006, Alex Izvorski wrote:

> On Sat, 2006-04-01 at 14:28 -0800, dean gaudet wrote:
> > i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause 
> > otherwise it'd be cool to run with STRIPE_SIZE the same as your raid 
> > chunksize... which would decrease the number of entries -- much more 
> > desirable than increasing the number of buckets.
> 
> Dean - that is an interesting thought.  I can't think of a reason why
> not, except that it is the same as the page size?  But offhand I don't
> see any reason why that is a particularly good choice either.  Would the
> code work with other sizes?  What about a variable (per array) size?
> How would that interact with small reads?

i don't understand the code well enough...


> Do you happen to know how many find_stripe calls there are for each
> read?  I rather suspect it is several (many) times per sector, since it
> uses up something on the order of several thousand clock cycles per
> *sector* (reading 400k sectors per second produces 80% load of 2x 2.4GHz
> cpus, of which get_active_stripe accounts for ~30% - that's 2.8k clock
> cycles per sector just in that one function). I really don't see any way
> a single hash lookup even in a table with ~30 entries per bucket could
> do anything close to that.

well the lists are all struct stripe_heads... which on i386 seem to be 
0x30 + 0x6c*(devs - 1) bytes each.  that's pretty big.  they're allocated 
in a slab, so they're relatively well packed into pages... but still, 
unless i've messed up somewhere that's 480 bytes for a 5 disk raid5.  so 
that's only 8 per page... so a chain of length 30 touches at least 4 
pages.  if you're hitting all 512 buckets, chains of length 30, then 
you're looking at somewhere on the order of 2048 pages...

that causes a lot of thrashing in the TLBs... and isn't so great on the 
cache either.

it's even worse on x86_64 ... it looks like 0xf8 + 0xb0*(devs - 1) bytes 
per stripe_head ... (i'm pulling these numbers from the call setup for 
kmem_cache_create in the disassembly of raid5.ko from kernels on my 
boxes).

oh btw you might get a small improvement by moving the "sector" field of 
struct stripe_head close to the hash field... right now the sector field 
is at 0x28 (x86_64) and so it's probably on a different cache line from 
the "hash" field at offset 0 about half the time (64 byte cache line).  if 
you move sector to right after the "hash" field it'll more likely be on 
the same cache line...
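
i.e. something along these lines -- just a sketch with the field list
abridged, not the actual struct:

/* keep the lookup key on the same cache line as the hlist node that
 * __find_stripe() walks */
struct stripe_head {
        struct hlist_node       hash;   /* offset 0 */
        sector_t                sector; /* moved up next to hash */
        /* ... the rest of the fields unchanged ... */
};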

but still, i think the tlb is the problem.

oh you can probably ask oprofile to tell you if you're seeing cache miss 
or tlb miss stalls there (not sure on the syntax).


> Short of changing STRIPE_SIZE, it should be enough to make sure the
> average bucket occupancy is considerably less than one - as long as the
> occupancy is kept low the speed of access is independent of the
> number of entries.  256 stripe cache entries and 512 hash buckets works
> well with a 0.5 max occupancy; we should ideally have at least 32k
> buckets (or 64 pages) for 16k entries.  Yeah, ok, it's quite a bit more
> memory than is used now, but considering that the box I'm running this
> on has 4GB, it's not that much ;)


i still don't understand all the code well enough... but if i assume 
there's a good reason for STRIPE_SIZE == PAGE_SIZE then it seems like you 
need to improve the cache locality of the hash chaining... a linked list 
of struct stripe_heads doesn't have very good locality because they're 
such large structures.

one possibility is a linked list of:

struct stripe_hash_entry {
        struct hlist_node       hash;
        sector_t                sector;
        struct stripe_head      *sh;
};

but that's still 32 bytes on x86_64 ...

you can get it down to 16 bytes by getting rid of chaining and using open 
addressing...
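
a rough sketch of what i mean -- this is not existing raid5 code, just the
shape of the idea (imagine it living in raid5.c where sector_t and struct
stripe_head are visible):

/* flat array of 16-byte slots probed linearly, so a lookup walks
 * consecutive cache lines instead of chasing pointers to big stripe_heads */
#include <linux/hash.h>

struct stripe_hash_slot {
        sector_t                sector;
        struct stripe_head      *sh;    /* NULL marks an empty slot */
};

static struct stripe_head *slot_lookup(struct stripe_hash_slot *tbl,
                                       unsigned int hash_bits, sector_t sector)
{
        unsigned int n = 1U << hash_bits;
        unsigned int i = hash_long((unsigned long)sector, hash_bits);
        unsigned int probes;

        for (probes = 0; probes < n; probes++, i = (i + 1) & (n - 1)) {
                if (!tbl[i].sh)
                        return NULL;            /* empty slot: not present */
                if (tbl[i].sector == sector)
                        return tbl[i].sh;
        }
        return NULL;
}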

eh ... this still isn't that hot... really there's too much pressure 
because there's a hash table entry per 4KiB of disk i/o...

anyhow i'm only eyeballing code here, i could easily have missed some 
critical details.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 high cpu usage during reads - oprofile results

2006-04-01 Thread dean gaudet
On Sat, 1 Apr 2006, Alex Izvorski wrote:

> Dean - I think I see what you mean, you're looking at this line in the
> assembly?
> 
> 65830 16.8830 : c1f:   cmp    %rcx,0x28(%rax)

yup that's the one... that's probably a fair number of cache (or tlb) 
misses going on right there.


> I looked at the hash stuff, I think the problem is not that the hash
> function is poor, but rather that the number of entries in all buckets
> gets to be pretty high.

yeah... your analysis seems more likely.

i suppose increasing the number of buckets is the only option.  it looks
to me like you'd just need to change NR_HASH and the kzalloc in run() to
get more buckets.
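
e.g. something like this -- an untested sketch against the raid5.c i'm
looking at; the factor of 16 and the HASH_ORDER name are arbitrary:

/* grow the bucket array so average chain length drops well below one */
#define HASH_ORDER      4       /* 16 pages of buckets instead of 1 */
#define NR_HASH         ((PAGE_SIZE << HASH_ORDER) / sizeof(struct hlist_head))
#define HASH_MASK       (NR_HASH - 1)

        /* and in run(), allocate the bigger table: */
        conf->stripe_hashtbl = kzalloc(PAGE_SIZE << HASH_ORDER, GFP_KERNEL);
        if (!conf->stripe_hashtbl)
                goto abort;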

i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause 
otherwise it'd be cool to run with STRIPE_SIZE the same as your raid 
chunksize... which would decrease the number of entries -- much more 
desirable than increasing the number of buckets.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 high cpu usage during reads - oprofile results

2006-03-25 Thread dean gaudet
On Sat, 25 Mar 2006, Alex Izvorski wrote:

>  http://linuxraid.pastebin.com/621363 - oprofile annotated assembly

it looks to me like a lot of time is spent in __find_stripe() ... i wonder 
if the hash is working properly.

in raid5.c you could try changing

#define stripe_hash(conf, sect) \
        (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))

to

#define stripe_hash(conf, sect) \
        (&((conf)->stripe_hashtbl[(((sect) >> STRIPE_SHIFT) ^ \
                                   ((sect) >> (2*STRIPE_SHIFT))) & HASH_MASK]))

or maybe try using jhash_1word((sect) >> STRIPE_SHIFT, 0) ...
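
i.e. something like this (untested; note the sector gets truncated to a u32
before hashing, which should be fine for this purpose):

#include <linux/jhash.h>

#define stripe_hash(conf, sect) \
        (&((conf)->stripe_hashtbl[jhash_1word((u32)((sect) >> STRIPE_SHIFT), 0) \
                                  & HASH_MASK]))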

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 that used parity for reads only when degraded

2006-03-24 Thread dean gaudet
On Thu, 23 Mar 2006, Alex Izvorski wrote:

> Also the cpu load is measured with Andrew Morton's cyclesoak
> tool which I believe to be quite accurate.

there's something cyclesoak does which i'm not sure i agree with: the
cyclesoak process dirties an array of 1000000 bytes... so what you're
really getting is some sort of composite measurement of memory system
utilisation and cpu cycle availability.

i think that 1MB number was chosen before 1MiB caches were common... and
what you get during calibration is an L2 cache-hot loop, but i'm not sure
that's an important number.

i'd look at what happens if you increase cyclesoak.c busyloop_size to 8MB
... and decrease it to 128 bytes.  the two extremes will weight the "cpu
load" towards measuring available memory system bandwidth and available
cpu cycles, respectively.
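
to illustrate -- this is not cyclesoak's actual code, just the shape of the
issue; what the soaker reports depends on whether the buffer it dirties fits
in L2 or spills to main memory:

#include <stddef.h>

static char soak_buf[8 * 1024 * 1024];          /* try 128 vs 8MB per above */

/* dirty busyloop_size bytes per pass (busyloop_size <= sizeof(soak_buf)) */
static void soak_once(size_t busyloop_size)
{
        size_t i;

        for (i = 0; i < busyloop_size; i++)
                soak_buf[i]++;                  /* one byte per iteration */
}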

also for calibration consider using a larger "-p n" ... especially if 
you've got any cpufreq/powernowd setup which is varying your clock 
rates... you want to be sure that it's calibrated (and measured) at a 
fixed clock rate.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: naming of md devices

2006-03-22 Thread dean gaudet
On Thu, 23 Mar 2006, Nix wrote:

> Last I heard the Debian initramfs constructs RAID arrays by explicitly
> specifying the devices that make them up. This is, um, a bad idea:
> the first time a disk fails or your kernel renumbers them you're
> in *trouble*.

yaird seems to dtrt ... at least in unstable.  if you install yaird 
instead of initramfs-tools you get stuff like this in the initrd /init:

mknod /dev/md3 b 9 3
mdadm -Ac partitions /dev/md3 --uuid 2b3a5b77:c7b4ab81:a2b8322a:db5c4e88

initramfs-tools also appears to do something which should work... but i
haven't tested it... it basically runs "mdrun /dev" without specifying a
minor/uuid for the root, so it'll start all arrays... i'm afraid that
might mess up one of my arrays which is "auto=mdp"... and it also has the
annoying property of starting arrays on disks you've moved from other
systems.

so anyhow i lean towards yaird at the moment... (and i should submit some 
bug reports i guess).

the above is on unstable... i don't use stable (and stable definitely does 
the wrong thing -- 
).

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to clone a disk

2006-03-11 Thread dean gaudet
On Sat, 11 Mar 2006, Ming Zhang wrote:

> On Sat, 2006-03-11 at 16:31 -0800, dean gaudet wrote:
> > if you fail the disk from the array, or boot without the failing disk, 
> > then the event counter in the other superblocks will be updated... and the 
> > removed/failed disk will no longer be considered an up to date 
> > component... so after doing the ddrescue you'd need to reassemble the 
> > raid5.  i'm not sure you can convince md to use the bitmap in this case -- 
> > i'm just not familiar enough with it.
> 
> i am a little confused here.  then what is the purpose of that bitmap?
> isn't the bitmap meant for a component that's temporarily out of place
> and thus a bit out of sync?

hmm... yeah i suppose that is the purpose of the bitmap... i haven't used 
bitmaps yet though... so i don't know which types of events they protect 
against.  in theory what you want to do sounds like it should work though, 
but i'd experiment somewhere safe first.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to clone a disk

2006-03-11 Thread dean gaudet
On Sat, 11 Mar 2006, Ming Zhang wrote:

> On Sat, 2006-03-11 at 16:15 -0800, dean gaudet wrote:
> 
> > you're planning to do this while the array is online?  that's not safe... 
> > unless it's a read-only array...
> 
> what i plan to do is to pull out the disk (which is ok now but going to
> die), so raid5 will degrade with 1 disk fail and no spare disk here,
> then do ddrescue to a new disk which will have the same uuid and everything,
> then put it back, then bitmap will shine here right?
> 
> so raid5 is still online while that disk is not part of raid5 now. and
> no diskio on it at all. so do not think i need an atomic operation here.

if you fail the disk from the array, or boot without the failing disk, 
then the event counter in the other superblocks will be updated... and the 
removed/failed disk will no longer be considered an up to date 
component... so after doing the ddrescue you'd need to reassemble the 
raid5.  i'm not sure you can convince md to use the bitmap in this case -- 
i'm just not familiar enough with it.

> this raid5 over raid1 way sounds interesting. worthy trying.

let us know how it goes :)  i've considered doing this a few times 
myself... but i've been too conservative and just taken the system down to 
single user to do the ddrescue with the raid offline entirely.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

