Re: 2.6.24-rc6 reproducible raid5 hang

2008-02-14 Thread Burkhard Carstens
On Tuesday, 29 January 2008 23:58, Bill Davidsen wrote: Carlos Carvalho wrote: Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29: Subtitle: Patch to mainline yet? Hi I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand on my server.

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-29 Thread Carlos Carvalho
Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29: Subtitle: Patch to mainline yet? Hi I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand on my server. I applied all 4 pending patches to .24. It's been better than .22 and .23... Unfortunately the

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-29 Thread Bill Davidsen
Carlos Carvalho wrote: Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29: Subtitle: Patch to mainline yet? Hi I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand on my server. I applied all 4 pending patches to .24. It's been better than .22 and

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-28 Thread Tim Southerwood
Subtitle: Patch to mainline yet? Hi I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand on my server. Was that the correct thing to do, or did this issue get fixed in a different way that I wouldn't have spotted? I had a look at the git logs but it was not obvious -

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-24 Thread Tim Southerwood
Carlos Carvalho wrote: Tim Southerwood ([EMAIL PROTECTED]) wrote on 23 January 2008 13:37: Sorry if this breaks threaded mail readers, I only just subscribed to the list so don't have the original post to reply to. I believe I'm having the same problem. Regarding XFS on a raid5 md

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-23 Thread Tim Southerwood
Sorry if this breaks threaded mail readers, I only just subscribed to the list so don't have the original post to reply to. I believe I'm having the same problem. Regarding XFS on a raid5 md array: Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and* 2.6.24-rc8 (pure build from

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-23 Thread Carlos Carvalho
Tim Southerwood ([EMAIL PROTECTED]) wrote on 23 January 2008 13:37: Sorry if this breaks threaded mail readers, I only just subscribed to the list so don't have the original post to reply to. I believe I'm having the same problem. Regarding XFS on a raid5 md array: Kernels 2.6.22-14

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread dean gaudet
On Thu, 10 Jan 2008, Neil Brown wrote: On Wednesday January 9, [EMAIL PROTECTED] wrote: On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread Dan Williams
On Jan 10, 2008 12:13 AM, dean gaudet [EMAIL PROTECTED] wrote: w.r.t. dan's cfq comments -- i really don't know the details, but does this mean cfq will misattribute the IO to the wrong user/process? or is it just a concern that CPU time will be spent on someone's IO? the latter is fine to

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote: On Jan 10, 2008 12:13 AM, dean gaudet [EMAIL PROTECTED] wrote: w.r.t. dan's cfq comments -- i really don't know the details, but does this mean cfq will misattribute the IO to the wrong user/process? or is it just a concern that CPU time

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread dean gaudet
On Fri, 11 Jan 2008, Neil Brown wrote: Thanks. But I suspect you didn't test it with a bitmap :-) I ran the mdadm test suite and it hit a problem - easy enough to fix. damn -- i lost my bitmap 'cause it was external and i didn't have things set up properly to pick it up after a reboot :) if

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote: On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: hmm bummer,

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Neil Brown
On Wednesday January 9, [EMAIL PROTECTED] wrote: On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Jan 9, 2008 5:09 PM, Neil Brown [EMAIL PROTECTED] wrote: On Wednesday January 9, [EMAIL PROTECTED] wrote: On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Neil Brown
On Wednesday January 9, [EMAIL PROTECTED] wrote: On Jan 9, 2008 5:09 PM, Neil Brown [EMAIL PROTECTED] wrote: On Wednesday January 9, [EMAIL PROTECTED] wrote: Can you test it please? This passes my failure case. Thanks! Does it seem reasonable? What do you think about limiting

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Wed, 2008-01-09 at 20:57 -0700, Neil Brown wrote: So I'm inclined to leave it as "do as much work as is available to be done" as that is simplest. But I can probably be talked out of it with a convincing argument Well, in an age of CFS and CFQ it smacks of 'unfairness'. But does that

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-30 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote: On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024... and got a hang. this time i grabbed stripe_cache_active before bumping the size again -- it was only 905 active. as i recall the bug we were

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Dan Williams
On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024... and got a hang. this time i grabbed stripe_cache_active before bumping the size

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024... and got a hang. this time i grabbed

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Justin Piszcz
On Sat, 29 Dec 2007, dean gaudet wrote: On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024...

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Dan Williams
On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote: On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Justin Piszcz wrote: Curious btw what kind of filesystem size/raid type (5, but defaults I assume, nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache size/chunk size(s) are you using/testing with? mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, dean gaudet wrote: On Sat, 29 Dec 2007, Justin Piszcz wrote: Curious btw what kind of filesystem size/raid type (5, but defaults I assume, nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache size/chunk size(s) are you using/testing with?

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread dean gaudet
hmm this seems more serious... i just ran into it with chunksize 64KiB and while just untarring a bunch of linux kernels in parallel... increasing stripe_cache_size did the trick again. -dean On Thu, 27 Dec 2007, dean gaudet wrote: hey neil -- remember that raid5 hang which me and only one

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread Justin Piszcz
On Thu, 27 Dec 2007, dean gaudet wrote: hey neil -- remember that raid5 hang which me and only one or two others ever experienced and which was hard to reproduce? we were debugging it well over a year ago (that box has 400+ day uptime now so at least that long ago :) the workaround was to

Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread dean gaudet
On Thu, 27 Dec 2007, Justin Piszcz wrote: With that high of a stripe size the stripe_cache_size needs to be greater than the default to handle it. i'd argue that any deadlock is a bug... regardless i'm still seeing deadlocks with the default chunk_size of 64k and stripe_cache_size of 256...
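
Several messages in this thread compare stripe_cache_active against stripe_cache_size (for example 905 active with a size of 1024). A minimal sketch of how one might read the two values and compute how full the cache is follows. It assumes the md sysfs attributes live under a directory such as /sys/block/md2/md; the helper names read_stripe_cache and fill_ratio are made up for illustration and come from nowhere in the thread.

```python
from pathlib import Path


def read_stripe_cache(md_dir):
    """Return (stripe_cache_size, stripe_cache_active) as integers.

    md_dir is assumed to be the array's md sysfs directory,
    e.g. /sys/block/md2/md (the path is an assumption, adjust as needed).
    """
    md_dir = Path(md_dir)
    size = int((md_dir / "stripe_cache_size").read_text().strip())
    active = int((md_dir / "stripe_cache_active").read_text().strip())
    return size, active


def fill_ratio(size, active):
    """Fraction of the stripe cache in use (hypothetical helper)."""
    if size <= 0:
        raise ValueError("stripe_cache_size must be positive")
    return active / size


if __name__ == "__main__":
    import sys

    # Default path is only an example; pass another directory as argv[1].
    md_dir = sys.argv[1] if len(sys.argv) > 1 else "/sys/block/md2/md"
    size, active = read_stripe_cache(md_dir)
    print(f"stripe_cache_size={size} stripe_cache_active={active} "
          f"fill={fill_ratio(size, active):.1%}")
```

Run against the numbers quoted earlier in the thread, a ratio below 1.0 (905 of 1024) is consistent with dean's observation that the hang occurred before the cache was completely full.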