Re: 2.6.24-rc6 reproducible raid5 hang
On Tuesday, 29 January 2008 23:58, Burkhard wrote in reply to Bill Davidsen: Carlos Carvalho wrote: Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29: Subtitle: Patch to mainline yet? Hi I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand on my server. I applied all 4 pending patches to .24. It's been better than .22 and .23... Unfortunately the bitmap and raid1 patch don't go in .22.16. Neil, have these been sent up against 24-stable and 23-stable? .. and .22-stable? Also, is this an xfs-on-raid5 bug or would it also happen with ext3-on-raid5? regards Burkhard - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.24-rc6 reproducible raid5 hang
Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29: Subtitle: Patch to mainline yet? Hi I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand on my server. I applied all 4 pending patches to .24. It's been better than .22 and .23... Unfortunately the bitmap and raid1 patch don't go in .22.16.
Re: 2.6.24-rc6 reproducible raid5 hang
Carlos Carvalho wrote: Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29: Subtitle: Patch to mainline yet? Hi I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand on my server. I applied all 4 pending patches to .24. It's been better than .22 and .23... Unfortunately the bitmap and raid1 patch don't go in .22.16. Neil, have these been sent up against 24-stable and 23-stable? -- Bill Davidsen [EMAIL PROTECTED] Woe unto the statesman who makes war without a reason that will still be valid when the war is over... Otto von Bismarck
Re: 2.6.24-rc6 reproducible raid5 hang
Subtitle: Patch to mainline yet? Hi I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand on my server. Was that the correct thing to do, or did this issue get fixed in a different way that I wouldn't have spotted? I had a look at the git logs but it was not obvious - please pardon my ignorance, I'm not familiar enough with the code. Many thanks, Tim

Tim Southerwood wrote: Carlos Carvalho wrote: Tim Southerwood ([EMAIL PROTECTED]) wrote on 23 January 2008 13:37: Sorry if this breaks threaded mail readers, I only just subscribed to the list so don't have the original post to reply to. I believe I'm having the same problem. Regarding XFS on a raid5 md array: Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and* 2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch. This has been corrected already, install Neil's patches. It worked for several people under high stress, including us.

Hi I just coerced the patch into 2.6.23.14, reset /sys/block/md1/md/stripe_cache_size to the default (256) and rebooted. I can confirm that after 2 hours of heavy bashing[1] the system has not hung. Looks good - many thanks. But I will run with a stripe_cache_size of 4096 in practice as it improves write speed on my configuration by about 2.5 times. Cheers Tim

[1] Rsync 50GB to raid plus xfs_fsr + dd 11GB of /dev/zero to the same filesystem.
Re: 2.6.24-rc6 reproducible raid5 hang
Carlos Carvalho wrote: Tim Southerwood ([EMAIL PROTECTED]) wrote on 23 January 2008 13:37: Sorry if this breaks threaded mail readers, I only just subscribed to the list so don't have the original post to reply to. I believe I'm having the same problem. Regarding XFS on a raid5 md array: Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and* 2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch. This has been corrected already, install Neil's patches. It worked for several people under high stress, including us.

Hi I just coerced the patch into 2.6.23.14, reset /sys/block/md1/md/stripe_cache_size to the default (256) and rebooted. I can confirm that after 2 hours of heavy bashing[1] the system has not hung. Looks good - many thanks. But I will run with a stripe_cache_size of 4096 in practice as it improves write speed on my configuration by about 2.5 times. Cheers Tim

[1] Rsync 50GB to raid plus xfs_fsr + dd 11GB of /dev/zero to the same filesystem.
Re: 2.6.24-rc6 reproducible raid5 hang
Sorry if this breaks threaded mail readers, I only just subscribed to the list so don't have the original post to reply to. I believe I'm having the same problem. Regarding XFS on a raid5 md array: Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and* 2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch. Raid 5 configured across 4 x 500GB SATA disks (Nforce nv_sata driver, Asus M2N-E mobo, Athlon X64, 4GB RAM). MD chunk size is 1024k. This is allocated to an LVM2 PV, then sliced up. Taking one sample logical volume of 150GB I ran mkfs.xfs -d su=1024k,sw=3 -L vol_linux /dev/vg00/vol_linux I then found that putting high write load on that filesystem caused a hang. High load could be as little as a single rsync of a mirror of Ubuntu Gutsy (many 10's of GB) from my old server to here. The hang would happen in a few hours typically. I could generate relatively quick hangs by running xfs_fsr (defragger) in parallel. Trying the workaround of upping /sys/block/md1/md/stripe_cache_size to 4096 seems (fingers crossed) to have helped. Been running the rsync again, plus xfs_fsr + a few dd's of 11 GB to the same filesystem. I did notice also that the write speed increased dramatically with a bigger stripe_cache_size. A more detailed analysis of the problem indicated that, after the hang: I could log in; one CPU core was stuck in 100% IO wait. The other core was usable, with care.
So I managed to get a SysRq-T and one place the system appeared blocked was via this path:

[ 2039.466258] xfs_fsr D 0 7324 7308
[ 2039.466260] 810119399858 0082 0046
[ 2039.466263] 810110d6c680 8101102ba998 8101102ba770 8054e5e0
[ 2039.466265] 8101102ba998 00010014a1e6 810110ddcb30
[ 2039.466268] Call Trace:
[ 2039.466277] [8808a26b] :raid456:get_active_stripe+0x1cb/0x610
[ 2039.466282] [80234000] default_wake_function+0x0/0x10
[ 2039.466289] [88090ff8] :raid456:make_request+0x1f8/0x610
[ 2039.466293] [80251c20] autoremove_wake_function+0x0/0x30
[ 2039.466295] [80331121] __up_read+0x21/0xb0
[ 2039.466300] [8031f336] generic_make_request+0x1d6/0x3d0
[ 2039.466303] [80280bad] vm_normal_page+0x3d/0xc0
[ 2039.466307] [8031f59f] submit_bio+0x6f/0xf0
[ 2039.466311] [802c98cc] dio_bio_submit+0x5c/0x90
[ 2039.466313] [802c9943] dio_send_cur_page+0x43/0xa0
[ 2039.466316] [802c99ee] submit_page_section+0x4e/0x150
[ 2039.466319] [802ca2e2] __blockdev_direct_IO+0x742/0xb50
[ 2039.466342] [8832e9a2] :xfs:xfs_vm_direct_IO+0x182/0x190
[ 2039.466357] [8832edb0] :xfs:xfs_get_blocks_direct+0x0/0x20
[ 2039.466370] [8832e350] :xfs:xfs_end_io_direct+0x0/0x80
[ 2039.466375] [80444fb5] __wait_on_bit_lock+0x65/0x80
[ 2039.466380] [80272883] generic_file_direct_IO+0xe3/0x190
[ 2039.466385] [802729a4] generic_file_direct_write+0x74/0x150
[ 2039.466402] [88336db2] :xfs:xfs_write+0x492/0x8f0
[ 2039.466421] [883099bc] :xfs:xfs_iunlock+0x2c/0xb0
[ 2039.466437] [88336866] :xfs:xfs_read+0x186/0x240
[ 2039.466443] [8029e5b9] do_sync_write+0xd9/0x120
[ 2039.466448] [80251c20] autoremove_wake_function+0x0/0x30
[ 2039.466457] [8029eead] vfs_write+0xdd/0x190
[ 2039.466461] [8029f5b3] sys_write+0x53/0x90
[ 2039.466465] [8020c29e] system_call+0x7e/0x83

However, I'm of the opinion that the system should not deadlock, even if tunable parameters are unfavourable. I'm happy with the workaround (indeed the system performs better).
However, it will take me a week's worth of testing before I'm willing to commission this as my new fileserver. So, if there is anything anyone would like me to try, I'm happy to volunteer as a guinea pig :) Yes, I can build and patch kernels. But I'm not hot at debugging kernels so if kernel core dumps or whatever are needed, please point me at the right document or hint as to which commands I need to read about. Cheers Tim
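Tim's mkfs.xfs invocation aligns the filesystem to the md geometry: su matches the 1024k chunk size, and sw is the number of data disks (4 disks minus 1 for raid5 parity). A small sketch of that arithmetic (the helper name is mine, purely illustrative, not a real mdadm/xfs tool):

```python
def xfs_alignment(chunk_kib, raid_disks, level=5):
    """Suggest mkfs.xfs su/sw values for an md raid5/6 array.

    su (stripe unit) = md chunk size; sw (stripe width) = number of
    data-bearing disks, i.e. total disks minus parity disks.
    """
    parity = {5: 1, 6: 2}[level]
    data_disks = raid_disks - parity
    su_kib = chunk_kib
    sw = data_disks
    full_stripe_kib = su_kib * sw  # one full stripe of data across the array
    return su_kib, sw, full_stripe_kib

su, sw, full = xfs_alignment(1024, 4)   # Tim's 4 x 500GB disks, 1024k chunks
print(f"mkfs.xfs -d su={su}k,sw={sw}")  # matches the command Tim ran
print(f"full data stripe = {full} KiB")
```

A write of one full data stripe (3072 KiB here) touches every disk exactly once, which is why misaligned filesystems on large-chunk raid5 arrays perform so poorly.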
Re: 2.6.24-rc6 reproducible raid5 hang
Tim Southerwood ([EMAIL PROTECTED]) wrote on 23 January 2008 13:37: Sorry if this breaks threaded mail readers, I only just subscribed to the list so don't have the original post to reply to. I believe I'm having the same problem. Regarding XFS on a raid5 md array: Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and* 2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch. This has been corrected already, install Neil's patches. It worked for several people under high stress, including us.
Re: 2.6.24-rc6 reproducible raid5 hang
On Thu, 10 Jan 2008, Neil Brown wrote: On Wednesday January 9, [EMAIL PROTECTED] wrote: On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 which was Neil's change in 2.6.22 for deferring generic_make_request until there's enough stack space for it. Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization by preventing recursive calls to generic_make_request. However the following conditions can cause raid5 to hang until 'stripe_cache_size' is increased: Thanks for pursuing this guys. That explanation certainly sounds very credible. The generic_make_request_immed is a good way to confirm that we have found the bug, but I don't like it as a long term solution, as it just reintroduced the problem that we were trying to solve with the problematic commit. As you say, we could arrange that all request submission happens in raid5d and I think this is the right way to proceed. However we can still take some of the work into the thread that is submitting the IO by calling raid5d() at the end of make_request, like this. Can you test it please? Does it seem reasonable? Thanks, NeilBrown Signed-off-by: Neil Brown [EMAIL PROTECTED] it has passed 11h of the untar/diff/rm linux.tar.gz workload... that's pretty good evidence it works for me. thanks! 
Tested-by: dean gaudet [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c    |    2 +-
 ./drivers/md/raid5.c |    4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-07 13:32:10.0 +1100
+++ ./drivers/md/md.c	2008-01-10 11:08:02.0 +1100
@@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev)
 	if (mddev->ro)
 		return;
-	if (signal_pending(current)) {
+	if (current == mddev->thread->tsk && signal_pending(current)) {
 		if (mddev->pers->sync_request) {
 			printk(KERN_INFO "md: %s in immediate safe mode\n",
 			       mdname(mddev));

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2008-01-07 13:32:10.0 +1100
+++ ./drivers/md/raid5.c	2008-01-10 11:06:54.0 +1100
@@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req
 	}
 }

+static void raid5d (mddev_t *mddev);

 static int make_request(struct request_queue *q, struct bio * bi)
 {
@@ -3547,7 +3548,7 @@ static int make_request(struct request_q
 			goto retry;
 		}
 		finish_wait(&conf->wait_for_overlap, &w);
-		handle_stripe(sh, NULL);
+		set_bit(STRIPE_HANDLE, &sh->state);
 		release_stripe(sh);
 	} else {
 		/* cannot get stripe for read-ahead, just give-up */
@@ -3569,6 +3570,7 @@ static int make_request(struct request_q
 		test_bit(BIO_UPTODATE, &bi->bi_flags) ? 0 : -EIO);
 	}
+	raid5d(mddev);
 	return 0;
 }
Re: 2.6.24-rc6 reproducible raid5 hang
On Jan 10, 2008 12:13 AM, dean gaudet [EMAIL PROTECTED] wrote: w.r.t. dan's cfq comments -- i really don't know the details, but does this mean cfq will misattribute the IO to the wrong user/process? or is it just a concern that CPU time will be spent on someone's IO? the latter is fine to me... the former seems sucky because with today's multicore systems CPU time seems cheap compared to IO. I do not see this affecting the time slicing feature of cfq, because as Neil says the work has to get done at some point. If I give up some of my slice working on someone else's I/O chances are the favor will be returned in kind since the code does not discriminate. The io-priority capability of cfq currently does not work as advertised with current MD since the priority is tied to the current thread and the thread that actually submits the i/o on a stripe is non-deterministic. So I do not see this change making the situation any worse. In fact, it may make it a bit better since there is a higher chance for the thread submitting i/o to MD to do its own i/o to the backing disks. Reviewed-by: Dan Williams [EMAIL PROTECTED]
Re: 2.6.24-rc6 reproducible raid5 hang
On Thursday January 10, [EMAIL PROTECTED] wrote: On Jan 10, 2008 12:13 AM, dean gaudet [EMAIL PROTECTED] wrote: w.r.t. dan's cfq comments -- i really don't know the details, but does this mean cfq will misattribute the IO to the wrong user/process? or is it just a concern that CPU time will be spent on someone's IO? the latter is fine to me... the former seems sucky because with today's multicore systems CPU time seems cheap compared to IO. I do not see this affecting the time slicing feature of cfq, because as Neil says the work has to get done at some point. If I give up some of my slice working on someone else's I/O chances are the favor will be returned in kind since the code does not discriminate. The io-priority capability of cfq currently does not work as advertised with current MD since the priority is tied to the current thread and the thread that actually submits the i/o on a stripe is non-deterministic. So I do not see this change making the situation any worse. In fact, it may make it a bit better since there is a higher chance for the thread submitting i/o to MD to do its own i/o to the backing disks. Reviewed-by: Dan Williams [EMAIL PROTECTED] Thanks. But I suspect you didn't test it with a bitmap :-) I ran the mdadm test suite and it hit a problem - easy enough to fix. I'll look out for any other possible related problem (due to raid5d running in different processes) and then submit it. Thanks, NeilBrown
Re: 2.6.24-rc6 reproducible raid5 hang
On Fri, 11 Jan 2008, Neil Brown wrote: Thanks. But I suspect you didn't test it with a bitmap :-) I ran the mdadm test suite and it hit a problem - easy enough to fix. damn -- i lost my bitmap 'cause it was external and i didn't have things set up properly to pick it up after a reboot :) if you send an updated patch i'll give it another spin... -dean
Re: 2.6.24-rc6 reproducible raid5 hang
On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote: On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024... and got a hang. this time i grabbed stripe_cache_active before bumping the size again -- it was only 905 active. as i recall the bug we were debugging a year+ ago the active was at the size when it would hang. so this is probably something new. I believe I am seeing the same issue and am trying to track down whether XFS is doing something unexpected, i.e. I have not been able to reproduce the problem with EXT3. MD tries to increase throughput by letting some stripe work build up in batches. It looks like every time your system has hung it has been in the 'inactive_blocked' state i.e. 3/4 of stripes active. This state should automatically clear... cool, glad you can reproduce it :) i have a bit more data... i'm seeing the same problem on debian's 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. This is just brainstorming at this point, but it looks like xfs can submit more requests in the bi_end_io path such that it can lock itself out of the RAID array. The sequence that concerns me is: return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->hang I need to verify whether this path is actually triggering, but if we are in an inactive_blocked condition this new request will be put on a wait queue and we'll never get to the release_stripe() call after return_io(). It would be interesting to see if this is new XFS behavior in recent kernels.
i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 which was Neil's change in 2.6.22 for deferring generic_make_request until there's enough stack space for it. with my git tree sync'd to that commit my test cases fail in under 20 minutes uptime (i rebooted and tested 3x). sync'd to the commit previous to it i've got 8h of run-time now without the problem. this isn't definitive of course since it does seem to be timing dependent, but since all failures have occurred much earlier than that for me so far i think this indicates this change is either the cause of the problem or exacerbates an existing raid5 problem. given that this problem looks like a very rare problem i saw with 2.6.18 (raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an existing problem... not that i have evidence either way. i've attached a new kernel log with a hang at d89d87965d... and the reduced config file i was using for the bisect. hopefully the hang looks the same as what we were seeing at 2.6.24-rc6. let me know.

Dean could you try the below patch to see if it fixes your failure scenario? It passes my test case. Thanks, Dan

---

md: add generic_make_request_immed to prevent raid5 hang

From: Dan Williams [EMAIL PROTECTED]

Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization by preventing recursive calls to generic_make_request. However the following conditions can cause raid5 to hang until 'stripe_cache_size' is increased:

1/ stripe_cache_active is N stripes away from the 'inactive_blocked' limit (3/4 * stripe_cache_size)
2/ a bio is submitted that requires M stripes to be processed, where M > N
3/ stripes 1 through N are up-to-date and ready for immediate processing, i.e. no trip through raid5d required

This results in the calling thread hanging while waiting for resources to process stripes N through M.
This means we never return from make_request. All other raid5 users pile up in get_active_stripe. Increasing stripe_cache_size temporarily resolves the blockage by allowing the blocked make_request to return to generic_make_request. Another way to solve this is to move all i/o submission to raid5d context. Thanks to Dean Gaudet for bisecting this down to d89d8796.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---
 block/ll_rw_blk.c      |   16 +---
 drivers/md/raid5.c     |    4 ++--
 include/linux/blkdev.h |    1 +
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 8b91994..bff40c2 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -3287,16 +3287,26 @@ end_io:
 }

 /*
- * We only want one ->make_request_fn to be active at a time,
- * else stack usage with stacked devices could be a problem.
+ * In the general case we only want one ->make_request_fn to be
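The three numbered conditions in Dan's commit message can be restated as a toy model. This is illustrative arithmetic with made-up numbers, not kernel code; the function name and parameters are mine:

```python
def raid5_would_block(stripe_cache_size, stripe_cache_active, stripes_needed):
    """Toy model of the hang described in the commit message.

    get_active_stripe() starts blocking callers once the cache is
    'inactive_blocked', i.e. active stripes reach 3/4 of
    stripe_cache_size.  If a single bio needs M stripes but only
    N < M can be activated before that limit is hit, and nothing else
    releases stripes, the submitting thread waits forever.
    """
    limit = 3 * stripe_cache_size // 4
    headroom = max(0, limit - stripe_cache_active)  # N: stripes still obtainable
    return stripes_needed > headroom                # M > N  =>  hang

# With the default stripe_cache_size of 256 the blocking limit is 192 stripes:
print(raid5_would_block(256, 190, 8))    # True: only 2 stripes of headroom left
print(raid5_would_block(4096, 190, 8))   # False: raising the cache unblocks it
```

This also shows why bumping stripe_cache_size "unhangs" a stuck array: it moves the 3/4 limit above the current active count, so the blocked make_request can finally obtain its remaining stripes.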
Re: 2.6.24-rc6 reproducible raid5 hang
On Wednesday January 9, [EMAIL PROTECTED] wrote: On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 which was Neil's change in 2.6.22 for deferring generic_make_request until there's enough stack space for it. Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization by preventing recursive calls to generic_make_request. However the following conditions can cause raid5 to hang until 'stripe_cache_size' is increased: Thanks for pursuing this guys. That explanation certainly sounds very credible. The generic_make_request_immed is a good way to confirm that we have found the bug, but I don't like it as a long term solution, as it just reintroduced the problem that we were trying to solve with the problematic commit. As you say, we could arrange that all request submission happens in raid5d and I think this is the right way to proceed. However we can still take some of the work into the thread that is submitting the IO by calling raid5d() at the end of make_request, like this. Can you test it please? Does it seem reasonable? 
Thanks, NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c    |    2 +-
 ./drivers/md/raid5.c |    4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-07 13:32:10.0 +1100
+++ ./drivers/md/md.c	2008-01-10 11:08:02.0 +1100
@@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev)
 	if (mddev->ro)
 		return;
-	if (signal_pending(current)) {
+	if (current == mddev->thread->tsk && signal_pending(current)) {
 		if (mddev->pers->sync_request) {
 			printk(KERN_INFO "md: %s in immediate safe mode\n",
 			       mdname(mddev));

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2008-01-07 13:32:10.0 +1100
+++ ./drivers/md/raid5.c	2008-01-10 11:06:54.0 +1100
@@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req
 	}
 }

+static void raid5d (mddev_t *mddev);

 static int make_request(struct request_queue *q, struct bio * bi)
 {
@@ -3547,7 +3548,7 @@ static int make_request(struct request_q
 			goto retry;
 		}
 		finish_wait(&conf->wait_for_overlap, &w);
-		handle_stripe(sh, NULL);
+		set_bit(STRIPE_HANDLE, &sh->state);
 		release_stripe(sh);
 	} else {
 		/* cannot get stripe for read-ahead, just give-up */
@@ -3569,6 +3570,7 @@ static int make_request(struct request_q
 		test_bit(BIO_UPTODATE, &bi->bi_flags) ? 0 : -EIO);
 	}
+	raid5d(mddev);
 	return 0;
 }
Re: 2.6.24-rc6 reproducible raid5 hang
On Jan 9, 2008 5:09 PM, Neil Brown [EMAIL PROTECTED] wrote: On Wednesday January 9, [EMAIL PROTECTED] wrote: On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote: i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 which was Neil's change in 2.6.22 for deferring generic_make_request until there's enough stack space for it. Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization by preventing recursive calls to generic_make_request. However the following conditions can cause raid5 to hang until 'stripe_cache_size' is increased: Thanks for pursuing this guys. That explanation certainly sounds very credible. The generic_make_request_immed is a good way to confirm that we have found the bug, but I don't like it as a long term solution, as it just reintroduced the problem that we were trying to solve with the problematic commit. As you say, we could arrange that all request submission happens in raid5d and I think this is the right way to proceed. However we can still take some of the work into the thread that is submitting the IO by calling raid5d() at the end of make_request, like this. Can you test it please? This passes my failure case. However, my test is different from Dean's in that I am using tiobench and the latest rev of my 'get_priority_stripe' patch. I believe the failure mechanism is the same, but it would be good to get confirmation from Dean. get_priority_stripe has the effect of increasing the frequency of make_request->handle_stripe->generic_make_request sequences. Does it seem reasonable? What do you think about limiting the number of stripes the submitting thread handles to be equal to what it submitted? If I'm a thread that only submits 1 stripe worth of work should I get stuck handling the rest of the cache?
Regards, Dan
Re: 2.6.24-rc6 reproducible raid5 hang
On Wednesday January 9, [EMAIL PROTECTED] wrote: On Jan 9, 2008 5:09 PM, Neil Brown [EMAIL PROTECTED] wrote: On Wednesday January 9, [EMAIL PROTECTED] wrote: Can you test it please? This passes my failure case. Thanks! Does it seem reasonable? What do you think about limiting the number of stripes the submitting thread handles to be equal to what it submitted? If I'm a stripe that only submits 1 stripe worth of work should I get stuck handling the rest of the cache? Dunno. Someone has to do the work, and leaving it all to raid5d means that it all gets done on one CPU. I expect that most of the time the queue of ready stripes is empty, so make_request will mostly only handle its own stripes anyway. The times that it handles other threads' stripes will probably balance out with the times that other threads handle this thread's stripes. So I'm inclined to leave it as "do as much work as is available to be done" as that is simplest. But I can probably be talked out of it with a convincing argument. NeilBrown
Re: 2.6.24-rc6 reproducible raid5 hang
On Wed, 2008-01-09 at 20:57 -0700, Neil Brown wrote: So I'm inclined to leave it as "do as much work as is available to be done" as that is simplest. But I can probably be talked out of it with a convincing argument Well, in an age of CFS and CFQ it smacks of 'unfairness'. But does that trump KISS...? Probably not. -- Dan
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote: On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024... and got a hang. this time i grabbed stripe_cache_active before bumping the size again -- it was only 905 active. as i recall the bug we were debugging a year+ ago the active was at the size when it would hang. so this is probably something new. I believe I am seeing the same issue and am trying to track down whether XFS is doing something unexpected, i.e. I have not been able to reproduce the problem with EXT3. MD tries to increase throughput by letting some stripe work build up in batches. It looks like every time your system has hung it has been in the 'inactive_blocked' state i.e. 3/4 of stripes active. This state should automatically clear... cool, glad you can reproduce it :) i have a bit more data... i'm seeing the same problem on debian's 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. This is just brainstorming at this point, but it looks like xfs can submit more requests in the bi_end_io path such that it can lock itself out of the RAID array. The sequence that concerns me is: return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->hang I need to verify whether this path is actually triggering, but if we are in an inactive_blocked condition this new request will be put on a wait queue and we'll never get to the release_stripe() call after return_io(). It would be interesting to see if this is new XFS behavior in recent kernels.
i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 which was Neil's change in 2.6.22 for deferring generic_make_request until there's enough stack space for it. with my git tree sync'd to that commit my test cases fail in under 20 minutes uptime (i rebooted and tested 3x). sync'd to the commit previous to it i've got 8h of run-time now without the problem. this isn't definitive of course since it does seem to be timing dependent, but since all failures have occurred much earlier than that for me so far i think this indicates this change is either the cause of the problem or exacerbates an existing raid5 problem. given that this problem looks like a very rare problem i saw with 2.6.18 (raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an existing problem... not that i have evidence either way. i've attached a new kernel log with a hang at d89d87965d... and the reduced config file i was using for the bisect. hopefully the hang looks the same as what we were seeing at 2.6.24-rc6. let me know. -dean

[Attachment: kern.log.d89d87965d.bz2 (binary data)]
[Attachment: config-2.6.21-b1.bz2 (binary data)]
Re: 2.6.24-rc6 reproducible raid5 hang
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024... and got a hang. this time i grabbed stripe_cache_active before bumping the size again -- it was only 905 active. as i recall the bug we were debugging a year+ ago the active was at the size when it would hang. so this is probably something new. anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to hit that limit too if i try harder :) btw what units are stripe_cache_size/active in? is the memory consumed equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size * raid_disks * stripe_cache_active)? -dean

On Thu, 27 Dec 2007, dean gaudet wrote: hmm this seems more serious... i just ran into it with chunksize 64KiB and while just untarring a bunch of linux kernels in parallel... increasing stripe_cache_size did the trick again. -dean

On Thu, 27 Dec 2007, dean gaudet wrote: hey neil -- remember that raid5 hang which me and only one or two others ever experienced and which was hard to reproduce? we were debugging it well over a year ago (that box has 400+ day uptime now so at least that long ago :) the workaround was to increase stripe_cache_size... i seem to have a way to reproduce something which looks much the same.

setup:
- 2.6.24-rc6
- system has 8GiB RAM but no swap
- 8x750GB in a raid5 with one spare, chunksize 1024KiB
- mkfs.xfs default options
- mount -o noatime
- dd if=/dev/zero of=/mnt/foo bs=4k count=2621440

that sequence hangs for me within 10 seconds... and i can unhang / rehang it by toggling between stripe_cache_size 256 and 1024. i detect the hang by watching iostat -kx /dev/sd? 5. i've attached the kernel log where i dumped task and timer state while it was hung... note that you'll see at some point i did an xfs mount with external journal but it happens with internal journal as well. looks like it's using the raid456 module and async api.
> > anyhow let me know if you need more info / have any suggestions.
> >
> > -dean

- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]. More majordomo info at http://vger.kernel.org/majordomo-info.html
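Not part of the original mail, but for anyone following along: the knobs dean is toggling live under the standard md sysfs directory. A minimal sketch, assuming the array is /dev/md2 as in the thread (guarded so it is harmless on a box without the array); the dd workload size also checks out as 10 GiB:

```shell
# sketch: the sysfs knobs dean toggles to unhang / rehang the array
SYS=/sys/block/md2/md
if [ -w "$SYS/stripe_cache_size" ]; then
    cat "$SYS/stripe_cache_size"            # default is 256
    cat "$SYS/stripe_cache_active"          # stripes currently in use
    echo 1024 > "$SYS/stripe_cache_size"    # bumping this un-wedges a hung array
fi
# dean detects the hang with: iostat -kx /dev/sd? 5  (all-zero I/O while dd is stuck)

# sanity check on the workload size: bs=4k count=2621440 is a 10 GiB write
echo $(( 4096 * 2621440 ))   # 10737418240 bytes = 10 GiB
```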
Re: 2.6.24-rc6 reproducible raid5 hang
On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
> hmm bummer, i'm doing another test (rsync of 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024... and got a hang. this time i grabbed stripe_cache_active before bumping the size again -- it was only 905 active. as i recall, in the bug we were debugging a year+ ago the active count was equal to the size when it would hang, so this is probably something new.

I believe I am seeing the same issue and am trying to track down whether XFS is doing something unexpected, i.e. I have not been able to reproduce the problem with EXT3.

MD tries to increase throughput by letting some stripe work build up in batches. It looks like every time your system has hung it has been in the 'inactive_blocked' state, i.e. 3/4 of stripes active. This state should automatically clear...

> anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to hit that limit too if i try harder :)

Once you hang, if 'stripe_cache_size' is increased such that stripe_cache_active < 3/4 * stripe_cache_size, things will start flowing again.

> btw what units are stripe_cache_size/active in? is the memory consumed equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size * raid_disks * stripe_cache_active)?

memory_consumed = PAGE_SIZE * raid_disks * stripe_cache_size

> -dean

--
Dan
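A quick sanity check of Dan's formula against dean's setup (a sketch, not from the mail: PAGE_SIZE=4096 is assumed for x86-64, and whether the spare counts toward raid_disks is not specified -- 8 devices is used here):

```shell
# memory_consumed = PAGE_SIZE * raid_disks * stripe_cache_size
PAGE_SIZE=4096           # assumed x86-64 page size
RAID_DISKS=8             # dean's 8x750GB array (incl. spare; adjust to taste)
STRIPE_CACHE_SIZE=1024
echo $(( PAGE_SIZE * RAID_DISKS * STRIPE_CACHE_SIZE ))   # 33554432 bytes = 32 MiB

# the 'inactive_blocked' threshold Dan describes: 3/4 of stripe_cache_size
echo $(( STRIPE_CACHE_SIZE * 3 / 4 ))   # 768 -- dean's observed 905 active is above this
```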
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, Dan Williams wrote:
> I believe I am seeing the same issue and am trying to track down whether XFS is doing something unexpected, i.e. I have not been able to reproduce the problem with EXT3.
>
> MD tries to increase throughput by letting some stripe work build up in batches. It looks like every time your system has hung it has been in the 'inactive_blocked' state, i.e. 3/4 of stripes active. This state should automatically clear...

cool, glad you can reproduce it :)

i have a bit more data... i'm seeing the same problem on debian's 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. i'm doing some more isolation, but just grabbing kernels i have precompiled so far -- a 2.6.19.7 kernel doesn't show the problem, and early indications are that a 2.6.21.7 kernel doesn't have the problem either, but i'm giving it longer to show its head. i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just to get the debian patches out of the way.

i was tempted to blame the async api because it's newish :) but according to the dmesg output the 2.6.22-3-amd64 kernel didn't use the async API, and it still hung, so async is probably not to blame.

anyhow the test case i'm using is the dma_thrasher script i attached... it takes about an hour to give me confidence there are no problems, so this will take a while.

-dean
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, dean gaudet wrote:
> anyhow the test case i'm using is the dma_thrasher script i attached... it takes about an hour to give me confidence there are no problems, so this will take a while.
>
> -dean

Dean,

Curious, btw: what kind of filesystem size / raid type (5, but defaults I assume, nothing special, right? right-symmetric vs. left-symmetric, etc.) / cache size / chunk size(s) are you using/testing with?

The script you sent out earlier -- you are able to reproduce it easily with 31 or so kernel tar decompressions?

Justin.
Re: 2.6.24-rc6 reproducible raid5 hang
On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote:
> i have a bit more data... i'm seeing the same problem on debian's 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

This is just brainstorming at this point, but it looks like xfs can submit more requests in the bi_end_io path such that it can lock itself out of the RAID array. The sequence that concerns me is:

return_io -> xfs_buf_end_io -> xfs_buf_io_end -> xfs_buf_iodone_work -> xfs_buf_iorequest -> make_request -> hang

I need to verify whether this path is actually triggering, but if we are in an inactive_blocked condition this new request will be put on a wait queue and we'll never get to the release_stripe() call after return_io(). It would be interesting to see if this is new XFS behavior in recent kernels.

--
Dan
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, Justin Piszcz wrote:
> Curious, btw: what kind of filesystem size / raid type (5, but defaults I assume, nothing special, right? right-symmetric vs. left-symmetric, etc.) / cache size / chunk size(s) are you using/testing with?

mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
mkfs.xfs -f /dev/md2

otherwise defaults

> The script you sent out earlier -- you are able to reproduce it easily with 31 or so kernel tar decompressions?

not sure, the point of the script is to untar more than there is RAM. it happened with a single rsync running though -- 3.5M inodes from a remote box. it also happens with the single 10GB dd write... although i've been using the tar method for testing different kernel revs.

-dean
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, dean gaudet wrote:
> mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
> mkfs.xfs -f /dev/md2
>
> otherwise defaults

hmm i missed a few things, here's exactly how i created the array:

mdadm --create --level=5 --chunk=64 -n7 -x1 --assume-clean /dev/md2 /dev/sd[a-h]1

it's reassembled automagically each reboot, but i do this each reboot:

mkfs.xfs -f /dev/md2
mount -o noatime /dev/md2 /mnt/new
./dma_thrasher linux.tar.gz /mnt/new

the --assume-clean and noatime probably make no difference though...

on the bisection front it looks like it's new behaviour between 2.6.21.7 and 2.6.22.15 (stock kernels now, not debian). i've got to step out for a while, but i'll go at it again later, probably with git bisect unless someone has some cherry-picked changes to suggest.

-dean
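Not stated in the mail, but as a back-of-the-envelope on the geometry that mdadm line produces (the 750GB drive size is taken from earlier in the thread): -n7 means 7 active raid devices, of which one disk's worth of space holds parity, and -x1 is the hot spare:

```shell
# usable capacity of a raid5 with N active devices: (N - 1) disks of data
N=7           # active devices (-n7); the -x1 spare holds no data
DISK_GB=750   # nominal drive size from the thread
echo $(( (N - 1) * DISK_GB ))   # 4500 GB of usable space
```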
Re: 2.6.24-rc6 reproducible raid5 hang
hmm this seems more serious... i just ran into it with chunksize 64KiB while just untarring a bunch of linux kernels in parallel... increasing stripe_cache_size did the trick again.

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:
> hey neil -- remember that raid5 hang which me and only one or two others ever experienced and which was hard to reproduce? [...] i seem to have a way to reproduce something which looks much the same.
>
> setup:
> - 2.6.24-rc6
> - system has 8GiB RAM but no swap
> - 8x750GB in a raid5 with one spare, chunksize 1024KiB
> - mkfs.xfs default options
> - mount -o noatime
> - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
>
> that sequence hangs for me within 10 seconds... and i can unhang / rehang it by toggling stripe_cache_size between 256 and 1024. i detect the hang by watching iostat -kx /dev/sd? 5. [...]
>
> anyhow let me know if you need more info / have any suggestions.
>
> -dean
Re: 2.6.24-rc6 reproducible raid5 hang
On Thu, 27 Dec 2007, dean gaudet wrote:
> setup:
> - 2.6.24-rc6
> - system has 8GiB RAM but no swap
> - 8x750GB in a raid5 with one spare, chunksize 1024KiB
> [...]
> that sequence hangs for me within 10 seconds... and i can unhang / rehang it by toggling stripe_cache_size between 256 and 1024.

With a chunk size that high, the stripe_cache_size needs to be greater than the default to handle it.

Justin.
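Justin's point can be made concrete (a sketch, using Dan's PAGE_SIZE-units observation from earlier in the thread): each stripe-cache entry covers one 4 KiB page per device, so dean's 1024 KiB chunk spans 256 entries -- exactly the default stripe_cache_size, leaving no slack:

```shell
CHUNK_KIB=1024   # dean's chunk size
PAGE_KIB=4       # one stripe-cache entry covers one page per device
echo $(( CHUNK_KIB / PAGE_KIB ))   # 256 entries to cover a single full chunk,
                                   # which equals the default stripe_cache_size
```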
Re: 2.6.24-rc6 reproducible raid5 hang
On Thu, 27 Dec 2007, Justin Piszcz wrote:
> With a chunk size that high, the stripe_cache_size needs to be greater than the default to handle it.

i'd argue that any deadlock is a bug... regardless, i'm still seeing deadlocks with the default chunk_size of 64k and stripe_cache_size of 256... in this case it's with a workload which is untarring 34 copies of the linux kernel at the same time. it's a variant of doug ledford's memtest, and i've attached it.

-dean

#!/usr/bin/perl
# Copyright (c) 2007 dean gaudet [EMAIL PROTECTED]
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
# OTHER DEALINGS IN THE SOFTWARE.

# this idea shamelessly stolen from doug ledford

use warnings;
use strict;

# ensure stdout is not buffered
select(STDOUT); $| = 1;

my $usage = "usage: $0 linux.tar.gz /path1 [/path2 ...]\n";
defined(my $tarball = shift) or die $usage;
-f $tarball or die "$tarball does not exist or is not a file\n";
my @paths = @ARGV;
$#paths >= 0 or die $usage;

# determine size of uncompressed tarball
open(GZIP, "-|") || exec "gzip", "--quiet", "--list", $tarball;
my $line = <GZIP>;
my ($tarball_size) = $line =~ m#^\s*\d+\s*(\d+)#;
defined($tarball_size) or die "unexpected result from gzip --quiet --list $tarball\n";
close(GZIP);

# determine amount of memory
open(MEMINFO, "/proc/meminfo") or die "unable to open /proc/meminfo for read: $!\n";
my $total_mem;
while (<MEMINFO>) {
    if (/^MemTotal:\s*(\d+)\s*kB/) {
        $total_mem = $1;
        last;
    }
}
defined($total_mem) or die "did not find MemTotal line in /proc/meminfo\n";
close(MEMINFO);
$total_mem *= 1024;

print "total memory: $total_mem\n";
print "uncompressed tarball: $tarball_size\n";

my $nr_simultaneous = int(1.2 * $total_mem / $tarball_size);
print "nr simultaneous processes: $nr_simultaneous\n";

sub system_or_die {
    my @args = @_;
    system(@args);
    if ($? == -1) {
        die sprintf("%s failed to exec %s: $!\n", scalar(localtime), $args[0]);
    }
    elsif ($? & 127) {
        die sprintf("%s %s died with signal %d, %s coredump\n",
            scalar(localtime), $args[0], ($? & 127), ($? & 128) ? "with" : "without");
    }
    elsif (($? >> 8) != 0) {
        die sprintf("%s %s exited with non-zero exit code %d\n",
            scalar(localtime), $args[0], $? >> 8);
    }
}

sub untar($) {
    mkdir($_[0]) or die localtime()." unable to mkdir($_[0]): $!\n";
    system_or_die("tar", "-xzf", $tarball, "-C", $_[0]);
}

print localtime()." untarring golden copy\n";
my $golden = $paths[0]."/dma_tmp.$$.gold";
untar($golden);

my $pass_no = 0;
while (1) {
    print localtime()." pass $pass_no: extracting\n";
    my @outputs;
    foreach my $n (1..$nr_simultaneous) {
        # treat paths in a round-robin manner
        my $dir = shift(@paths);
        push(@paths, $dir);
        $dir .= "/dma_tmp.$$.$n";
        push(@outputs, $dir);
        my $pid = fork;
        defined($pid) or die localtime()." unable to fork: $!\n";
        if ($pid == 0) {
            untar($dir);
            exit(0);
        }
    }
    # wait for the children
    while (wait != -1) {}

    print localtime()." pass $pass_no: diffing\n";
    foreach my $dir (@outputs) {
        my $pid = fork;
        defined($pid) or die localtime()." unable to fork: $!\n";
        if ($pid == 0) {
            system_or_die("diff", "-U", "3", "-rN", $golden, $dir);
            system_or_die("rm", "-fr", $dir);
            exit(0);
        }
    }
    # wait for the children
    while (wait != -1) {}
    ++$pass_no;
}