Re: [RFC] A SCSI fault injection framework using SystemTap.
Matthew Wilcox wrote:
> On Tue, Jan 15, 2008 at 12:04:09PM +0900, K.Tanaka wrote:
>> I would like to introduce a SCSI fault injection framework using SystemTap. Currently, the kernel has the fault-injection framework, and md has its Faulty mode, both of which can be used for testing error handling. But they can only produce fixed types of errors, stochastically. In order to simulate more realistic SCSI disk faults, I have created a new, flexible fault injection framework using SystemTap.
>
> How does it compare to using scsi_debug, which I believe can do all of the above and more?

Sorry for the lack of explanation. The new framework is meant to be driven by a userspace testing tool (such as a shell script). For flexibility, the framework lets the user designate the inode number of a target file on the device as the fault-injection target; a fault is injected when that file is accessed through the page cache. Alternatively, the user can designate a logical block address as the target position of a fault injection.

--
Kenichi TANAKA | Open Source Software Platform Development Division | Computers Software Operations Unit, NEC Corporation | [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
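The inode-number targeting described above could be driven from a test script roughly as follows. This is only an illustrative sketch: the framework's actual invocation is not shown in the mail, and `TARGET` here is a hypothetical stand-in for a file on the device under test.

```shell
# Look up the inode number of the target file; per the mail above, the
# framework takes this inode number as the fault-injection target.
# TARGET is a hypothetical stand-in for a file on the device under test.
TARGET=$(mktemp)
INODE=$(stat -c %i "$TARGET")   # %i prints the file's inode number
echo "inode of $TARGET is $INODE"
rm -f "$TARGET"
```

A driver script would then pass `$INODE` to the SystemTap script so the probe fires only for I/O against that file.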
Re: [dm-devel] [RFC] A SCSI fault injection framework using SystemTap.
On Tue, Jan 15, 2008 at 12:04:09PM +0900, K.Tanaka wrote:
> - dm-mirror's redundancy doesn't work: a read error from one of the disks making up the array is passed directly up to userspace, without reading from the other mirror. (This turns out to be a known issue, but the patch is not merged: http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-raid1-handle-read-failures.patch)

It's in the queue for 2.6.25.

Alasdair
--
[EMAIL PROTECTED]
Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
On Mon, 14 Jan 2008, NeilBrown wrote:
> raid5's 'make_request' function calls generic_make_request on underlying devices and, if we run out of stripe heads, it could end up waiting for one of those requests to complete. This is bad, as recursive calls to generic_make_request go on a queue and are not even attempted until make_request completes.
>
> So: don't make any generic_make_request calls in raid5 make_request until all waiting has been done. We do this by simply setting STRIPE_HANDLE instead of calling handle_stripe(). If we need more stripe_heads, raid5d will get called to process the pending stripe_heads, which will call generic_make_request from a different thread where no deadlock will happen.
>
> This change by itself causes a performance hit. So add a change so that raid5_activate_delayed is only called at unplug time, never in raid5. This seems to bring back the performance numbers. Calling it in raid5d was sometimes too soon...
>
> Cc: Dan Williams [EMAIL PROTECTED]
> Signed-off-by: Neil Brown [EMAIL PROTECTED]

probably doesn't matter, but for the record:

Tested-by: dean gaudet [EMAIL PROTECTED]

this time i tested with internal and external bitmaps and it survived 8h and 14h respectively under the parallel tar workload i used to reproduce the hang.

btw this should probably be a candidate for 2.6.22 and .23 stable.

thanks
-dean
Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet [EMAIL PROTECTED] wrote:
> On Mon, 14 Jan 2008, NeilBrown wrote:
>> raid5's 'make_request' function calls generic_make_request on underlying devices and if we run out of stripe heads, it could end up waiting for one of those requests to complete. [...]
>
> probably doesn't matter, but for the record:
>
> Tested-by: dean gaudet [EMAIL PROTECTED]
>
> this time i tested with internal and external bitmaps and it survived 8h and 14h resp. under the parallel tar workload i used to reproduce the hang.
>
> btw this should probably be a candidate for 2.6.22 and .23 stable.

hm, Neil said:

"The first fixes a bug which could make it a candidate for 24-final. However it is a deadlock that seems to occur very rarely, and has been in mainline since 2.6.22. So letting it into one more release shouldn't be a big problem. While the fix is fairly simple, it could have some unexpected consequences, so I'd rather go for the next cycle."

food fight!
Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
On Tue, 15 Jan 2008, Andrew Morton wrote:
> On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet [EMAIL PROTECTED] wrote:
>> On Mon, 14 Jan 2008, NeilBrown wrote:
>>> raid5's 'make_request' function calls generic_make_request on underlying devices and if we run out of stripe heads, it could end up waiting for one of those requests to complete. [...]
>>
>> btw this should probably be a candidate for 2.6.22 and .23 stable.
>
> hm, Neil said:
>
> "The first fixes a bug which could make it a candidate for 24-final. However it is a deadlock that seems to occur very rarely, and has been in mainline since 2.6.22. So letting it into one more release shouldn't be a big problem. While the fix is fairly simple, it could have some unexpected consequences, so I'd rather go for the next cycle."
>
> food fight!

heheh. it's really easy to reproduce the hang without the patch -- i could hang the box in under 20 min on 2.6.22+ w/ XFS and raid5 on 7x750GB. i'll try with ext3...

Dan's experiences suggest it won't happen with ext3 (or is even rarer), which would explain why this is, overall, a rare problem. but it doesn't result in data loss or a permanent system hang as long as you can become root and raise the size of the stripe cache... so OK, i agree with Neil: let's test more. food fight over! :)

-dean
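The workaround dean mentions, raising the stripe cache size, goes through md's sysfs interface. A minimal sketch, assuming an array named `md0` (adjust to your device) and root privileges; the cache is counted in entries, each caching a page per member device, so raising it costs memory:

```shell
# Raise the raid5/raid6 stripe cache via sysfs. "md0" is an assumed
# device name; the attribute only exists for raid5/raid6 arrays.
MD=md0
SYSFS=/sys/block/$MD/md/stripe_cache_size
if [ -w "$SYSFS" ]; then
    cat "$SYSFS"            # current number of stripe cache entries
    echo 4096 > "$SYSFS"    # raise it (more entries, more memory used)
else
    echo "cannot write $SYSFS (need root, and $MD must be a raid5/6 array)"
fi
```

This is exactly the "become root and raise the size of the stripe cache" escape hatch: once enough stripe heads are available, the stalled writers can make progress again.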
Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
> heheh. it's really easy to reproduce the hang without the patch -- i could hang the box in under 20 min on 2.6.22+ w/ XFS and raid5 on 7x750GB. i'll try with ext3...
>
> Dan's experiences suggest it won't happen with ext3 (or is even rarer), which would explain why this is overall a rare problem.

Hmmm... how rare? http://marc.info/?l=linux-kernel&m=119461747005776&w=2

There is nothing specific that prevents other filesystems from hitting it; perhaps XFS is just better at submitting large i/o's. -stable should get some kind of treatment. I'll take altered performance over a hung system.

--
Dan
Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
On Wed, 16 Jan 2008 00:09:31 -0700 Dan Williams [EMAIL PROTECTED] wrote:
>> heheh. it's really easy to reproduce the hang without the patch -- i could hang the box in under 20 min on 2.6.22+ w/ XFS and raid5 on 7x750GB. i'll try with ext3...
>>
>> Dan's experiences suggest it won't happen with ext3 (or is even rarer), which would explain why this is overall a rare problem.
>
> Hmmm... how rare? http://marc.info/?l=linux-kernel&m=119461747005776&w=2
>
> There is nothing specific that prevents other filesystems from hitting it; perhaps XFS is just better at submitting large i/o's. -stable should get some kind of treatment. I'll take altered performance over a hung system.

We can always target 2.6.25-rc1 and then 2.6.24.1 if Neil is still feeling wimpy.