[BUGREPORT] The kernel thread for md RAID10 could cause a md RAID10 array deadlock
This message describes another md RAID10 issue, found by testing the 2.6.24 md RAID10 code with the new SCSI fault injection framework.

Abstract:
When a SCSI command timeout occurs during RAID10 recovery, the kernel threads for md RAID10 can deadlock the md RAID10 array. The nr_pending flag set during normal I/O and the barrier flag set by the recovery thread conflict, and as a result raid10d() and sync_request() deadlock.

Details:

  normal I/O                                recovery I/O
  ----------------------------------------  ----------------------------------------
                                            B-1. The kernel thread starts by calling
                                                 md_do_sync().
  A-1. A process issues a read request.
       make_request() for raid10 is called
       by the block layer.
                                            B-2. md_do_sync() calls the sync_request
                                                 operation for md raid10.
  A-2. In make_request(), wait_barrier()
       increments the nr_pending flag.
  A-3. A read command is issued to the
       disk, but it takes a long time
       because the disk does not respond.
                                            B-3. sync_request() of raid10 calls
                                                 raise_barrier(), increments the
                                                 barrier flag, and waits for the
                                                 nr_pending set in (A-2) to be
                                                 cleared.
  A-4. raid10_end_read_request() is called
       in the interrupt context. It
       detects the read error and wakes up
       the raid10d kernel thread.
  A-5. raid10d() calls freeze_array() and
       waits for the barrier flag
       incremented in (B-3) to be cleared.

  (** stalls here because the waiting conditions in A-5 and B-3 are never met **)

  A-6. raid10d calls fix_read_error() to
       handle the read error.
                                            B-4. The barrier flag will be cleared
                                                 after the pending barrier request
                                                 completes.
  A-7. The nr_pending flag will be cleared
       after the pending read request
       completes.

The deadlock mechanism:
When a normal I/O occurs during recovery, the nr_pending flag incremented in (A-2) blocks subsequent recovery I/O until the normal I/O completes. The recovery thread increments the barrier flag and waits for the nr_pending flag to be decremented (B-3). Normally, the nr_pending flag is decremented after the I/O has completed successfully. Likewise, the barrier flag is decremented after the barrier request (such as recovery I/O) has completed successfully.

If a normal read I/O results in a SCSI command timeout, the read request is handled by the error handler in the raid10d kernel thread, which calls freeze_array(). Because the barrier flag was already set in (B-3), freeze_array() waits for the barrier request to complete. Meanwhile, the recovery thread stalls waiting for the nr_pending flag to be decremented (B-3). In this way, the error handler and the recovery thread deadlock each other.

This problem can be reproduced with the new SCSI fault injection framework by simulating no response from the SCSI device.

I think the new SCSI fault injection framework is a little complicated to use, so I will upload some sample wrapper shell scripts to make it easier.

--
Kenichi TANAKA | Open Source Software Platform Development Division
               | Computers Software Operations Unit, NEC Corporation
               | [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
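
The circular wait in this report can be illustrated with a small model. This is a simplified sketch, not the kernel code: the `ArrayState` class and the three helper functions are hypothetical names, and the two wait conditions are taken from steps B-3 and A-5 as the report describes them, not from raid10.c.

```python
class ArrayState:
    """Toy stand-in for the two counters named in the report (hypothetical)."""

    def __init__(self, nr_pending=0, barrier=0):
        self.nr_pending = nr_pending  # normal I/O in flight, incremented in A-2
        self.barrier = barrier        # raised by the recovery thread in B-3


def recovery_can_proceed(state):
    # B-3 as described: raise_barrier() waits for nr_pending to be cleared.
    return state.nr_pending == 0


def error_handler_can_proceed(state):
    # A-5 as described: freeze_array() waits for the barrier to be cleared.
    return state.barrier == 0


def is_deadlocked(state):
    # Neither thread can make progress, and each condition can only be
    # satisfied by the other thread making progress.
    return (not recovery_can_proceed(state)
            and not error_handler_can_proceed(state))


# Replay the sequence from the report.
state = ArrayState()
state.nr_pending += 1  # A-2: wait_barrier() in make_request()
state.barrier += 1     # B-3: raise_barrier() in sync_request()
# A-4/A-5: the read times out and raid10d enters freeze_array().
print("deadlocked:", is_deadlocked(state))
```

In the model, the read from A-2 can only leave nr_pending once the error handler finishes, and the barrier from B-3 can only drop once recovery finishes, so the state reached after A-5 satisfies neither wait condition.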
Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock
Hi,

> Also, md raid10 seems to have the same problem.
> I will test raid10 applying this patch as well.

Sorry for the late response. I had trouble reproducing the problem, but it turns out that the 2.6.24 kernel needs the latest (possibly testing) version of SystemTap (systemtap-0.6.1-1) to run the fault injection tool.

I've reproduced the stall on both raid1 and raid10 using 2.6.24. I've also tested the patch applied to 2.6.24 and confirmed that it fixes the stall in both cases.

K.Tanaka wrote:
> Hi,
>
> Thank you for the patch.
> I have applied the patch to 2.6.23.14 and it works well.
>
> - In case of 2.6.23.14, the problem is reproduced.
> - In case of 2.6.23.14 with this patch, raid1 works well so far.
>   The fault injection script continues to run, and it doesn't deadlock.
>   I will keep it running for a while.
>
> Also, md raid10 seems to have the same problem.
> I will test raid10 applying this patch as well.
>
> Neil Brown wrote:
>> On Tuesday January 15, [EMAIL PROTECTED] wrote:
>>> This message describes the details about md-RAID1 issue found by
>>> testing the md RAID1 using the SCSI fault injection framework.
>>>
>>> Abstract:
>>> Both the error handler for md RAID1 and write access request to the
>>> md RAID1 use raid1d kernel thread. The nr_pending flag could cause a
>>> race condition in raid1d, results in a raid1d deadlock.
>>
>> Thanks for finding and reporting this.
>>
>> I believe the following patch should fix the deadlock.
>>
>> If you are able to repeat your test and confirm this I would
>> appreciate it.
>>
>> Thanks,
>> NeilBrown
>>
>> Fix deadlock in md/raid1 when handling a read error.
>>
>> When handling a read error, we freeze the array to stop any other
>> IO while attempting to over-write with correct data.
Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock
Hi,

Thank you for the patch.
I have applied the patch to 2.6.23.14 and it works well.

- In case of 2.6.23.14, the problem is reproduced.
- In case of 2.6.23.14 with this patch, raid1 works well so far.
  The fault injection script continues to run, and it doesn't deadlock.
  I will keep it running for a while.

Also, md raid10 seems to have the same problem.
I will test raid10 applying this patch as well.

Neil Brown wrote:
> On Tuesday January 15, [EMAIL PROTECTED] wrote:
>> This message describes the details about md-RAID1 issue found by
>> testing the md RAID1 using the SCSI fault injection framework.
>>
>> Abstract:
>> Both the error handler for md RAID1 and write access request to the
>> md RAID1 use raid1d kernel thread. The nr_pending flag could cause a
>> race condition in raid1d, results in a raid1d deadlock.
>
> Thanks for finding and reporting this.
>
> I believe the following patch should fix the deadlock.
>
> If you are able to repeat your test and confirm this I would
> appreciate it.
>
> Thanks,
> NeilBrown
>
> Fix deadlock in md/raid1 when handling a read error.
>
> When handling a read error, we freeze the array to stop any other
> IO while attempting to over-write with correct data.
>
> This is done in the raid1d thread and must wait for all submitted IO
> to complete (except for requests that failed and are sitting in the
> retry queue - these are counted in ->nr_queue and will stay there
> during a freeze).
>
> However write requests need attention from raid1d as bitmap updates
> might be required. This can cause a deadlock as raid1 is waiting for
> requests to finish that themselves need attention from raid1d.
>
> So we create a new function 'flush_pending_writes' to give that
> attention, and call it in freeze_array to be sure that we aren't
> waiting on raid1d.
>
> Thanks to K.Tanaka [EMAIL PROTECTED] for finding and reporting this
> problem.
>
> Cc: K.Tanaka [EMAIL PROTECTED]
> Signed-off-by: Neil Brown [EMAIL PROTECTED]
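
The idea in the patch description (have freeze_array give queued writes the attention they need while it waits) can be sketched with a toy single-threaded model. This is an illustration under stated assumptions, not the patch itself: `Raid1Model`, `make_request_write`, and `max_spins` are hypothetical, a flushed write is assumed to complete immediately, and the real wait condition also accounts for failed requests in the retry queue, which the model leaves out.

```python
from collections import deque


class Raid1Model:
    """Toy single-threaded model of raid1d; all names are illustrative."""

    def __init__(self):
        self.nr_pending = 0            # requests counted by wait_barrier()
        self.pending_writes = deque()  # writes waiting for raid1d to submit

    def make_request_write(self, name):
        # Count the write, then queue it for raid1d to submit later.
        self.nr_pending += 1
        self.pending_writes.append(name)

    def flush_pending_writes(self):
        # Submit every queued write. In this model a submitted write
        # completes at once, standing in for raid1_end_write_request().
        flushed = len(self.pending_writes)
        while self.pending_writes:
            self.pending_writes.popleft()
            self.nr_pending -= 1
        return flushed

    def freeze_array(self, flush, max_spins=5):
        # Wait for nr_pending to drain. max_spins bounds the loop so the
        # unfixed case reports failure instead of hanging forever.
        for _ in range(max_spins):
            if self.nr_pending == 0:
                return True
            if flush:
                self.flush_pending_writes()
        return self.nr_pending == 0


unfixed = Raid1Model()
unfixed.make_request_write("w1")
print("freeze completes without flush:", unfixed.freeze_array(flush=False))

fixed = Raid1Model()
fixed.make_request_write("w1")
print("freeze completes with flush:", fixed.freeze_array(flush=True))
```

Without the flush, the only code that could submit the queued write is the same thread that is spinning in `freeze_array`, so the count never drains. With the flush inside the wait loop, the thread services its own backlog and the freeze can complete.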
Re: [RFC] A SCSI fault injection framework using SystemTap.
> The new framework is tested on Fedora8(i386) running with kernel 2.6.23.12.
> So far, I'm cleaning up the tool set for release, and plan to post it in
> the near future.

Now it's ready. The SCSI fault injection tool is available from the following site:
https://sourceforge.net/projects/scsifaultinjtst/
If you have any comments, please let me know.

Additionally, the deadlock problem was also reproduced on md RAID10. I think it has the same cause as the RAID1 deadlock reported earlier, because raid10.c is based on raid1.c. To recap that report: the kernel thread for md RAID1 could cause a deadlock when the error handler for md RAID1 contends with write access to the md RAID1 array.

I've reproduced the deadlock on RAID10 using this tool with a small shell script that injects a fault repeatedly. However, I have not yet come up with a good idea for a patch to fix this problem.
Re: [RFC] A SCSI fault injection framework using SystemTap.
Matthew Wilcox wrote:
> On Tue, Jan 15, 2008 at 12:04:09PM +0900, K.Tanaka wrote:
>> I would like to introduce a SCSI fault injection framework using SystemTap.
>>
>> Currently, kernel has Fault-injection framework and Faulty mode for md,
>> which can also be used for testing the error handling. But, they could
>> only produce fixed type of errors stochastically. In order to simulate
>> more realistic scsi disk faults, I have created a new flexible fault
>> injection framework using SystemTap.
>
> How does it compare to using scsi_debug, which I believe can do all of
> the above and more?

Sorry for the lack of explanation. The new framework is meant to be used by a userspace testing tool (such as a shell script).

For ease of use, the framework lets the user designate the inode number of a target file on the device as the place to inject faults. When the target file is accessed through the page cache, a fault is injected. The user can also designate a logical block address as the target position of a fault injection.
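
Since the framework takes the inode number of a target file, a userspace test script first has to look that number up. The following is a minimal sketch of that lookup using `os.stat`; the helper name `inode_of` is hypothetical, and how the number is then handed to the fault injection framework is not described in this thread, so it is not shown.

```python
import os
import tempfile


def inode_of(path):
    """Return the inode number of `path` as reported by stat(2)."""
    return os.stat(path).st_ino


# Demonstration on a throwaway file. A real test would point at a file
# that lives on the md device under test.
with tempfile.NamedTemporaryFile() as tmp:
    demo_inode = inode_of(tmp.name)
    print(f"{tmp.name} has inode {demo_inode}")
```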
[BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock
This message describes the details of an md RAID1 issue found by testing md RAID1 with the SCSI fault injection framework.

Abstract:
Both the error handler for md RAID1 and write requests to the md RAID1 array use the raid1d kernel thread. The nr_pending flag can cause a race condition in raid1d, resulting in a raid1d deadlock.

Details:

  error handling                            write operation
  ----------------------------------------  ----------------------------------------
  A-1. Issue a read request.
  A-2. SCSI error detected.                 B-1. make_request() for raid1 starts.
  A-3. raid1_end_read_request() is called   B-2. make_request() calls wait_barrier()
       in the interrupt context. It              to increment the nr_pending flag.
       detects the read error and wakes
       up the raid1d kernel thread.
  A-4. raid1d wakes up.
  A-5. raid1d calls freeze_array() and      B-3. make_request() wakes up the raid1d
       waits for nr_pending to be                kernel thread to send the write
       decremented. That means stop IO           request to the lower layer.
       and wait for everything to go
       quiet.
                                            B-4. raid1d wakes up (already woken up
                                                 by A-3).

  ( process stalls here because A-5 never ends )

  A-6. raid1d calls fix_read_error() to     B-5. raid1d calls generic_make_request()
       handle the read error.                    for the write request.
                                            B-6. raid1_end_write_request() is called
                                                 in the interrupt context when the
                                                 write access is completed, and the
                                                 nr_pending flag is decremented.

The deadlock mechanism:
If raid1d, woken up by the detection of a read error (A-4), enters freeze_array() right after make_request() for a write request has incremented the nr_pending flag (B-2), raid1d stalls waiting for the nr_pending flag to be decremented (A-5). However, the nr_pending flag incremented by make_request() for the write request will never be decremented, because that can happen only after raid1d issues generic_make_request() (B-5, B-6), and raid1d is now stopped.

This problem can easily be reproduced with the new fault injection framework by simulating no response from the SCSI device. It could also occur without fault injection whenever the raid1 error handler contends with a write operation, though with low probability.

I will report the other problems after I clean up and post the code for the scsi fault injection framework.
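
The RAID1 stall can also be viewed as a cycle in a wait-for graph, where an edge X -> Y means "X cannot happen until Y has happened". The sketch below is a generic depth-first cycle search, not anything from the md code; the node labels are informal summaries of steps A-5, B-5 and B-6 from the report, and `find_cycle` is a hypothetical helper name.

```python
def find_cycle(wait_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None."""
    visiting = []  # current depth-first search path
    done = set()   # nodes already proven to lead to no cycle

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Found a back edge: slice out the loop and close it.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in wait_for.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in wait_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Edges taken from the report: freeze_array() needs nr_pending to drop,
# which needs the write to be submitted, which needs raid1d to be free.
wait_for = {
    "raid1d leaves freeze_array() (A-5)": ["nr_pending decremented (B-6)"],
    "nr_pending decremented (B-6)": ["raid1d submits the write (B-5)"],
    "raid1d submits the write (B-5)": ["raid1d leaves freeze_array() (A-5)"],
}

cycle = find_cycle(wait_for)
print(" -> ".join(cycle) if cycle else "no cycle")
```

Removing any one edge breaks the cycle, which is one way to think about possible fixes: for example, letting raid1d submit queued writes without first leaving `freeze_array()` removes the last edge.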