[BUGREPORT] The kernel thread for md RAID10 could cause a md RAID10 array deadlock

2008-02-13 Thread K.Tanaka
This message describes another md RAID10 issue, found by testing md RAID10
on 2.6.24 with the new SCSI fault injection framework.

Abstract:
When a SCSI command timeout occurs during RAID10 recovery, the kernel
threads for md RAID10 can deadlock the md RAID10 array.
The nr_pending flag set during normal I/O and the barrier flag set by the
recovery thread conflict, resulting in a deadlock between raid10d() and
sync_request().

Details (A = normal I/O, B = recovery I/O; steps in time order):

 B-1. The recovery kernel thread starts by calling md_do_sync().
 A-1. A process issues a read request; the block layer calls
      make_request() for raid10.
 B-2. md_do_sync() calls the sync_request operation for md raid10.
 A-2. In make_request(), wait_barrier() increments the nr_pending flag.
 A-3. A read command is issued to the disk, but it takes a long time
      because the disk does not respond.
 B-3. sync_request() of raid10 calls raise_barrier(), which increments
      the barrier flag and waits for the nr_pending flag set in (A-2)
      to be cleared.
 A-4. raid10_end_read_request() is called in interrupt context. It
      detects the read error and wakes up the raid10d kernel thread.
 A-5. raid10d() calls freeze_array() and waits for the barrier flag
      incremented in (B-3) to be cleared.

 (** stalls here because the wait conditions in A-5 and B-3 are never met **)

 A-6. raid10d calls fix_read_error() to handle the read error.
 B-4. The barrier flag will be cleared after the pending barrier
      request completes.
 A-7. The nr_pending flag will be cleared after the pending read
      request completes.
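The counter interplay in steps A-2, B-3 and A-5 can be sketched as a small C model. This is a hypothetical, single-threaded simplification that assumes the wait conditions exactly as this report describes them; the struct and the model_* functions are illustrative names, not raid10.c code. (In the kernel, the nr_pending and barrier "flags" are integer counters.)

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two counters named in the report
 * (not the kernel's struct definition). */
struct r10_model {
    int nr_pending; /* normal I/O in flight, raised by wait_barrier() (A-2) */
    int barrier;    /* recovery barrier, raised by raise_barrier() (B-3) */
};

/* A-2: normal I/O may start only while no barrier is raised.
 * Returns false where the real code would sleep instead. */
static bool model_wait_barrier(struct r10_model *m)
{
    if (m->barrier > 0)
        return false;
    m->nr_pending++;
    return true;
}

/* B-3: raise the barrier, then report whether the wait for
 * nr_pending == 0 is already satisfied. */
static bool model_raise_barrier(struct r10_model *m)
{
    m->barrier++;
    return m->nr_pending == 0;
}

/* A-5, as described in the report: raid10d may leave freeze_array()
 * only once the barrier raised in B-3 has been cleared. */
static bool model_raid10d_can_proceed(const struct r10_model *m)
{
    return m->barrier == 0;
}

/* B-3 wait condition: recovery may proceed once nr_pending drops to 0. */
static bool model_recovery_can_proceed(const struct r10_model *m)
{
    return m->nr_pending == 0;
}

/* Runs A-2, B-3 and A-5 in timeline order and reports whether both
 * waiters are stuck. Only raid10d (A-6/A-7) could lower nr_pending and
 * only the recovery thread (B-4) could lower barrier, so once both are
 * waiting, neither condition can ever change. */
static bool model_timeline_deadlocks(void)
{
    struct r10_model m = { .nr_pending = 0, .barrier = 0 };

    bool read_started = model_wait_barrier(&m);   /* A-2 */
    bool recovery_free = model_raise_barrier(&m); /* B-3 */
    assert(read_started && !recovery_free);

    return !model_raid10d_can_proceed(&m) &&      /* A-5 */
           !model_recovery_can_proceed(&m);
}
```

If the disk answers normally, the read completes and nr_pending drops back to 0, so model_recovery_can_proceed() becomes true; the model only captures the case where the failed read is parked for raid10d.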

The deadlock mechanism:
When normal I/O occurs during recovery, the nr_pending flag incremented in
(A-2) blocks subsequent recovery I/O until the normal I/O completes: the
recovery thread increments the barrier flag and waits for the nr_pending
flag to be decremented (B-3).

Normally, the nr_pending flag is decremented once the I/O has completed
successfully, and the barrier flag is decremented once the barrier request
(such as recovery I/O) has completed successfully.

If a normal read I/O results in a SCSI command timeout, however, the read
request is handled by the error handler in the raid10d kernel thread, which
calls freeze_array(). Because the barrier flag was set in (B-3),
freeze_array() waits for the barrier request to complete. Meanwhile, the
recovery thread is stalled waiting for the nr_pending flag to be
decremented (B-3). As a result, both the error handler and the recovery
thread are deadlocked.
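The circular wait above can also be checked mechanically as a wait-for graph: each party points at the party it is waiting on, and a deadlock is a cycle. The sketch below is a hypothetical C illustration; the three node labels are chosen here, and the edge from the failed read back to raid10d encodes the report's statement that the timed-out read is handled by raid10d's error handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical node labels for the parties in the RAID10 report. */
enum { RAID10D, RECOVERY, FAILED_READ, N_NODES };

#define NOT_WAITING (-1)

/* waits_on[a] == b means "a cannot make progress until b does".
 * Each party waits on at most one other, so following the chain from
 * a node for N_NODES steps either reaches a non-waiting node or, if
 * the node lies on a cycle, returns to it. */
static bool on_cycle(const int waits_on[N_NODES], int start)
{
    assert(start >= 0 && start < N_NODES);
    int cur = start;
    for (int step = 0; step < N_NODES; step++) {
        cur = waits_on[cur];
        if (cur == NOT_WAITING)
            return false;
        if (cur == start)
            return true;
    }
    return false;
}

/* The system is deadlocked if any party lies on a cycle. */
static bool deadlocked(const int waits_on[N_NODES])
{
    for (int n = 0; n < N_NODES; n++)
        if (on_cycle(waits_on, n))
            return true;
    return false;
}
```

In the stalled case the edges are RAID10D -> RECOVERY (A-5, barrier), RECOVERY -> FAILED_READ (B-3, nr_pending) and FAILED_READ -> RAID10D (error handling). When the read completes normally, the last edge disappears and the graph has no cycle.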

This problem can be reproduced with the new SCSI fault injection framework
by simulating a SCSI device that does not respond.
Because the new framework is somewhat complicated to use, I will upload
some sample wrapper shell scripts to make it easier.

-- 

-
Kenichi TANAKA| Open Source Software Platform Development Division
  | Computers Software Operations Unit, NEC Corporation
  | [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock

2008-01-29 Thread K.Tanaka
Hi,

Also, md raid10 seems to have the same problem.
I will test raid10 with this patch applied as well.

Sorry for the late response. I had trouble reproducing the problem,
but it turns out that running SystemTap for the fault injection tool on the
2.6.24 kernel requires the latest (possibly testing) version,
systemtap-0.6.1-1.

I've reproduced the stall on both raid1 and raid10 using 2.6.24.
I've also tested the patch applied to 2.6.24 and confirmed that
it fixes the stall in both cases.





Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock

2008-01-24 Thread K.Tanaka
Hi,

Thank you for the patch.
I have applied the patch to 2.6.23.14 and it works well.

- On plain 2.6.23.14, the problem reproduces.
- On 2.6.23.14 with this patch, raid1 works well so far.
  The fault injection script keeps running without deadlocking;
  I will leave it running for a while.

Also, md raid10 seems to have the same problem.
I will test raid10 with this patch applied as well.


Neil Brown wrote:
 On Tuesday January 15, [EMAIL PROTECTED] wrote:
 This message describes the details about md-RAID1 issue found by
 testing the md RAID1 using the SCSI fault injection framework.

 Abstract:
 Both the error handler for md RAID1 and write access requests to the md RAID1
 array use the raid1d kernel thread. The nr_pending flag can cause a race
 condition in raid1d, resulting in a raid1d deadlock.
 
 Thanks for finding and reporting this.
 
 I believe the following patch should fix the deadlock.
 
 If you are able to repeat your test and confirm this I would
 appreciate it.
 
 Thanks,
 NeilBrown
 
 
 
 Fix deadlock in md/raid1 when handling a read error.
 
 When handling a read error, we freeze the array to stop any other
 IO while attempting to over-write with correct data.
 
 This is done in the raid1d thread and must wait for all submitted IO
 to complete (except for requests that failed and are sitting in the
 retry queue - these are counted in ->nr_queue and will stay there during
 a freeze).
 
 However write requests need attention from raid1d as bitmap updates
 might be required.  This can cause a deadlock as raid1 is waiting for
 requests to finish that themselves need attention from raid1d.
 
 So we create a new function 'flush_pending_writes' to give that attention,
 and call it in freeze_array to be sure that we aren't waiting on raid1d.
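The ordering the patch description asks for (give raid1d's queued writes attention first, then wait) can be sketched as a minimal C model. It assumes a simplified quiescence test (everything still pending is parked on the retry queue), treats completion of a flushed write as immediate, and uses hypothetical names; it is neither the real freeze_array() wait condition nor the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state; only the idea of nr_queue comes from the patch text. */
struct r1_model {
    int nr_pending;     /* requests submitted and not yet finished */
    int nr_queue;       /* failed requests parked on the retry queue */
    int pending_writes; /* writes that still need raid1d to submit them */
};

/* Simplified quiescence test: everything still pending is parked on the
 * retry queue, so the array can be frozen. */
static bool model_quiescent(const struct r1_model *m)
{
    return m->nr_pending == m->nr_queue;
}

/* Models flush_pending_writes(): submit the queued writes. Completion is
 * treated as immediate here, which the real block layer does not promise. */
static void model_flush_pending_writes(struct r1_model *m)
{
    assert(m->pending_writes <= m->nr_pending);
    m->nr_pending -= m->pending_writes;
    m->pending_writes = 0;
}

/* Before the fix: raid1d only waits, so the queued writes are never
 * submitted and the wait cannot be satisfied. */
static bool model_freeze_array_before(const struct r1_model *m)
{
    return model_quiescent(m);
}

/* After the fix: give the queued writes attention first, then wait. */
static bool model_freeze_array_after(struct r1_model *m)
{
    model_flush_pending_writes(m);
    return model_quiescent(m);
}
```

With one failed read on the retry queue and one queued write, the "before" check can never succeed, while flushing first leaves only the parked read pending.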
 
 Thanks to K.Tanaka [EMAIL PROTECTED] for finding and reporting
 this problem.
 
 Cc: K.Tanaka [EMAIL PROTECTED]
 Signed-off-by: Neil Brown [EMAIL PROTECTED]
 


Re: [RFC] A SCSI fault injection framework using SystemTap.

2008-01-22 Thread K.Tanaka

The new framework has been tested on Fedora 8 (i386) running kernel 2.6.23.12.
I'm currently cleaning up the tool set for release and plan to post it in the
near future.

It's ready now. The SCSI fault injection tool is available from the following
site:
https://sourceforge.net/projects/scsifaultinjtst/
If you have any comments, please let me know.

Additionally, the deadlock problem also reproduces on md RAID10. Because
raid10.c is based on raid1.c, I think this problem has the same cause as the
RAID1 deadlock reported earlier, i.e.:
  - The kernel thread for md RAID1 could cause a deadlock when the error
    handler for md RAID1 contends with write access to the md RAID1 array.

I've reproduced the deadlock on RAID10 using this tool together with a small
shell script that injects a fault repeatedly, but so far I haven't come up
with a good idea for a patch to fix this problem.



Re: [RFC] A SCSI fault injection framework using SystemTap.

2008-01-15 Thread K.Tanaka
Matthew Wilcox wrote:
 On Tue, Jan 15, 2008 at 12:04:09PM +0900, K.Tanaka wrote:
 I would like to introduce a SCSI fault injection framework using SystemTap.

 Currently, the kernel has the fault-injection framework and the faulty mode
 for md, which can also be used to test error handling. However, they can
 only produce fixed types of errors stochastically. To simulate more
 realistic SCSI disk faults, I have created a new, flexible fault injection
 framework using SystemTap.
 
 How does it compare to using scsi_debug, which I believe can do all of
 the above and more?
 

 Sorry for not explaining this.
 The new framework is meant to be driven by a userspace testing tool
 (such as a shell script). For usability, the framework lets the user
 designate the inode number of a target file on the device; a fault is
 injected when that file is accessed through the page cache. The user can
 also designate a logical block address as the target position of a fault
 injection.
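How a block request might be matched against such a target can be sketched in C. The framework itself is implemented with SystemTap and its actual interface is not shown in this thread, so the structs, the should_inject() helper, the 512-byte-sector unit and the use of inode 0 to mean "no file" are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical target descriptor (not the framework's real interface). */
struct fault_target {
    bool by_inode;   /* true: match on inode number; false: match on LBA */
    uint64_t inode;  /* inode number of the target file */
    uint64_t lba;    /* target logical block address, in sectors (assumed) */
};

/* Hypothetical view of one block request. */
struct io_req {
    uint64_t inode;          /* inode behind the page-cache page, 0 if unknown */
    uint64_t start_lba;      /* first sector of the request */
    uint32_t nr_sectors;     /* request length in sectors */
};

/* Decide whether this request should get the injected fault. */
static bool should_inject(const struct fault_target *t, const struct io_req *r)
{
    assert(r->nr_sectors > 0); /* requests are assumed to be non-empty */

    if (t->by_inode)
        return r->inode != 0 && r->inode == t->inode;

    /* Does the half-open range [start_lba, start_lba + nr_sectors)
     * contain the target LBA? Written as a subtraction so that
     * start_lba + nr_sectors cannot overflow. */
    return t->lba >= r->start_lba && t->lba - r->start_lba < r->nr_sectors;
}
```

For example, with an LBA target of 1000, a 16-sector request starting at sector 992 matches, while an 8-sector request starting at 1008 does not.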



[BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock

2008-01-14 Thread K.Tanaka

This message describes the details of an md RAID1 issue found by
testing md RAID1 with the SCSI fault injection framework.

Abstract:
Both the error handler for md RAID1 and write requests to the md RAID1 array
use the raid1d kernel thread. The nr_pending flag can cause a race condition
in raid1d, resulting in a raid1d deadlock.

Details (A = error handling, B = write operation; steps in time order):

 A-1. Issue a read request.
 A-2. A SCSI error is detected.
 B-1. make_request() for raid1 starts.
 A-3. raid1_end_read_request() is called in interrupt context. It detects
      the read error and wakes up the raid1d kernel thread.
 B-2. make_request() calls wait_barrier() to increment the nr_pending flag.
 A-4. raid1d wakes up.
 A-5. raid1d calls freeze_array() and waits for nr_pending to be
      decremented. That means it stops IO and waits for everything to
      go quiet.
 B-3. make_request() wakes up the raid1d kernel thread to send the write
      request to the lower layer.
 B-4. raid1d wakes up (it was already woken up by A-3).

 (** the process stalls here because A-5 never ends **)

 A-6. raid1d calls fix_read_error() to handle the read error.
 B-5. raid1d calls generic_make_request() for the write request.
 B-6. raid1_end_write_request() is called in interrupt context when the
      write access completes, and the nr_pending flag is decremented.

The deadlock mechanism:
If raid1d, woken up by the read error (A-4), enters freeze_array() right
after make_request() for a write request has incremented the nr_pending
flag (B-2), raid1d stalls waiting for nr_pending to be decremented (A-5).
On the other hand, the nr_pending flag incremented by make_request() for the
write request will never be decremented, because it can only be decremented
after raid1d issues generic_make_request() (B-5, B-6), and raid1d is now
stopped.
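The core of the problem is that raid1d is a single thread: once it blocks in freeze_array(), it cannot also submit the write that would let freeze_array() return. A hypothetical C sketch of that ordering problem follows. It models raid1d as draining a queue of work items in order and counts only the write's contribution to nr_pending, as the report describes A-5; the names are illustrative, not raid1.c code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work items handled by the single raid1d thread. */
enum r1d_work { HANDLE_READ_ERROR, SUBMIT_WRITE };

struct r1_state {
    int nr_pending; /* write requests counted by wait_barrier() (B-2) */
};

/* Drains the queue in order. HANDLE_READ_ERROR models freeze_array()
 * (A-5): it may proceed only when nr_pending is 0. SUBMIT_WRITE models
 * B-5/B-6: the write is issued and, on completion, nr_pending drops.
 * Returns how many items were handled; it stops at the first item that
 * would block forever, since nothing else runs on this thread. */
static size_t run_raid1d(struct r1_state *s, const enum r1d_work *q, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (q[i] == HANDLE_READ_ERROR) {
            if (s->nr_pending != 0)
                break; /* A-5: waits for a write only raid1d can issue */
        } else {
            assert(s->nr_pending > 0);
            s->nr_pending--; /* B-5, B-6 */
        }
    }
    return i;
}
```

Handling the read error first stalls with the write still counted, whereas submitting the write first lets both items finish.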

This problem can easily be reproduced with the new fault injection framework
by simulating a SCSI device that does not respond.
However, it can also occur, with low probability, whenever the raid1 error
handler contends with a write operation.

I will report the other problems after I clean up and post the code for
the scsi fault injection framework.
