Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-15 Thread Andrew Morton
On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet [EMAIL PROTECTED] wrote:

 On Mon, 14 Jan 2008, NeilBrown wrote:
 
  
  raid5's 'make_request' function calls generic_make_request on
  underlying devices and if we run out of stripe heads, it could end up
  waiting for one of those requests to complete.
  This is bad as recursive calls to generic_make_request go on a queue
  and are not even attempted until make_request completes.
  
  So: don't make any generic_make_request calls in raid5 make_request
  until all waiting has been done.  We do this by simply setting
  STRIPE_HANDLE instead of calling handle_stripe().
  
  If we need more stripe_heads, raid5d will get called to process the
  pending stripe_heads which will call generic_make_request from a
  different thread where no deadlock will happen.
  
  
  This change by itself causes a performance hit.  So add a change so
  that raid5_activate_delayed is only called at unplug time, never in
  raid5.  This seems to bring back the performance numbers.  Calling it
  in raid5d was sometimes too soon...
  
  Cc: Dan Williams [EMAIL PROTECTED]
  Signed-off-by: Neil Brown [EMAIL PROTECTED]
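
The shape of that change, in outline (an illustrative sketch of the
deferral pattern described above, not the literal patch; STRIPE_HANDLE,
release_stripe() and raid5d are raid5's existing primitives):

	/*
	 * In make_request(): don't call handle_stripe() here, as it may
	 * recurse into generic_make_request() and deadlock.  Mark the
	 * stripe instead and let raid5d submit the I/O from its own
	 * thread once all waiting is done.
	 */
	set_bit(STRIPE_HANDLE, &sh->state);
	release_stripe(sh);	/* hands the stripe to raid5d */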
 
 probably doesn't matter, but for the record:
 
 Tested-by: dean gaudet [EMAIL PROTECTED]
 
 this time i tested with internal and external bitmaps and it survived 8h 
 and 14h resp. under the parallel tar workload i used to reproduce the 
 hang.
 
 btw this should probably be a candidate for 2.6.22 and .23 stable.
 

hm, Neil said

  The first fixes a bug which could make it a candidate for 24-final. 
  However it is a deadlock that seems to occur very rarely, and has been in
  mainline since 2.6.22.  So letting it into one more release shouldn't be
  a big problem.  While the fix is fairly simple, it could have some
  unexpected consequences, so I'd rather go for the next cycle.

food fight!


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-15 Thread Andrew Morton
On Wed, 16 Jan 2008 00:09:31 -0700 Dan Williams [EMAIL PROTECTED] wrote:

  heheh.
 
  it's really easy to reproduce the hang without the patch -- i could
  hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
 i'll try with ext3... Dan's experiences suggest it won't happen with ext3
 (or is even more rare), which would explain why this is overall a
 rare problem.
 
 
 Hmmm... how rare?
 
http://marc.info/?l=linux-kernel&m=119461747005776&w=2
 
 There is nothing specific that prevents other filesystems from hitting
 it, perhaps XFS is just better at submitting large i/o's.  -stable
 should get some kind of treatment.  I'll take altered performance over
 a hung system.

We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling
wimpy.



Re: [PATCH 001 of 7] md: Support 'external' metadata for md arrays.

2007-12-25 Thread Andrew Morton
On Fri, 14 Dec 2007 17:26:08 +1100 NeilBrown [EMAIL PROTECTED] wrote:

 + if (strncmp(buf, "external:", 9) == 0) {
 + int namelen = len-9;
 + if (namelen >= sizeof(mddev->metadata_type))
 + namelen = sizeof(mddev->metadata_type)-1;
 + strncpy(mddev->metadata_type, buf+9, namelen);
 + mddev->metadata_type[namelen] = 0;
 + if (namelen && mddev->metadata_type[namelen-1] == '\n')
 + mddev->metadata_type[--namelen] = 0;
 + mddev->persistent = 0;
 + mddev->external = 1;

size_t would be a more appropriate type for `namelen'.
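
Concretely, that suggestion would look something like this (a sketch only,
assuming the surrounding md.c code stays as quoted above):

	if (strncmp(buf, "external:", 9) == 0) {
		size_t namelen = len - 9;	/* size_t, as suggested */

		if (namelen >= sizeof(mddev->metadata_type))
			namelen = sizeof(mddev->metadata_type) - 1;
		strncpy(mddev->metadata_type, buf + 9, namelen);
		mddev->metadata_type[namelen] = 0;
		if (namelen && mddev->metadata_type[namelen - 1] == '\n')
			mddev->metadata_type[--namelen] = 0;
		mddev->persistent = 0;
		mddev->external = 1;
	}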


Re: [PATCH 004 of 7] md: Allow devices to be shared between md arrays.

2007-12-25 Thread Andrew Morton
On Fri, 14 Dec 2007 17:26:28 +1100 NeilBrown [EMAIL PROTECTED] wrote:

 + mddev_unlock(rdev->mddev);
 + ITERATE_MDDEV(mddev, tmp) {
 + mdk_rdev_t *rdev2;
 +
 + mddev_lock(mddev);
 + ITERATE_RDEV(mddev, rdev2, tmp2)
 + if (test_bit(AllReserved, &rdev2->flags) ||
 + (rdev->bdev == rdev2->bdev &&
 +  rdev != rdev2 &&
 +  overlaps(rdev->data_offset, rdev->size,
 + rdev2->data_offset, rdev2->size))) {
 + overlap = 1;
 + break;
 + }
 + mddev_unlock(mddev);
 + if (overlap) {
 + mddev_put(mddev);
 + break;
 + }
 + }

eww, ITERATE_MDDEV() and ITERATE_RDEV() are an eyesore.

for_each_mddev() and for_each_rdev() would at least mean the reader doesn't
need to check the implementation when wondering what that `break' is
breaking from.
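
A minimal sketch of that rename (purely cosmetic wrappers over the existing
iterators, shown for illustration; not a patch anyone has posted):

	#define for_each_mddev(mddev, tmp)	ITERATE_MDDEV(mddev, tmp)
	#define for_each_rdev(mddev, rdev, tmp)	ITERATE_RDEV(mddev, rdev, tmp)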

  #define  In_sync 2   /* device is in_sync with rest of array */
  #define  WriteMostly 4   /* Avoid reading if at all possible */
  #define  BarriersNotsupp 5   /* BIO_RW_BARRIER is not supported */
 +#define  AllReserved 6   /* If whole device is reserved for

The naming style here is inconsistent.

A task for the keen would be to convert these to an enum and add some
namespacing prefix to them.  
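
For instance (an illustrative sketch with a hypothetical MD_RDEV_ prefix,
not a naming anyone has agreed on):

	enum rdev_flag {
		MD_RDEV_IN_SYNC          = 2,	/* in_sync with rest of array */
		MD_RDEV_WRITE_MOSTLY     = 4,	/* avoid reading if possible */
		MD_RDEV_BARRIERS_NOTSUPP = 5,	/* BIO_RW_BARRIER unsupported */
		MD_RDEV_ALL_RESERVED     = 6,	/* whole device is reserved */
	};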


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-06 Thread Andrew Morton
On Sat, 1 Dec 2007 06:26:08 -0500 (EST)
Justin Piszcz [EMAIL PROTECTED] wrote:

 I am putting a new machine together and I have dual raptor raid 1 for the 
 root, which works just fine under all stress tests.
 
 Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on 
 sale nowadays):
 
 I ran the following:
 
 dd if=/dev/zero of=/dev/sdc
 dd if=/dev/zero of=/dev/sdd
 dd if=/dev/zero of=/dev/sde
 
 (as it is always a very good idea to do this with any new disk)
 
 And sometime along the way(?) (i had gone to sleep and let it run), this 
 occurred:
 
 [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 action 0x2 frozen

Gee we're seeing a lot of these lately.

 [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
 [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
 [42880.680292]  res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 (ATA bus error)
 [42881.841899] ata3: soft resetting port
 [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 [42915.919042] ata3.00: qc timeout (cmd 0xec)
 [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
 [42915.919149] ata3.00: revalidation failed (errno=-5)
 [42915.919206] ata3: failed to recover some devices, retrying in 5 secs
 [42920.912458] ata3: hard resetting port
 [42926.411363] ata3: port is slow to respond, please be patient (Status 0x80)
 [42930.943080] ata3: COMRESET failed (errno=-16)
 [42930.943130] ata3: hard resetting port
 [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 [42931.413523] ata3.00: configured for UDMA/133
 [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
 [42931.413655] ata3: EH complete
 [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors (750156 MB)
 [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
 [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
 [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 
 Usually when I see this sort of thing with another box I have full of 
 raptors, it was due to a bad raptor and I never saw it again after I 
 replaced the disk that it happened on, but that was using the Intel P965 
 chipset.
 
 For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of 
 the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).
 
 I am going to do some further testing but does this indicate a bad drive? 
 Bad cable?  Bad connector?
 
 As you can see above, /dev/sdc stopped responding for a little bit and 
 then the kernel reset the port.
 
 Why is this though?  What is the likely root cause?  Should I replace the 
 drive?  Obviously this is not normal and cannot be good at all, the idea 
 is to put these drives in a RAID5 and if one is going to timeout that is 
 going to cause the array to go degraded and thus be worthless in a raid5 
 configuration.
 
 Can anyone offer any insight here?

It would be interesting to try 2.6.21 or 2.6.22.



Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-06 Thread Andrew Morton
On Thu, 6 Dec 2007 17:38:08 -0500 (EST)
Justin Piszcz [EMAIL PROTECTED] wrote:

 
 
 On Thu, 6 Dec 2007, Andrew Morton wrote:
 
  On Sat, 1 Dec 2007 06:26:08 -0500 (EST)
  Justin Piszcz [EMAIL PROTECTED] wrote:
 
  I am putting a new machine together and I have dual raptor raid 1 for the
  root, which works just fine under all stress tests.
 
  Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on
  sale nowadays):
 
  I ran the following:
 
  dd if=/dev/zero of=/dev/sdc
  dd if=/dev/zero of=/dev/sdd
  dd if=/dev/zero of=/dev/sde
 
  (as it is always a very good idea to do this with any new disk)
 
  And sometime along the way(?) (i had gone to sleep and let it run), this
  occurred:
 
  [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 action 0x2 frozen
 
  Gee we're seeing a lot of these lately.
 
  [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
  [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb
  0x0 data 512 in
  [42880.680292]  res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10
  (ATA bus error)
  [42881.841899] ata3: soft resetting port
  [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
  [42915.919042] ata3.00: qc timeout (cmd 0xec)
  [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
  [42915.919149] ata3.00: revalidation failed (errno=-5)
  [42915.919206] ata3: failed to recover some devices, retrying in 5 secs
  [42920.912458] ata3: hard resetting port
  [42926.411363] ata3: port is slow to respond, please be patient (Status 0x80)
  [42930.943080] ata3: COMRESET failed (errno=-16)
  [42930.943130] ata3: hard resetting port
  [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
  [42931.413523] ata3.00: configured for UDMA/133
  [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
  [42931.413655] ata3: EH complete
  [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors (750156 MB)
  [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
  [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
  [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 
  Usually when I see this sort of thing with another box I have full of
  raptors, it was due to a bad raptor and I never saw it again after I
  replaced the disk that it happened on, but that was using the Intel P965
  chipset.
 
  For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of
  the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).
 
  I am going to do some further testing but does this indicate a bad drive?
  Bad cable?  Bad connector?
 
  As you can see above, /dev/sdc stopped responding for a little bit and
  then the kernel reset the port.
 
  Why is this though?  What is the likely root cause?  Should I replace the
  drive?  Obviously this is not normal and cannot be good at all, the idea
  is to put these drives in a RAID5 and if one is going to timeout that is
  going to cause the array to go degraded and thus be worthless in a raid5
  configuration.
 
  Can anyone offer any insight here?
 
  It would be interesting to try 2.6.21 or 2.6.22.
 
 
 This was due to NCQ issues (disabling it fixed the problem).
 

I cannot locate any further email discussion on this topic.

Disabling NCQ at either compile time or runtime is not a fix and further
work should be done here to make the kernel run acceptably on that
hardware.


Re: [PATCH 2.6.23-rc7 0/3] async_tx and md-accel fixes for 2.6.23

2007-09-21 Thread Andrew Morton
On Thu, 20 Sep 2007 18:27:35 -0700
Dan Williams [EMAIL PROTECTED] wrote:

 Fix a couple bugs and provide documentation for the async_tx api.
 
 Neil, please 'ack' patch #3.
 
 git://lost.foo-projects.org/~dwillia2/git/iop async-tx-fixes-for-linus

Well it looks like Neil is on vacation or is hiding from us or something.

In which case our options would be to go ahead and merge your #3 with
our fingers crossed, or wait and do it in 2.6.23.1 or .2.

How serious is this bug which you're fixing?  Will it affect users
of legacy MD setups, or only those who have enabled fancy dma-engine
features?

Either way, we need to get #2 (at least) in to Linus.


Re: md: raid10: fix use-after-free of bio

2007-07-29 Thread Andrew Morton
On Fri, 27 Jul 2007 16:46:23 +0200 Maik Hampel [EMAIL PROTECTED] wrote:

 In case of read errors raid10d tries to print a nice error message,
 unfortunately using data from an already put bio.
 
 Signed-off-by: Maik Hampel [EMAIL PROTECTED]
 
 diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
 index f730a14..ea1b3e3 100644
 --- a/drivers/md/raid10.c
 +++ b/drivers/md/raid10.c
 @@ -1557,7 +1557,6 @@ static void raid10d(mddev_t *mddev)
   bio = r10_bio->devs[r10_bio->read_slot].bio;
   r10_bio->devs[r10_bio->read_slot].bio =
   mddev->ro ? IO_BLOCKED : NULL;
 - bio_put(bio);
   mirror = read_balance(conf, r10_bio);
   if (mirror == -1) {
   printk(KERN_ALERT "raid10: %s: unrecoverable I/O
 @@ -1567,6 +1566,7 @@ static void raid10d(mddev_t *mddev)
   raid_end_bio_io(r10_bio);
   } else {
   const int do_sync = bio_sync(r10_bio->master_bio);
 + bio_put(bio);
   rdev = conf->mirrors[mirror].rdev;
   if (printk_ratelimit())
   printk(KERN_ERR "raid10: %s: redirecting sector %llu to
 
 

Surely we just leaked that bio if (mirror == -1)?

better:

--- a/drivers/md/raid10.c~md-raid10-fix-use-after-free-of-bio
+++ a/drivers/md/raid10.c
@@ -1534,7 +1534,6 @@ static void raid10d(mddev_t *mddev)
bio = r10_bio->devs[r10_bio->read_slot].bio;
r10_bio->devs[r10_bio->read_slot].bio =
mddev->ro ? IO_BLOCKED : NULL;
-   bio_put(bio);
mirror = read_balance(conf, r10_bio);
if (mirror == -1) {
printk(KERN_ALERT "raid10: %s: unrecoverable I/O
@@ -1542,8 +1541,10 @@ static void raid10d(mddev_t *mddev)
   bdevname(bio->bi_bdev,b),
   (unsigned long long)r10_bio->sector);
raid_end_bio_io(r10_bio);
+   bio_put(bio);
} else {
const int do_sync = bio_sync(r10_bio->master_bio);
+   bio_put(bio);
rdev = conf->mirrors[mirror].rdev;
if (printk_ratelimit())
printk(KERN_ERR "raid10: %s: redirecting sector %llu to
_



Re: [GIT PATCH 0/2] stripe-queue for 2.6.23 consideration

2007-07-23 Thread Andrew Morton
On Sun, 22 Jul 2007 02:44:52 -0700
Dan Williams [EMAIL PROTECTED] wrote:

 The stripe-queue patches are showing solid performance improvement.
 
   git://lost.foo-projects.org/~dwillia2/git/iop md-for-linus

It's a slippery little tree that one.  I was using md-accel-linus, only I
nuked it because it was empty, or was causing problems or something.

Oh well, I shall switch to md-for-linus and shall see what happens.


Re: [-mm PATCH 0/2] 74% decrease in dispatched writes, stripe-queue take3

2007-07-13 Thread Andrew Morton
On Fri, 13 Jul 2007 15:35:42 -0700
Dan Williams [EMAIL PROTECTED] wrote:

 The following patches replace the stripe-queue patches currently in -mm.

I have a little practical problem here: am presently unable to compile
anything much due to all the git rejects coming out of git-md-accel.patch.

It'd be appreciated if you could keep on top of that, please.  It's a common
problem at this time of the kernel cycle.  The quilt trees are much worse -
Greg's stuff is an unholy mess.  Ho hum.


Re: [-mm PATCH 0/2] 74% decrease in dispatched writes, stripe-queue take3

2007-07-13 Thread Andrew Morton
On Fri, 13 Jul 2007 15:57:26 -0700
Williams, Dan J [EMAIL PROTECTED] wrote:

  -Original Message-
  From: Andrew Morton [mailto:[EMAIL PROTECTED]
   The following patches replace the stripe-queue patches currently in
 -mm.
  
  I have a little practical problem here: am presently unable to compile
  anything much due to all the git rejects coming out of
 git-md-accel.patch.
  
  It'd be appreciated if you could keep on top of that, please.  It's a
 common
  problem at this time of the kernel cycle.  The quilt trees are much
 worse -
  Greg's
  stuff is an unholy mess.  Ho hum.
 
 Sorry, please drop git-md-accel.patch and git-ioat.patch as they have
 been merged into Linus' tree.

But your ongoing maintenance activity will continue to be held in those
trees, won't it?


Re: [-mm PATCH 0/2] 74% decrease in dispatched writes, stripe-queue take3

2007-07-13 Thread Andrew Morton
On Fri, 13 Jul 2007 16:28:30 -0700
Williams, Dan J [EMAIL PROTECTED] wrote:

  -Original Message-
  From: Andrew Morton [mailto:[EMAIL PROTECTED]
  
  But your ongoing maintenance activity will continue to be held in
 those
  trees, won't it?
 
 For now:
   git://lost.foo-projects.org/~dwillia2/git/iop
 ioat-md-accel-for-linus
 
 is where the latest combined tree is located.  However, Shannon Nelson
 is coming online to own the i/oat driver so we may need to revisit this
 situation.  We want to avoid the git-ioat/git-md-accel collisions that
 happened in the past.  I will talk with Shannon about how we will
 coordinate this going forward.
 
 The code ownership looks like this:
 ioat dma driver - Shannon
 net dma offload implementation - Shannon
 dmaengine core - shared
 async_tx api - shared
 iop-adma dma driver - Dan
 md-accel implementation - Dan

oh my, how scary.  I'll go into hiding until the dust has settled.  Please
send me the git URLs when it's all set up, thanks.



Re: 2.6.21-mm2 boot failure, raid autodetect, bd_set_size+0xb/0x80

2007-05-11 Thread Andrew Morton
On Fri, 11 May 2007 20:03:34 +0200
[EMAIL PROTECTED] wrote:

 From: Andrew Morton [EMAIL PROTECTED]
 Date: Wed, May 09, 2007 at 01:23:22AM -0700
  
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21/2.6.21-mm2/
  
  
 It won't boot here.
 
 AMD64 platform, raid6 partition.
 
 2.6.21-rc7-mm2 runs fine, its dmesg says:
 
 md: created md4
 md: bind<hda9>
 md: bind<hdc9>
 md: running: <hdc9><hda9>
 raid1: raid set md4 active with 2 out of 2 mirrors
 md4: bitmap initialized from disk: read 10/10 pages, set 46 bits, status: 0
 created bitmap (156 pages) for device md4
 md: considering hdc8 ...
 md:  adding hdc8 ...
 <snip>
 
 
 where 2.6.21-mm2 halts with
 
 md: created md4
 md: bind<hda9>
 md: bind<hdc9>
 md: running: <hdc9><hda9>
 raid1: raid set md4 active with 2 out of 2 mirrors
 md4: bitmap initialized from disk: read 10/10 pages, set 46 bits, status: 0
 created bitmap (156 pages) for device md4
 Unable to handle kernel null pointer dereference 
 
 bd_set_size+0xb/0x80

Yes, Neil had a whoops and a dud patch spent a day in mainline.

Hopefully the below revert (from mainline) will fix it.


From: Linux Kernel Mailing List [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Revert md: improve partition detection in md array
Date:   Thu, 10 May 2007 01:59:03 GMT
Sender: [EMAIL PROTECTED]

Gitweb: 
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=44ce6294d07555c3d313757105fd44b78208407f
Commit: 44ce6294d07555c3d313757105fd44b78208407f
Parent: 497f050c42e46a4b1f6a9bcd8827fa5d97fe1feb
Author: Linus Torvalds [EMAIL PROTECTED]
AuthorDate: Wed May 9 18:51:36 2007 -0700
Committer:  Linus Torvalds [EMAIL PROTECTED]
CommitDate: Wed May 9 18:51:36 2007 -0700

Revert md: improve partition detection in md array

This reverts commit 5b479c91da90eef605f851508744bfe8269591a0.

Quoth Neil Brown:

  It causes an oops when auto-detecting raid arrays, and it doesn't
   seem easy to fix.

   The array may not be 'open' when do_md_run is called, so
   bdev->bd_disk might be NULL, so bd_set_size can oops.

   This whole approach of opening an md device before it has been
   assembled just seems to get more and more painful.  I think I'm going
   to have to come up with something clever to provide both backward
   compatibility with usage expectation, and sane integration into the
   rest of the kernel.

Signed-off-by: Linus Torvalds [EMAIL PROTECTED]
---
 drivers/md/md.c   |   26 ++
 drivers/md/raid1.c|1 +
 drivers/md/raid5.c|2 ++
 include/linux/raid/md_k.h |1 +
 4 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2901d0c..65814b0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3104,7 +3104,6 @@ static int do_md_run(mddev_t * mddev)
struct gendisk *disk;
struct mdk_personality *pers;
char b[BDEVNAME_SIZE];
-   struct block_device *bdev;
 
if (list_empty(&mddev->disks))
/* cannot run an array with no devices.. */
@@ -3332,13 +3331,7 @@ static int do_md_run(mddev_t * mddev)
md_wakeup_thread(mddev->thread);
md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
 
-   bdev = bdget_disk(mddev->gendisk, 0);
-   if (bdev) {
-   bd_set_size(bdev, mddev->array_size << 1);
-   blkdev_ioctl(bdev->bd_inode, NULL, BLKRRPART, 0);
-   bdput(bdev);
-   }
-
+   mddev->changed = 1;
md_new_event(mddev);
kobject_uevent(&mddev->gendisk->kobj, KOBJ_CHANGE);
return 0;
@@ -3460,6 +3453,7 @@ static int do_md_stop(mddev_t * mddev, int mode)
mddev->pers = NULL;
 
set_capacity(disk, 0);
+   mddev->changed = 1;
 
if (mddev->ro)
mddev->ro = 0;
@@ -4599,6 +4593,20 @@ static int md_release(struct inode *inode, struct file * file)
return 0;
 }
 
+static int md_media_changed(struct gendisk *disk)
+{
+   mddev_t *mddev = disk->private_data;
+
+   return mddev->changed;
+}
+
+static int md_revalidate(struct gendisk *disk)
+{
+   mddev_t *mddev = disk->private_data;
+
+   mddev->changed = 0;
+   return 0;
+}
 static struct block_device_operations md_fops =
 {
.owner  = THIS_MODULE,
@@ -4606,6 +4614,8 @@ static struct block_device_operations md_fops =
.release= md_release,
.ioctl  = md_ioctl,
.getgeo = md_getgeo,
+   .media_changed  = md_media_changed,
+   .revalidate_disk= md_revalidate,
 };
 
 static int md_thread(void * arg)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1b7130c..97ee870 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2063,6 +2063,7 @@ static int raid1_resize(mddev_t *mddev, sector_t sectors)
 */
mddev->array_size

Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test

2007-05-10 Thread Andrew Morton
On Thu, 10 May 2007 16:22:31 +1000 NeilBrown [EMAIL PROTECTED] wrote:

 The test currently looks for any (non-fuzz) difference, either
 positive or negative.  This clearly is not needed.  Any non-sync
 activity will cause the total sectors to grow faster than the sync_io
 count (never slower) so we only need to look for positive differences.
 
 ...

 --- .prev/drivers/md/md.c 2007-05-10 15:51:54.0 +1000
 +++ ./drivers/md/md.c 2007-05-10 16:05:10.0 +1000
 @@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
*
* Note: the following is an unsigned comparison.
*/
 - if ((curr_events - rdev->last_events + 4096) > 8192) {
 + if ((long)curr_events - (long)rdev->last_events > 4096) {
   rdev->last_events = curr_events;
   idle = 0;

In which case would unsigned counters be more appropriate?
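
The signedness is the point of the fix: the two counters are sampled at
different times, so curr_events can briefly lag last_events by a few
sectors, and in unsigned arithmetic that small negative difference wraps
to a huge positive value.  A standalone illustration of the trap, with
hypothetical counter values (sketch only):

	#include <stdio.h>

	int main(void)
	{
		unsigned long curr = 1000, last = 1004;

		/* unsigned: wraps to a huge value, "> 4096" fires spuriously */
		printf("%lu\n", curr - last);
		/* signed, as in the patch: -4, safely below the threshold */
		printf("%ld\n", (long)curr - (long)last);
		return 0;
	}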


Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)

2007-05-10 Thread Andrew Morton
On Thu, 10 May 2007 16:51:31 +0200 (MEST) Jan Engelhardt [EMAIL PROTECTED] 
wrote:

 On May 9 2007 18:51, Linus Torvalds wrote:
 
 (But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is probably 
 some strange mixup of Andrew Morton and Andi Kleen in your mind ;)
 
 What do the letters kp stand for?
 

Some say Kernel Programmer.  My parents said Keith Paul.


Re: RAID1 out of memory error, was Re: 2.6.21-rc5-mm4

2007-04-05 Thread Andrew Morton
On Fri, 06 Apr 2007 02:33:03 +1000
Reuben Farrelly [EMAIL PROTECTED] wrote:

 Hi,
 
 On 3/04/2007 3:47 PM, Andrew Morton wrote:
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/
  
  - The oops in git-net.patch has been fixed, so that tree has been restored. 
It is huge.
  
  - Added the device-mapper development tree to the -mm lineup (Alasdair
Kergon).  It is a quilt tree, living at
ftp://ftp.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/.
  
  - Added davidel's signalfd stuff.
 
 Looks like some damage, or maybe intolerance to on-disk damage, to RAID-1.
 
 md1 is the first array on the disk, and it refuses to start up on boot,
 or after boot.
 
 ...
 
 tornado ~ # mdadm --assemble /dev/md1 /dev/sda1 /dev/sdc1
 mdadm: device /dev/md1 already active - cannot assemble it
 tornado ~ # mdadm --run /dev/md1
 mdadm: failed to run array /dev/md1: Cannot allocate memory
 tornado ~ #
 
 and looking at a dmesg, this is logged:
 
 md: bind<sdc1>
 md: bind<sda1>
 raid1: raid set md1 active with 2 out of 2 mirrors
 md1: bitmap initialized from disk: read 0/1 pages, set 0 bits, status: -12
 md1: failed to create bitmap (-12)
 md: pers->run() failed ...
 
 tornado ~ # uname -a
 Linux tornado 2.6.21-rc5-mm4 #1 SMP Thu Apr 5 23:47:42 EST 2007 x86_64 
 Intel(R) 
 Pentium(R) 4 CPU 3.00GHz GenuineIntel GNU/Linux
 tornado ~ #
 
 The last known version that worked was 2.6.21-rc3-mm1 - I haven't been
 testing out the -mm releases so much lately.

OK.  I assume that bitmap->chunks in bitmap_init_from_disk() has some
unexpectedly large value.

I don't _think_ there's anything in -mm which would have triggered this. 
Does mainline do the same thing?

I guess it's possible that the code in git-md-accel.patch accidentally
broke things.  Perhaps try disabling CONFIG_DMA_ENGINE?

 Also, Andrew, can you please restart posting/cc'ing your -mm announcements to 
 the [EMAIL PROTECTED] list?  Seems this stopped around about 
 2.6.20, it was handy.

hm.  I always Bcc [EMAIL PROTECTED]  I assume that its
filters didn't get updated after the s/osdl/linux-foundation/ thing.  I'll
talk to people, thanks.

 .config is up at http://www.reub.net/files/kernel/configs/2.6.21-rc5-mm4
 



Re: [PATCH] md: Avoid a deadlock when removing a device from an md array via sysfs.

2007-04-02 Thread Andrew Morton
On Mon, 2 Apr 2007 17:44:17 +1000 NeilBrown [EMAIL PROTECTED] wrote:

 (This patch should go in 2.6.21 as it fixes a recent regression - NB)
 
 A device can be removed from an md array via e.g.
   echo remove > /sys/block/md3/md/dev-sde/state
 
 This will try to remove the 'dev-sde' subtree which will deadlock
 since
   commit e7b0d26a86943370c04d6833c6edba2a72a6e240
 
 With this patch we run the kobject_del via schedule_work so as to
 avoid the deadlock.
 
 Cc: Alan Stern [EMAIL PROTECTED]
 Signed-off-by: Neil Brown [EMAIL PROTECTED]
 
 ### Diffstat output
  ./drivers/md/md.c   |   13 -
  ./include/linux/raid/md_k.h |1 +
  2 files changed, 13 insertions(+), 1 deletion(-)
 
 diff .prev/drivers/md/md.c ./drivers/md/md.c
 --- .prev/drivers/md/md.c 2007-04-02 17:43:03.0 +1000
 +++ ./drivers/md/md.c 2007-04-02 17:38:46.0 +1000
 @@ -1389,6 +1389,12 @@ static int bind_rdev_to_array(mdk_rdev_t
   return err;
  }
  
 +static void delayed_delete(struct work_struct *ws)
 +{
 + mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
 + kobject_del(&rdev->kobj);
 +}
 +
  static void unbind_rdev_from_array(mdk_rdev_t * rdev)
  {
   char b[BDEVNAME_SIZE];
 @@ -1401,7 +1407,12 @@ static void unbind_rdev_from_array(mdk_r
   printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
   rdev->mddev = NULL;
   sysfs_remove_link(&rdev->kobj, "block");
 - kobject_del(&rdev->kobj);
 +
 + /* We need to delay this, otherwise we can deadlock when
 +  * writing to 'remove' to dev/state
 +  */
 + INIT_WORK(&rdev->del_work, delayed_delete);
 + schedule_work(&rdev->del_work);
  }
  
  /*
 
 diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
 --- .prev/include/linux/raid/md_k.h   2007-04-02 17:43:03.0 +1000
 +++ ./include/linux/raid/md_k.h   2007-04-02 17:36:32.0 +1000
 @@ -104,6 +104,7 @@ struct mdk_rdev_s
  * for reporting to userspace and storing
  * in superblock.
  */
 + struct work_struct del_work;/* used for delayed sysfs removal */
  };
  

What guarantees that *rdev is still valid when delayed_delete() runs?

And what guarantees that the md module hasn't been rmmodded when
delayed_delete() tries to run?
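
One conventional way to close both races, sketched for illustration only
(flush_scheduled_work() is the stock workqueue API of this era; the helper
name is hypothetical and this is not necessarily what md ended up doing):

	/* hypothetical teardown path */
	static void rdev_free(mdk_rdev_t *rdev)
	{
		/* wait for any pending delayed_delete() before *rdev is
		 * freed; module exit would likewise need to flush before
		 * md's code can be unloaded */
		flush_scheduled_work();
		kfree(rdev);
	}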



Re: [PATCH] md: Fix for raid6 reshape.

2007-03-01 Thread Andrew Morton
On Fri, 2 Mar 2007 15:56:55 +1100 NeilBrown [EMAIL PROTECTED] wrote:

 - conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
 + conf->expand_progress = (sector_nr + i) * new_data_disks);

ahem.


Re: [PATCH 006 of 6] md: Add support for reshape of a raid6

2007-02-21 Thread Andrew Morton
On Tue, 20 Feb 2007 17:35:16 +1100
NeilBrown [EMAIL PROTECTED] wrote:

 + for (i = conf->raid_disks ; i-- ;  ) {

That statement should be dragged out, shot, stomped on then ceremonially
incinerated.

What's wrong with doing

for (i = 0; i < conf->raid_disks; i++) {

in a manner which can be understood without alcoholic fortification?

ho hum.


Re: [PATCH 006 of 6] md: Add support for reshape of a raid6

2007-02-21 Thread Andrew Morton
On Thu, 22 Feb 2007 00:36:22 +0100
Oleg Verych [EMAIL PROTECTED] wrote:

  From: Andrew Morton
  Newsgroups: gmane.linux.raid,gmane.linux.kernel
  Subject: Re: [PATCH 006 of 6] md: Add support for reshape of a raid6
  Date: Wed, 21 Feb 2007 14:48:06 -0800
 
 Hallo.
 
  On Tue, 20 Feb 2007 17:35:16 +1100
  NeilBrown [EMAIL PROTECTED] wrote:
 
  +  for (i = conf->raid_disks ; i-- ;  ) {
 
  That statement should be dragged out, shot, stomped on then ceremonially
  incinerated.
 
  What's wrong with doing
 
  for (i = 0; i < conf->raid_disks; i++) {
 
  in a manner which can be understood without alcoholic fortification?
 
  ho hum.
 
 In case someone likes to do a job GCC usually ought to do, i would
 suggest something like this instead:
 
if (expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
/* Need to write out all blocks after computing PQ */
 -   sh->disks = conf->raid_disks;
 + i = conf->raid_disks;
 + sh->disks = i;
 -   sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
 -conf->raid_disks);
 +   sh->pd_idx = stripe_to_pdidx(sh->sector, conf, i);
 
compute_parity6(sh, RECONSTRUCT_WRITE);
 -   for (i = conf->raid_disks ; i-- ;  ) {
 + do {
set_bit(R5_LOCKED, &sh->dev[i].flags);
locked++;
set_bit(R5_Wantwrite, &sh->dev[i].flags);
 -   }
 + } while (--i);
 
clear_bit(STRIPE_EXPANDING, &sh->state);
} else if (expanded) {
 
 In any case this is subject of scripts/bloat-o-meter.

This:

--- a/drivers/md/raid5.c~a
+++ a/drivers/md/raid5.c
@@ -2364,7 +2364,7 @@ static void handle_stripe6(struct stripe
sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
 conf->raid_disks);
compute_parity6(sh, RECONSTRUCT_WRITE);
-   for (i = conf->raid_disks ; i-- ;  ) {
+   for (i = 0; i < conf->raid_disks; i++) {
set_bit(R5_LOCKED, &sh->dev[i].flags);
locked++;
set_bit(R5_Wantwrite, &sh->dev[i].flags);
_

reduces the size of drivers/md/raid5.o's .text by two bytes.




Re: [PATCH 006 of 6] md: Add support for reshape of a raid6

2007-02-21 Thread Andrew Morton
On Thu, 22 Feb 2007 13:39:56 +1100 Neil Brown [EMAIL PROTECTED] wrote:

 I must right code that Andrew can read.

That's write.

But more importantly, things that people can immediately see and understand
help reduce the possibility of mistakes.  Now and in the future.

If we did all loops like that, then it'd be the best way to do it in new code,
because people's eyes and brains are locked into that idiom and we just
don't have to think about it when we see it.


Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba - RAID5)

2007-01-26 Thread Andrew Morton
On Wed, 24 Jan 2007 18:37:15 -0500 (EST)
Justin Piszcz [EMAIL PROTECTED] wrote:

  Without digging too deeply, I'd say you've hit the same bug Sami Farin and
  others have reported starting with 2.6.19: pages mapped with kmap_atomic()
  become unmapped during memcpy() or similar operations.  Try disabling
  preempt -- that seems to be the common factor.
  
  
  
  
 
 After I run some other tests, I am going to re-run this test and see if it 
 OOPSes again with PREEMPT off.

Strange.  The below debug patch might catch it - please run with this
applied.  


--- a/arch/i386/mm/highmem.c~kmap_atomic-debugging
+++ a/arch/i386/mm/highmem.c
@@ -30,7 +30,43 @@ void *kmap_atomic(struct page *page, enu
 {
enum fixed_addresses idx;
unsigned long vaddr;
+   static unsigned warn_count = 10;
 
+   if (unlikely(warn_count == 0))
+   goto skip;
+
+   if (unlikely(in_interrupt())) {
+   if (in_irq()) {
+   if (type != KM_IRQ0 && type != KM_IRQ1 &&
+   type != KM_BIO_SRC_IRQ && type != KM_BIO_DST_IRQ &&
+   type != KM_BOUNCE_READ) {
+   WARN_ON(1);
+   warn_count--;
+   }
+   } else if (!irqs_disabled()) {  /* softirq */
+   if (type != KM_IRQ0 && type != KM_IRQ1 &&
+   type != KM_SOFTIRQ0 && type != KM_SOFTIRQ1 &&
+   type != KM_SKB_SUNRPC_DATA &&
+   type != KM_SKB_DATA_SOFTIRQ &&
+   type != KM_BOUNCE_READ) {
+   WARN_ON(1);
+   warn_count--;
+   }
+   }
+   }
+
+   if (type == KM_IRQ0 || type == KM_IRQ1 || type == KM_BOUNCE_READ) {
+   if (!irqs_disabled()) {
+   WARN_ON(1);
+   warn_count--;
+   }
+   } else if (type == KM_SOFTIRQ0 || type == KM_SOFTIRQ1) {
+   if (irq_count() == 0 && !irqs_disabled()) {
+   WARN_ON(1);
+   warn_count--;
+   }
+   }
+skip:
/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
pagefault_disable();
if (!PageHighMem(page))
_



Re: [PATCH 002 of 4] md: Make 'repair' actually work for raid1.

2007-01-23 Thread Andrew Morton
On Tue, 23 Jan 2007 11:26:52 +1100
NeilBrown [EMAIL PROTECTED] wrote:

 + for (j = 0; j < vcnt ; j++)
 + 	memcpy(page_address(sbio->bi_io_vec[j].bv_page),
 + 	       page_address(pbio->bi_io_vec[j].bv_page),
 + 	       PAGE_SIZE);

I trust these BIOs are known to only contain suitably-allocated, MD-private
pages?  Because if these pages can be user pages then this change is
spectacularly buggy ;)
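
For comparison, a copy that tolerated highmem pages would need kmap rather
than page_address() -- a sketch only, using the two-slot kmap_atomic API of
this kernel, not a claim about what raid1 should do:

	char *src = kmap_atomic(pbio->bi_io_vec[j].bv_page, KM_USER0);
	char *dst = kmap_atomic(sbio->bi_io_vec[j].bv_page, KM_USER1);

	memcpy(dst, src, PAGE_SIZE);
	kunmap_atomic(dst, KM_USER1);	/* unmap in reverse order */
	kunmap_atomic(src, KM_USER0);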



Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.16.19.2

2007-01-22 Thread Andrew Morton
 On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] 
 wrote:
 Why does copying an 18GB file on a 74GB raptor raid1 cause the kernel to invoke 
 the OOM killer and kill all of my processes?

What's that?   Software raid or hardware raid?  If the latter, which driver?

 Doing this on a single disk 2.6.19.2 is OK, no issues.  However, this 
 happens every time!
 
 Anything to try?  Any other output needed?  Can someone shed some light on 
 this situation?
 
 Thanks.
 
 
 The last lines of vmstat 1 (right before it kill -9'd my shell/ssh)
 
 procs ---memory-- ---swap-- -io -system-- 
 cpu
  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id 
 wa
  0  7764  50348 12 126998800 53632   172 1902 4600  1  8 
 29 62
  0  7764  49420 12 126000400 53632 34368 1871 6357  2 11 
 48 40

The wordwrapping is painful :(

 
 The last lines of dmesg:
 [ 5947.199985] lowmem_reserve[]: 0 0 0
 [ 5947.12] DMA: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3544kB
 [ 5947.200010] Normal: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 2740kB
 [ 5947.200035] HighMem: 98*4kB 35*8kB 9*16kB 69*32kB 4*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3664kB
 [ 5947.200052] Swap cache: add 789, delete 189, find 16/17, race 0+0
 [ 5947.200055] Free swap  = 2197628kB
 [ 5947.200058] Total swap = 2200760kB
 [ 5947.200060] Free swap:   2197628kB
 [ 5947.205664] 517888 pages of RAM 
 [ 5947.205671] 288512 pages of HIGHMEM
 [ 5947.205673] 5666 reserved pages 
 [ 5947.205675] 257163 pages shared
 [ 5947.205678] 600 pages swap cached 
 [ 5947.205680] 88876 pages dirty
 [ 5947.205682] 115111 pages writeback
 [ 5947.205684] 5608 pages mapped
 [ 5947.205686] 49367 pages slab
 [ 5947.205688] 541 pages pagetables
 [ 5947.205795] Out of memory: kill process 1853 (named) score 9937 or a child
 [ 5947.205801] Killed process 1853 (named)
 [ 5947.206616] bash invoked oom-killer: gfp_mask=0x84d0, order=0, oomkilladj=0
 [ 5947.206621]  [<c013e33b>] out_of_memory+0x17b/0x1b0
 [ 5947.206631]  [<c013fcac>] __alloc_pages+0x29c/0x2f0
 [ 5947.206636]  [<c01479ad>] __pte_alloc+0x1d/0x90
 [ 5947.206643]  [<c0148bf7>] copy_page_range+0x357/0x380
 [ 5947.206649]  [<c0119d75>] copy_process+0x765/0xfc0
 [ 5947.206655]  [<c012c3f9>] alloc_pid+0x1b9/0x280
 [ 5947.206662]  [<c011a839>] do_fork+0x79/0x1e0
 [ 5947.206674]  [<c015f91f>] do_pipe+0x5f/0xc0
 [ 5947.206680]  [<c0101176>] sys_clone+0x36/0x40
 [ 5947.206686]  [<c0103138>] syscall_call+0x7/0xb
 [ 5947.206691]  [<c0420033>] __sched_text_start+0x853/0x950
 [ 5947.206698]  === 

Important information from the oom-killing event is missing.  Please send
it all.

From your earlier reports we have several hundred MB of ZONE_NORMAL memory
which has gone awol.

Please include /proc/meminfo from after the oom-killing.

Please work out what is using all that slab memory, via /proc/slabinfo.

After the oom-killing, please see if you can free up the ZONE_NORMAL memory
via a few `echo 3 > /proc/sys/vm/drop_caches' commands.  See if you can
work out what happened to the missing couple-of-hundred MB from
ZONE_NORMAL.




Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.16.19.2

2007-01-22 Thread Andrew Morton
On Tue, 23 Jan 2007 11:37:09 +1100
Donald Douwsma [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] 
  wrote:
  Why does copying an 18GB file on a 74GB raptor raid1 cause the kernel to invoke 
  the OOM killer and kill all of my processes?
  
  What's that?   Software raid or hardware raid?  If the latter, which driver?
 
 I've hit this using local disk while testing xfs built against 2.6.20-rc4 
 (SMP x86_64)
 
 dmesg follows, I'm not sure if anything in this is useful after the first 
 event as our automated tests continued on
 after the failure.

This looks different.

 ...

 Mem-info:
 Node 0 DMA per-cpu:
 CPU0: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
 CPU1: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
 CPU2: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
 CPU3: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
 Node 0 DMA32 per-cpu:
 CPU0: Hot: hi:  186, btch:  31 usd:  31   Cold: hi:   62, btch:  15 usd:  53
 CPU1: Hot: hi:  186, btch:  31 usd:   2   Cold: hi:   62, btch:  15 usd:  60
 CPU2: Hot: hi:  186, btch:  31 usd:  20   Cold: hi:   62, btch:  15 usd:  47
 CPU3: Hot: hi:  186, btch:  31 usd:  25   Cold: hi:   62, btch:  15 usd:  56
 Active:76 inactive:495856 dirty:0 writeback:0 unstable:0 free:3680 slab:9119 mapped:32 pagetables:637

No dirty pages, no pages under writeback.

 Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3296 all_unreclaimable? yes
 lowmem_reserve[]: 0 2003 2003
 Node 0 DMA32 free:6684kB min:5712kB low:7140kB high:8568kB active:304kB inactive:1981624kB present:2052068kB

Inactive list is filled.

 pages_scanned:4343329 all_unreclaimable? yes

We scanned our guts out and decided that nothing was reclaimable.

 No available memory (MPOL_BIND): kill process 3492 (hald) score 0 or a child
 No available memory (MPOL_BIND): kill process 7914 (top) score 0 or a child
 No available memory (MPOL_BIND): kill process 4166 (nscd) score 0 or a child
 No available memory (MPOL_BIND): kill process 17869 (xfs_repair) score 0 or a child

But in all cases a constrained memory policy was in use.


Re: [patch] md: pass down BIO_RW_SYNC in raid{1,10}

2007-01-08 Thread Andrew Morton
On Mon, 8 Jan 2007 10:08:34 +0100
Lars Ellenberg [EMAIL PROTECTED] wrote:

 md raidX make_request functions strip off the BIO_RW_SYNC flag,
 thus introducing additional latency.
 
 fixing this in raid1 and raid10 seems to be straight forward enough.
 
 for our particular usage case in DRBD, passing this flag improved
 some initialization time from ~5 minutes to ~5 seconds.

That sounds like a significant fix.

This patch also applies to 2.6.19 and I have tagged it for a -stable
backport.  Neil, are you OK with that?

 Acked-by: NeilBrown [EMAIL PROTECTED]
 Signed-off-by: Lars Ellenberg [EMAIL PROTECTED]

 ---
 
 diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
 index b30f74b..164b25d 100644
 --- a/drivers/md/raid1.c
 +++ b/drivers/md/raid1.c
 @@ -775,6 +775,7 @@ static int make_request(request_queue_t 
   struct bio_list bl;
   struct page **behind_pages = NULL;
   const int rw = bio_data_dir(bio);
 + const int do_sync = bio_sync(bio);
   int do_barriers;
  
   /*
 @@ -835,7 +836,7 @@ static int make_request(request_queue_t 
   read_bio->bi_sector = r1_bio->sector + mirror->rdev->data_offset;
   read_bio->bi_bdev = mirror->rdev->bdev;
   read_bio->bi_end_io = raid1_end_read_request;
 - read_bio->bi_rw = READ;
 + read_bio->bi_rw = READ | do_sync;
   read_bio->bi_private = r1_bio;
  
   generic_make_request(read_bio);
 @@ -906,7 +907,7 @@ #endif
   mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
   mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
   mbio->bi_end_io = raid1_end_write_request;
 - mbio->bi_rw = WRITE | do_barriers;
 + mbio->bi_rw = WRITE | do_barriers | do_sync;
   mbio->bi_private = r1_bio;
  
   if (behind_pages) {
 @@ -941,6 +942,8 @@ #endif
   blk_plug_device(mddev->queue);
   spin_unlock_irqrestore(&conf->device_lock, flags);
  
 + if (do_sync)
 + md_wakeup_thread(mddev->thread);
  #if 0
   while ((bio = bio_list_pop(bl)) != NULL)
   generic_make_request(bio);
 @@ -1541,6 +1544,7 @@ static void raid1d(mddev_t *mddev)
    * We already have a nr_pending reference on these rdevs.
    */
   int i;
 + const int do_sync = bio_sync(r1_bio->master_bio);
   clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
   clear_bit(R1BIO_Barrier, &r1_bio->state);
   for (i=0; i < conf->raid_disks; i++)
 @@ -1561,7 +1565,7 @@ static void raid1d(mddev_t *mddev)
   conf->mirrors[i].rdev->data_offset;
   bio->bi_bdev = conf->mirrors[i].rdev->bdev;
   bio->bi_end_io = raid1_end_write_request;
 - bio->bi_rw = WRITE;
 + bio->bi_rw = WRITE | do_sync;
   bio->bi_private = r1_bio;
   r1_bio->bios[i] = bio;
   generic_make_request(bio);
 @@ -1593,6 +1597,7 @@ static void raid1d(mddev_t *mddev)
   (unsigned long long)r1_bio->sector);
   raid_end_bio_io(r1_bio);
   } else {
 + const int do_sync = bio_sync(r1_bio->master_bio);
   r1_bio->bios[r1_bio->read_disk] =
   mddev->ro ? IO_BLOCKED : NULL;
   r1_bio->read_disk = disk;
 @@ -1608,7 +1613,7 @@ static void raid1d(mddev_t *mddev)
   bio->bi_sector = r1_bio->sector + rdev->data_offset;
   bio->bi_bdev = rdev->bdev;
   bio->bi_end_io = raid1_end_read_request;
 - bio->bi_rw = READ;
 + bio->bi_rw = READ | do_sync;
   bio->bi_private = r1_bio;
   unplug = 1;
   generic_make_request(bio);
 diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
 index f014191..a9401c0 100644
 --- a/drivers/md/raid10.c
 +++ b/drivers/md/raid10.c
 @@ -782,6 +782,7 @@ static int make_request(request_queue_t 
   int i;
   int chunk_sects = conf->chunk_mask + 1;
   const int rw = bio_data_dir(bio);
 + const int do_sync = bio_sync(bio);
   struct bio_list bl;
   unsigned long flags;
  
 @@ -863,7 +864,7 @@ static int make_request(request_queue_t 
   mirror->rdev->data_offset;
   read_bio->bi_bdev = mirror->rdev->bdev;
   read_bio->bi_end_io = raid10_end_read_request;
 - read_bio->bi_rw = READ;
 + read_bio->bi_rw = READ | do_sync;
   read_bio->bi_private = 

Re: sata badness in 2.6.20-rc1? [Was: Re: md patches in -mm]

2006-12-19 Thread Andrew Morton
On Tue, 19 Dec 2006 15:26:00 -0800 (PST)
Luben Tuikov [EMAIL PROTECTED] wrote:

 The reason was that my dev tree was tainted by this bug:
 
 if (good_bytes &&
 -   scsi_end_request(cmd, 1, good_bytes, !!result) == NULL)
 +   scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
 return;
 
 in scsi_io_completion().  I had there !!result which is wrong, and when
 I diffed against master, it produced a bad patch.

Oh.  I thought that got sorted out.  It's a shame this wasn't made clear to
me..

 As James mentioned one of the chunks is good and can go in.

Please send a new patch, not referential to any previous patch or email,
including full changelogging.



Re: sata badness in 2.6.20-rc1? [Was: Re: md patches in -mm]

2006-12-17 Thread Andrew Morton
On Sun, 17 Dec 2006 12:00:12 +0100
Rafael J. Wysocki [EMAIL PROTECTED] wrote:

 Okay, I have identified the patch that causes the problem to appear, which is
 
 fix-sense-key-medium-error-processing-and-retry.patch
 
 With this patch reverted -rc1-mm1 is happily running on my test box.

That was rather unexpected.   Thanks.


Re: md patches in -mm

2006-12-15 Thread Andrew Morton
On Fri, 15 Dec 2006 20:21:46 +0100
[EMAIL PROTECTED] wrote:

 From: Neil Brown [EMAIL PROTECTED]
 Date: Wed, Dec 06, 2006 at 06:20:57PM +1100
  i.e. current -mm is good for 2.6.20 (though I have a few other little
  things I'll be sending in soon, they aren't related to the raid6
  problem).
  
 2.6.20-rc1-mm1 doesn't boot on my box, due to the fact that e2fsck gives
 
 Buffer I/O error on device /dev/md0, logical block 0
 
 and after that 1,2,3,4,5,6,7,8,9, at which point it complains it can't
 read the superblock. It seems the raid6 problem hasn't gone away
 completely, after all.

Odd.  The only md patch in rc1-mm1 is the truly ancient 
md-dm-reduce-stack-usage-with-stacked-block-devices.patch

Does 2.6.20-rc1 work?


Re: sata badness in 2.6.20-rc1? [Was: Re: md patches in -mm]

2006-12-15 Thread Andrew Morton
On Sat, 16 Dec 2006 07:50:01 +1100
Neil Brown [EMAIL PROTECTED] wrote:

 On Friday December 15, [EMAIL PROTECTED] wrote:
  From: Neil Brown [EMAIL PROTECTED]
  Date: Wed, Dec 06, 2006 at 06:20:57PM +1100
   i.e. current -mm is good for 2.6.20 (though I have a few other little
   things I'll be sending in soon, they aren't related to the raid6
   problem).
   
  2.6.20-rc1-mm1 doesn't boot on my box, due to the fact that e2fsck gives
  
  Buffer I/O error on device /dev/md0, logical block 0
  
 
 But before that
  raid5: device sdh1 operational as raid disk 1
  raid5: device sdg1 operational as raid disk 0
  raid5: device sdf1 operational as raid disk 5
  raid5: device sde1 operational as raid disk 6
  raid5: device sdd1 operational as raid disk 7
  raid5: device sdc1 operational as raid disk 3
  raid5: device sdb1 operational as raid disk 2
  raid5: device sda1 operational as raid disk 4
  raid5: allocated 8462kB for md0
  raid5: raid level 6 set md0 active with 8 out of 8 devices, algorithm 2
  RAID5 conf printout:
   --- rd:8 wd:8
   disk 0, o:1, dev:sdg1
   disk 1, o:1, dev:sdh1
   disk 2, o:1, dev:sdb1
   disk 3, o:1, dev:sdc1
   disk 4, o:1, dev:sda1
   disk 5, o:1, dev:sdf1
   disk 6, o:1, dev:sde1
   disk 7, o:1, dev:sdd1
  md0: bitmap initialized from disk: read 15/15 pages, set 1 bits, status: 0
  created bitmap (233 pages) for device md0
  md: super_written gets error=-5, uptodate=0
  raid5: Disk failure on sde1, disabling device. Operation continuing on 7 
  devices
  md: super_written gets error=-5, uptodate=0
  raid5: Disk failure on sdg1, disabling device. Operation continuing on 6 
  devices
  md: super_written gets error=-5, uptodate=0
  raid5: Disk failure on sdf1, disabling device. Operation continuing on 5 
  devices
  md: super_written gets error=-5, uptodate=0
  raid5: Disk failure on sdc1, disabling device. Operation continuing on 4 
  devices
  md: super_written gets error=-5, uptodate=0
  raid5: Disk failure on sdb1, disabling device. Operation continuing on 3 
  devices
  md: super_written gets error=-5, uptodate=0
  raid5: Disk failure on sdh1, disabling device. Operation continuing on 2 
  devices
  md: super_written gets error=-5, uptodate=0
  raid5: Disk failure on sdd1, disabling device. Operation continuing on 1 
  devices
  md: super_written gets error=-5, uptodate=0
  raid5: Disk failure on sda1, disabling device. Operation continuing on 0 
  devices
 
 Oh dear, that array isn't much good any more!
 That is the second report I have had of this with sata drives.  This
 was raid456, the other was raid1.  Two different sata drivers are
 involved (sata_nv in this case, sata_uli in the other case).
 I think something bad happened in sata land just recently.
 The device driver is returning -EIO for a write without printing any messages.
 

OK, this is bad.  The wheels do appear to have fallen off sata in rc1-mm1.

Jeff, I shall send all the sata patches which I have at you one single time
and I shall then drop the lot.  So please don't flub them.

I'll then do a rc1-mm2 without them.


Re: sata badness in 2.6.20-rc1? [Was: Re: md patches in -mm]

2006-12-15 Thread Andrew Morton
On Fri, 15 Dec 2006 13:05:52 -0800
Andrew Morton [EMAIL PROTECTED] wrote:

 Jeff, I shall send all the sata patches which I have at you one single time
 and I shall then drop the lot.  So please don't flub them.
 
 I'll then do a rc1-mm2 without them.

hm, this is looking like a lot of work for not much gain.  Rafael, are
you able to do a quick chop and tell us whether these:

pci-move-pci_vdevice-from-libata-to-core.patch
pata_cs5530-suspend-resume-support-tweak.patch
ata-fix-platform_device_register_simple-error-check.patch
initializer-entry-defined-twice-in-pata_rz1000.patch
pata_via-suspend-resume-support-fix.patch
sata_nv-add-suspend-resume-support.patch
libata-simulate-report-luns-for-atapi-devices.patch
user-of-the-jiffies-rounding-patch-ata-subsystem.patch
libata-fix-oops-with-sparsemem.patch
sata_nv-fix-kfree-ordering-in-remove.patch
libata-dont-initialize-sg-in-ata_exec_internal-if-dma_none-take-2.patch
pci-quirks-fix-the-festering-mess-that-claims-to-handle-ide-quirks-ide-fix.patch

are innocent?

Thanks.


Re: [PATCH 002 of 6] md: Change lifetime rules for 'md' devices.

2006-10-31 Thread Andrew Morton
On Tue, 31 Oct 2006 17:00:51 +1100
NeilBrown [EMAIL PROTECTED] wrote:

 Currently md devices are created when first opened and remain in existence
 until the module is unloaded.
 This isn't a major problem, but it's somewhat ugly.
 
 This patch changes the lifetime rules so that an md device will
 disappear on the last close if it has no state.

This kills the G5:


EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=4 
Modules linked in:
NIP: C01A31B8 LR: C018E5DC CTR: C01A3404
REGS: c17ff4a0 TRAP: 0300   Not tainted  (2.6.19-rc4-mm1)
MSR: 90009032 <EE,ME,IR,DR>  CR: 8448  XER: 
DAR: 6B6B6B6B6B6B6BB3, DSISR: 4000
TASK = cff2b7f0[1899] 'nash' THREAD: c17fc000 CPU: 1
GPR00: 0008 C17FF720 C06B26D0 6B6B6B6B6B6B6B7B 
GPR04: 0002 0001 0001 000200D0 
GPR08: 00050C00 C01A3404  C01A318C 
GPR12: 8444 C0535680 100FE350 100FE7B8 
GPR16:     
GPR20: 10005CD4 1009  10007E54 
GPR24:  C0472C80 6B6B6B6B6B6B6B7B C1FD2530 
GPR28: C7B8C2F0 6B6B6B6B6B6B6B7B C057DAE8 C79A2550 
NIP [C01A31B8] .kobject_uevent+0xac/0x55c
LR [C018E5DC] .__elv_unregister_queue+0x20/0x44
Call Trace:
[C17FF720] [C0562508] read_pipe_fops+0xd0/0xd8 (unreliable)
[C17FF840] [C018E5DC] .__elv_unregister_queue+0x20/0x44
[C17FF8D0] [C0195548] .blk_unregister_queue+0x58/0x9c
[C17FF960] [C019683C] .unlink_gendisk+0x1c/0x50
[C17FF9F0] [C0122840] .del_gendisk+0x98/0x22c
[C17FFA90] [C035B56C] .mddev_put+0xa0/0xe0
[C17FFB20] [C0362178] .md_release+0x84/0x9c
[C17FFBA0] [C00FDDE0] .__blkdev_put+0x204/0x220
[C17FFC50] [C00C765C] .__fput+0x234/0x274
[C17FFD00] [C00C5264] .filp_close+0x6c/0xfc
[C17FFD90] [C00C53B8] .sys_close+0xc4/0x178
[C17FFE30] [C000872C] syscall_exit+0x0/0x40
Instruction dump:
4e800420 0020 0250 0278 0280 0258 0260 0268 
0270 3b20 2fb9 419e003c e81a0038 2fa0 7c1d0378 409e0070 
 7eth0: no IPv6 routers present

It happens during initscripts.  The machine has no MD devices.  config is
at http://userweb.kernel.org/~akpm/config-g5.txt

Also, it'd be nice to enable CONFIG_MUST_CHECK and take a look at a few
things...

drivers/md/md.c: In function `bind_rdev_to_array':
drivers/md/md.c:1379: warning: ignoring return value of `kobject_add', declared with attribute warn_unused_result
drivers/md/md.c:1385: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result
drivers/md/md.c: In function `md_probe':
drivers/md/md.c:2986: warning: ignoring return value of `kobject_register', declared with attribute warn_unused_result
drivers/md/md.c: In function `do_md_run':
drivers/md/md.c:3135: warning: ignoring return value of `sysfs_create_group', declared with attribute warn_unused_result
drivers/md/md.c:3150: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result
drivers/md/md.c: In function `md_check_recovery':
drivers/md/md.c:5446: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 001 of 4] md: Define backing_dev_info.congested_fn for raid0 and linear

2006-08-28 Thread Andrew Morton
On Tue, 29 Aug 2006 15:39:24 +1000
NeilBrown [EMAIL PROTECTED] wrote:

 
 Each backing_dev needs to be able to report whether it is congested,
 either by modulating BDI_*_congested in -state, or by
 defining a -congested_fn.
 md/raid did neither of these.  This patch add a congested_fn
 which simply checks all component devices to see if they are
 congested.
 
 Signed-off-by: Neil Brown [EMAIL PROTECTED]
 
 +static int linear_congested(void *data, int bits)
 +{
 + mddev_t *mddev = data;
 + linear_conf_t *conf = mddev_to_conf(mddev);
 + int i, ret = 0;
 +
 +	for (i = 0; i < mddev->raid_disks && !ret; i++) {
 +		request_queue_t *q = bdev_get_queue(conf->disks[i].rdev->bdev);
 +		ret |= bdi_congested(&q->backing_dev_info, bits);

nit: `ret = ' would suffice here.

 +static int raid0_congested(void *data, int bits)
 +{
 + mddev_t *mddev = data;
 + raid0_conf_t *conf = mddev_to_conf(mddev);
 +	mdk_rdev_t **devlist = conf->strip_zone[0].dev;
 +	int i, ret = 0;
 +
 +	for (i = 0; i < mddev->raid_disks && !ret; i++) {
 +		request_queue_t *q = bdev_get_queue(devlist[i]->bdev);
 +
 +		ret |= bdi_congested(&q->backing_dev_info, bits);

And here.

 + }
 + return ret;
 +}
 +
  
  static int create_strip_zones (mddev_t *mddev)
  {
 @@ -236,6 +251,8 @@ static int create_strip_zones (mddev_t *
 	mddev->queue->unplug_fn = raid0_unplug;
 
 	mddev->queue->issue_flush_fn = raid0_issue_flush;
 +	mddev->queue->backing_dev_info.congested_fn = raid0_congested;
 +	mddev->queue->backing_dev_info.congested_data = mddev;
 
 	printk("raid0: done.\n");
   return 0;
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 000 of 4] md: Introduction

2006-08-24 Thread Andrew Morton
On Thu, 24 Aug 2006 17:40:56 +1000
NeilBrown [EMAIL PROTECTED] wrote:

 
 Following are 4 patches against 2.6.18-rc4-mm2
 
 The first 2 are bug fixes which should go in 2.6.18, and apply
 equally well to that tree as to -mm.
 
 The latter two should stay in -mm until after 2.6.18.
 
 The second patch is maybe bigger than it absolutely needs to be as a bugfix.
 If you like I can strip out all the rcu-extra-carefulness as a separate
 patch and just leave the important bit which involves moving the
 atomic_add down twenty-something lines.
 
 Thanks,
 NeilBrown
 
  [PATCH 001 of 4] md: Fix recent breakage of md/raid1 array checking
  [PATCH 002 of 4] md: Fix issues with referencing rdev in md/raid1.
  [PATCH 003 of 4] md: new sysfs interface for setting bits in the 
 write-intent-bitmap
  [PATCH 004 of 4] md: Remove unnecessary variable x in stripe_to_pdidx().

The second patch is against -mm and doesn't come within a mile of applying
to mainline.

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-15 Thread Andrew Morton
Neil Brown [EMAIL PROTECTED] wrote:

 ...
 
I have a patch which did that,
   but decided that the possibility of kmalloc failure at awkward times
   would make that not suitable.
  
  submit_bh() can and will allocate memory, although most decent device
  drivers should be OK.
  
 
 submit_bh (like all decent device drivers) uses a mempool for memory
 allocation so we can be sure that the delay in getting memory is
 bounded by the delay for a few IO requests to complete, and we can be
 sure the allocation won't fail.  This is perfectly fine.

My point is that some device drivers will apparently call into the memory
allocator (should be GFP_NOIO) on the request submission path.  I don't
recall which drivers - but not mainstream ones.

   
   I don't think a_ops really provides an interface that I can use, partly
   because, as I said in a previous email, it isn't really a public
   interface to a filesystem.
  
  It's publicer than bmap+submit_bh!
  
 
 I don't know how you can say that.

bmap() is a userspace API, not really a kernel one.  And it assumes a
block-backed filesystem.

 bmap is so public that it is exported to userspace through an IOCTL
 and is used by lilo (admitted only for reading, not writing).  More
 significantly it is used by swapfile which is a completely independent
 subsystem from the filesystem.  Contrast this with a_ops.  The primary
 users of a_ops are routines like generic_file_{read,write} and
 friends.  These are tools - library routines - that are used by
 filesystems to implement their 'file_operations' which are the real
 public interface.  As far as these uses go, it is not a public
 interface.  Where a filesystem doesn't use some library routines, it
 does not need to implement the matching functionality in the a_op
 interface.

Well yeah.  If one wants to do filesystem I/O in-kernel then one should use
the filesystem I/O entry points: read() and write() (and get flamed ;)).

If one wants to cut in at a lower level than that then we get into a pickle
because different types of filesytems do quite different things to
implement read() and write().

 The other main user is the 'VM' which might try to flush out or
 invalidate pages.  However the VM isn't using this interface to
 interact with files, but only to interact with pages, and it doesn't
 much care what is done with the pages providing they get clean, or get
 released, or whatever.
 
 The way I perceive Linux design/development, active usage is far more
 significant than documented design.  If some feature of an interface
 isn't being actively used - by in-kernel code - then you cannot be
 sure that feature will be uniformly implemented, or that it won't
 change subtly next week.
 
 So when I went looking for the best way to get md/bitmap to write to a
 file, I didn't just look at the interface specs (which are pretty
 poorly documented anyway), I looked at existing code.
 I can find 3 different parts of the kernel that write to a file.
 They are
swap-file
loop
nfsd
 
 nfsd uses vfs_read/vfs_write  which have too many failure/delay modes
   for me to safely use.
 loop uses prepare_write/commit_write (if available) or f_op-write
   (not vfs_write - I wonder why) which is not much better than what
   nfsd uses.  And as far as I can tell loop never actually flushes data to 
   storage (hope no-one thinks a journalling filesystem on loop is a
   good idea) so it isn't a good model to follow.
   [[If loop was a good model to follow, I would have rejected the
   code for writing to a file in the first place, only accepted code
   for writing to a device, and insisted on using loop to close the gap]]
 
 That leaves swap-file as the only in-kernel user of a filesystem that
 accesses the file in a way similar to what I need.  Is it any surprise
 that I chose to copy it?

swapfile doesn't use vfs_read() for swapin...

 
  
  But if the pagecache is dirty then invalidate_inode_pages() will leave it
  dirty and the VM will later write it back, corrupting your bitmap file. 
  You should get i_writecount, fsync the file and then run
  invalidate_inode_pages().
 
 Well, as I am currently reading the file through the pagecache but
 yes, I should put an fsync in there (and the invalidate before read is
 fairly pointless.  It is the invalidate after close that is important).

umm, the code in its present form _will_ corrupt MD bitmap files. 
invalidate_inode_pages() will not invalidate dirty pagecache, and that
dirty pagecache will later get written back, undoing any of your
intervening direct-io writes.  Most of the time, MD will do another
direct-io write _after_ the VM has done that writeback and things will get
fixed up.  But there is a window.  So that fsync() is required.

  
  You might as well use kernel_read() too, if you insist on being oddball ;)
 
 I didn't know about that one..
 
 If you would feel more comfortable with kernel_read I would be more
 than happy to use it.

We might as well save some 

Re: [PATCH 001 of 3] md: Change md/bitmap file handling to use bmap to file blocks-fix

2006-05-15 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

  +do_sync_file_range(file, 0, LLONG_MAX,
  +   SYNC_FILE_RANGE_WRITE |
  +   SYNC_FILE_RANGE_WAIT_AFTER);

That needs a SYNC_FILE_RANGE_WAIT_BEFORE too.  Otherwise any dirty,
under-writeback pages will remain dirty.  I'll make that change.
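
For reference, the quoted call with that change applied would read (a sketch
using the flags named above):

	do_sync_file_range(file, 0, LLONG_MAX,
			   SYNC_FILE_RANGE_WAIT_BEFORE |
			   SYNC_FILE_RANGE_WRITE |
			   SYNC_FILE_RANGE_WAIT_AFTER);

WAIT_BEFORE makes the call wait for any writeback already in flight before
issuing new writes, so a page that was dirtied while under writeback cannot
be left dirty.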
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-14 Thread Andrew Morton
Neil Brown [EMAIL PROTECTED] wrote:

 On Saturday May 13, [EMAIL PROTECTED] wrote:
  Paul Clements [EMAIL PROTECTED] wrote:
  
   Andrew Morton wrote:
   
The loss of pagecache coherency seems sad.  I assume there's never a
requirement for userspace to read this file.
   
   Actually, there is. mdadm reads the bitmap file, so that would be 
   broken. Also, it's just useful for a user to be able to read the bitmap 
   (od -x, or similar) to figure out approximately how much more he's got 
   to resync to get an array in-sync. Other than reading the bitmap file, I 
   don't know of any way to determine that.
  
  Read it with O_DIRECT :(
 
 Which is exactly what the next release of mdadm does.
 As the patch comment said:
 
 : With this approach the pagecache may contain data which is inconsistent 
 with 
 : what is on disk.  To alleviate the problems this can cause, md invalidates
 : the pagecache when releasing the file.  If the file is to be examined
 : while the array is active (a non-critical but occasionally useful function),
 : O_DIRECT io must be used.  And new version of mdadm will have support for 
 this.

Which doesn't help `od -x' and is going to cause older mdadm userspace to
mysteriously and subtly fail.  Or does the user-kernel interface have
versioning which will prevent this?


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-13 Thread Andrew Morton
Neil Brown [EMAIL PROTECTED] wrote:

 On Friday May 12, [EMAIL PROTECTED] wrote:
  NeilBrown [EMAIL PROTECTED] wrote:
  
   If md is asked to store a bitmap in a file, it tries to hold onto the
   page cache pages for that file, manipulate them directly, and call a
   cocktail of operations to write the file out.  I don't believe this is
   a supportable approach.
  
  erk.  I think it's better than...
  
   This patch changes the approach to use the same approach as swap files.
   i.e. bmap is used to enumerate all the block address of parts of the file
   and we write directly to those blocks of the device.
  
  That's going in at a much lower level.  Even swapfiles don't assume
  buffer_heads.
 
 I'm not assuming buffer_heads.  I'm creating buffer heads and using
 them for my own purposes.  These are my pages and my buffer heads.
 None of them belong to the filesystem.

Right, so it's incoherent with pagecache and userspace can no longer
usefully read this file.

 The buffer_heads are simply a convenient data-structure to record the
 several block addresses for each page.  I could have equally created
 an array storing all the addresses, and built the required bios by
 hand at write time.  But buffer_heads did most of the work for me, so
 I used them.

OK.

 Yes, it is a lower level, but
  1/ I am certain that there will be no kmalloc problems and
  2/ Because it is exactly the level used by swapfile, I know that it
 is sufficiently 'well defined' that no-one is going to break it.

It would be nicer of course to actually use the mm/page_io.c code.  That
would involve implementing swap_bmap() and reimplementing the
get_swap_bio() stuff in terms of a_ops->bmap().

But the swap code can afford to skip blockruns which aren't page-sized and
it uses that capability nicely.  You cannot do that.
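
For orientation, the block-mapping step that both swapfile and the md patch
build on looks roughly like this (a sketch with a hypothetical helper name;
note it has no protection against truncation, the hazard raised elsewhere in
this thread, and bmap() returns 0 for a hole):

static sector_t file_block_to_sector(struct inode *inode, sector_t fblock)
{
	/* ask the filesystem which device block backs this file block */
	sector_t dblock = bmap(inode, fblock);

	/* convert filesystem blocks to 512-byte device sectors */
	return dblock << (inode->i_sb->s_blocksize_bits - 9);
}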

  
  All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me. 
  Operating at the pagecache a_ops level looked better, and more
  filesystem-independent.
 
 If you really want filesystem independence, you need to use vfs_read
 and vfs_write to read/write the file.

yup.

  I have a patch which did that,
 but decided that the possibility of kmalloc failure at awkward times
 would make that not suitable.

submit_bh() can and will allocate memory, although most decent device
drivers should be OK.

There are tricks we can do with writepage.  If the backing filesystem uses
buffer_heads and if you hold a ref on the page then we know that there
won't be any buffer_head allocations nor any disk reads in the writepage
path.  It'll go direct into bio_alloc+submit_bio, just as you're doing now.
IOW: no gain.

 So I now use vfs_read to read in the file (just like knfsd) and
 bmap/submit_bh to write out the file (just like swapfile).
 
 I don't think a_ops really provides an interface that I can use, partly
 because, as I said in a previous email, it isn't really a public
 interface to a filesystem.

It's publicer than bmap+submit_bh!

  
  I haven't looked at this patch at all closely yet.  Do I really need to?
 
 I assume you are asking that because you hope I will retract the
 patch.

Was kinda hoping that would be the outcome.  It's rather gruesome, using
set_fs()+vfs_read() on one side and submit_bh() on the other.

Are you sure the handling at EOF for a non-multiple-of-PAGE_SIZE file
is OK?

The loss of pagecache coherency seems sad.  I assume there's never a
requirement for userspace to read this file.

invalidate_inode_pages() is best-effort.  If someone else has the page
locked or if the page is mmapped then the attempt to take down the
pagecache will fail.  That's relatively-OK, because it'll just lead to
userspace seeing wrong stuff, and we've already conceded that.

But if the pagecache is dirty then invalidate_inode_pages() will leave it
dirty and the VM will later write it back, corrupting your bitmap file. 
You should get i_writecount, fsync the file and then run
invalidate_inode_pages().
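
A minimal sketch of that ordering, with hypothetical naming and error
handling omitted:

static void bitmap_file_unmap(struct file *file)
{
	struct address_space *mapping = file->f_mapping;

	get_write_access(mapping->host);	/* bump i_writecount: no new writers */
	filemap_fdatawrite(mapping);		/* start writeback of dirty pagecache */
	filemap_fdatawait(mapping);		/* ...and wait for it to complete */
	invalidate_inode_pages(mapping);	/* now nothing dirty is left behind */
	put_write_access(mapping->host);
}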

Or not run invalidate_inode_pages() - it doesn't gain anything and will
just reduce the observeability of bugs.  Better off leaving the pagecache
there all the time so that any rarely-occurring bugs become all-the-time
bugs.

You might as well use kernel_read() too, if you insist on being oddball ;)
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-13 Thread Andrew Morton
Paul Clements [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
 
  The loss of pagecache coherency seems sad.  I assume there's never a
  requirement for userspace to read this file.
 
 Actually, there is. mdadm reads the bitmap file, so that would be 
 broken. Also, it's just useful for a user to be able to read the bitmap 
 (od -x, or similar) to figure out approximately how much more he's got 
 to resync to get an array in-sync. Other than reading the bitmap file, I 
 don't know of any way to determine that.

Read it with O_DIRECT :(
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 002 of 8] md/bitmap: Remove bitmap writeback daemon.

2006-05-12 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

  ./drivers/md/bitmap.c |  115 
 ++

hmm.  I hope we're not doing any of that filesystem I/O within the context
of submit_bio() or kblockd or anything like that.  Looks OK from a quick
scan.

a_ops->commit_write() already ran set_page_dirty(), so you don't need that
in there.

I assume this always works in units of a complete page?  It's strange to do
prepare_write() followed immediately by commit_write().  Normally
prepare_write() will do some prereading, but it's smart enough to not do
that if the caller is preparing to write the whole page.

We normally use PAGE_CACHE_SIZE for these things, not PAGE_SIZE.  Same diff.

If you have a page and you want to write the whole thing out then there's
really no need to run prepare_write or commit_write at all.  Just
initialise the whole page, run set_page_dirty() then write_one_page().
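
A sketch of that whole-page path, assuming the caller has fully initialised
the page (write_one_page() expects a locked page and unlocks it; with a
non-zero 'wait' it also waits for the I/O to complete):

static int flush_whole_page(struct page *page)
{
	lock_page(page);
	set_page_dirty(page);
	return write_one_page(page, 1);
}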

Perhaps it should check that the backing filesystem actually implements
commit_write(), prepare_write(), readpage(), etc.  Some might not, and the
user will get taught not to do that via an oops.


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-12 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 If md is asked to store a bitmap in a file, it tries to hold onto the
 page cache pages for that file, manipulate them directly, and call a
 cocktail of operations to write the file out.  I don't believe this is
 a supportable approach.

erk.  I think it's better than...

 This patch changes the approach to use the same approach as swap files.
 i.e. bmap is used to enumerate all the block address of parts of the file
 and we write directly to those blocks of the device.

That's going in at a much lower level.  Even swapfiles don't assume
buffer_heads.

When playing with bmap() one needs to be careful that nobody has truncated
the file on you, else you end up writing to disk blocks which aren't part
of the file any more.

All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me. 
Operating at the pagecache a_ops level looked better, and more
filesystem-independent.

I haven't looked at this patch at all closely yet.  Do I really need to?
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.

2006-05-01 Thread Andrew Morton
Neil Brown [EMAIL PROTECTED] wrote:

  On Sunday April 30, [EMAIL PROTECTED] wrote:
   NeilBrown [EMAIL PROTECTED] wrote:
   

When an md array has been idle (no writes) for 20msecs it is marked as
'clean'.  This delay turns out to be too short for some real
workloads.  So increase it to 200msec (the time to update the metadata
should be a tiny fraction of that) and make it sysfs-configurable.


...

+   safe_mode_delay
+ When an md array has seen no write requests for a certain period
+ of time, it will be marked as 'clean'.  When another write
 + request arrives, the array is marked as 'dirty' before the write
 + commences.  This is known as 'safe_mode'.
+ The 'certain period' is controlled by this file which stores the
+ period as a number of seconds.  The default is 200msec (0.200).
+ Writing a value of 0 disables safemode.
+
   
   Why not make the units milliseconds?  Rename this to safe_mode_delay_msecs
   to remove any doubt.
 
  Because umpteen years ago when I was adding thread-usage statistics to
  /proc/net/rpc/nfsd I used milliseconds and Linus asked me to make it
  seconds - a much more obvious unit.  See Email below.
  It seems very sensible to me.

That's output.  It's easier to do the conversion with output.  And I guess
one could argue that lots of people read /proc files, but few write to
them.

Generally I don't think we should be teaching the kernel to accept
pretend-floating-point numbers like this, especially when a) delay in
milliseconds is such a simple concept and b) it's so easy to go from float
to milliseconds in userspace.
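
The userspace half of that argument is a one-liner.  A sketch, assuming a
hypothetical msec-denominated file like the safe_mode_delay_msecs suggested
earlier:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	if (argc < 2)
		return 1;
	double seconds = strtod(argv[1], NULL);		/* e.g. "0.200" */
	printf("%ld\n", (long)(seconds * 1000.0 + 0.5));	/* round to msecs */
	return 0;
}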

Do you really expect that humans (really dumb ones ;)) will be echoing
numbers into this file?  Or will it mainly be a thing for mdadm to fiddle
with?
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.

2006-04-30 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 
 When an md array has been idle (no writes) for 20msecs it is marked as
 'clean'.  This delay turns out to be too short for some real
 workloads.  So increase it to 200msec (the time to update the metadata
 should be a tiny fraction of that) and make it sysfs-configurable.
 
 
 ...
 
 +   safe_mode_delay
 + When an md array has seen no write requests for a certain period
 + of time, it will be marked as 'clean'.  When another write
 + request arrives, the array is marked as 'dirty' before the write
 + commences.  This is known as 'safe_mode'.
 + The 'certain period' is controlled by this file which stores the
 + period as a number of seconds.  The default is 200msec (0.200).
 + Writing a value of 0 disables safemode.
 +

Why not make the units milliseconds?  Rename this to safe_mode_delay_msecs
to remove any doubt.

 +static ssize_t
 +safe_delay_store(mddev_t *mddev, const char *cbuf, size_t len)
 +{
 +	int scale=1;
 +	int dot=0;
 +	int i;
 +	unsigned long msec;
 +	char buf[30];
 +	char *e;
 +	/* remove a period, and count digits after it */
 +	if (len >= sizeof(buf))
 +		return -EINVAL;
 +	strlcpy(buf, cbuf, len);
 +	buf[len] = 0;
 +	for (i=0; i<len; i++) {
 +		if (dot) {
 +			if (isdigit(buf[i])) {
 +				buf[i-1] = buf[i];
 +				scale *= 10;
 +			}
 +			buf[i] = 0;
 +		} else if (buf[i] == '.') {
 +			dot=1;
 +			buf[i] = 0;
 +		}
 +	}
 +	msec = simple_strtoul(buf, &e, 10);
 +	if (e == buf || (*e && *e != '\n'))
 +		return -EINVAL;
 +	msec = (msec * 1000) / scale;
 +	if (msec == 0)
 +		mddev->safemode_delay = 0;
 +	else {
 +		mddev->safemode_delay = (msec*HZ)/1000;
 +		if (mddev->safemode_delay == 0)
 +			mddev->safemode_delay = 1;
 +	}
 +	return len;

And most of that goes away.
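
A hedged sketch of what the store method could shrink to if the file took
plain milliseconds (assuming safemode_delay stays in jiffies):

static ssize_t
safe_delay_store(mddev_t *mddev, const char *buf, size_t len)
{
	char *e;
	unsigned long msec = simple_strtoul(buf, &e, 10);

	if (e == buf || (*e && *e != '\n'))
		return -EINVAL;
	if (msec == 0)
		mddev->safemode_delay = 0;	/* 0 disables safemode */
	else
		mddev->safemode_delay = max(1UL, msec * HZ / 1000);
	return len;
}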


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 001 of 3] md: Don't clear bits in bitmap when writing to one device fails during recovery.

2006-03-29 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 +	if (!uptodate) {
 +		int sync_blocks = 0;
 +		sector_t s = r1_bio->sector;
 +		long sectors_to_go = r1_bio->sectors;
 +		/* make sure these bits doesn't get cleared. */
 +		do {
 +			bitmap_end_sync(mddev->bitmap, r1_bio->sector,
 +					&sync_blocks, 1);
 +			s += sync_blocks;
 +			sectors_to_go -= sync_blocks;
 +		} while (sectors_to_go > 0);
 		md_error(mddev, conf->mirrors[mirror].rdev);
 +	}

Can mddev-bitmap be NULL?

If so, will the above loop do the right thing when this:

void bitmap_end_sync(struct bitmap *bitmap, sector_t offset, int *blocks,
		     int aborted)
{
	bitmap_counter_t *bmc;
	unsigned long flags;
/*
	if (offset == 0) printk("bitmap_end_sync 0 (%d)\n", aborted);
*/
	if (bitmap == NULL) {
		*blocks = 1024;
		return;
	}

triggers?
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.

2006-03-16 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 +static int resize_stripes(raid5_conf_t *conf, int newsize)
  +{
  +/* make all the stripes able to hold 'newsize' devices.
  + * New slots in each stripe get 'page' set to a new page.
  + * We allocate all the new stripes first, then if that succeeds,
  + * copy everything across.
  + * Finally we add new pages.  This could fail, but we leave
  + * the stripe cache at it's new size, just with some pages empty.
  + *
  + * We use GFP_NOIO allocations as IO to the raid5 is blocked
  + * at some points in this operation.
  + */
  +struct stripe_head *osh, *nsh;
  +struct list_head newstripes, oldstripes;

You can use LIST_HEAD() here, avoid the separate INIT_LIST_HEAD().
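
That is, a LIST_HEAD() definition initialises the list at declaration time:

	LIST_HEAD(newstripes);
	LIST_HEAD(oldstripes);

versus the two-step form the patch uses:

	struct list_head newstripes, oldstripes;
	INIT_LIST_HEAD(&newstripes);
	INIT_LIST_HEAD(&oldstripes);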


  +struct disk_info *ndisks;
  +int err = 0;
  +kmem_cache_t *sc;
  +int i;
  +
 +	if (newsize <= conf->pool_size)
 +		return 0; /* never bother to shrink */
 +
 +	sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
 +			       sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
 +			       0, 0, NULL, NULL);

kmem_cache_create() internally does a GFP_KERNEL allocation.

  +if (!sc)
  +return -ENOMEM;
 +	INIT_LIST_HEAD(&newstripes);
 +	for (i = conf->max_nr_stripes; i; i--) {
 +		nsh = kmem_cache_alloc(sc, GFP_NOIO);

So either this can use GFP_KERNEL, or we have a problem.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.

2006-03-16 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 +	wait_event_lock_irq(conf->wait_for_stripe,
 +			    !list_empty(&conf->inactive_list),
 +			    conf->device_lock,
 +			    unplug_slaves(conf->mddev);
 +			    );

Boy, that's an ugly-looking thing, isn't it?

__wait_event_lock_irq() already puts a semicolon after `cmd' so I think the
one here isn't needed, which would make it a bit less of a surprise to look
at.

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.

2006-03-16 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 +	/* Got them all.
 +	 * Return the new ones and free the old ones.
 +	 * At this point, we are holding all the stripes so the array
 +	 * is completely stalled, so now is a good time to resize
 +	 * conf->disks.
 +	 */
 +	ndisks = kzalloc(newsize * sizeof(struct disk_info), GFP_NOIO);
 +	if (ndisks) {
 +		for (i=0; i<conf->raid_disks; i++)
 +			ndisks[i] = conf->disks[i];
 +		kfree(conf->disks);
 +		conf->disks = ndisks;
 +	} else
 +		err = -ENOMEM;
 +	while(!list_empty(&newstripes)) {
 +		nsh = list_entry(newstripes.next, struct stripe_head, lru);
 +		list_del_init(&nsh->lru);
 +		for (i=conf->raid_disks; i < newsize; i++)
 +			if (nsh->dev[i].page == NULL) {
 +				struct page *p = alloc_page(GFP_NOIO);
 +				nsh->dev[i].page = p;
 +				if (!p)
 +					err = -ENOMEM;
 +			}
 +		release_stripe(nsh);
 +	}
 +	while(!list_empty(&oldstripes)) {
 +		osh = list_entry(oldstripes.next, struct stripe_head, lru);
 +		list_del(&osh->lru);
 +		kmem_cache_free(conf->slab_cache, osh);
 +	}
 +	kmem_cache_destroy(conf->slab_cache);
 +	conf->slab_cache = sc;
 +	conf->active_name = 1-conf->active_name;
 +	conf->pool_size = newsize;
 +	return err;
 +}

Are you sure the -ENOMEM handling here is solid?  It looks strange.

There are a few more GFP_NOIOs in this function, which can possibly become
GFP_KERNEL.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding.

2006-03-16 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 -	retry:
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 -	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
 +	sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK));
 	if (sh) {
 -		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
 -			/* Add failed due to overlap.  Flush everything
 +		if (unlikely(conf->expand_progress != MaxSector)) {
 +			/* expansion might have moved on while waiting for a
 +			 * stripe, so we must do the range check again.
 +			 */
 +			int must_retry = 0;
 +			spin_lock_irq(&conf->device_lock);
 +			if (logical_sector < conf->expand_progress &&
 +			    disks == conf->previous_raid_disks)
 +				/* mismatch, need to try again */
 +				must_retry = 1;
 +			spin_unlock_irq(&conf->device_lock);
 +			if (must_retry) {
 +				release_stripe(sh);
 +				goto retry;
 +			}
 +		}

The locking in here looks strange.  We take the lock, do some arithmetic
and some tests and then drop the lock again.  Is it not possible that the
result of those tests now becomes invalid?

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 007 of 13] md: Core of raid5 resize process

2006-03-16 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 @@ -4539,7 +4543,9 @@ static void md_do_sync(mddev_t *mddev)
*/
 	max_sectors = mddev->resync_max_sectors;
 	mddev->resync_mismatches = 0;
 -	} else
 +	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
 +		max_sectors = mddev->size << 1;
 +	else
 	/* recovery follows the physical size of devices */
 	max_sectors = mddev->size << 1;
   

This change is a no-op.   Intentional?
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally.

2006-03-16 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
  --- ./drivers/md/raid5.c~current~2006-03-17 11:48:58.0 +1100
  +++ ./drivers/md/raid5.c 2006-03-17 11:48:58.0 +1100
  @@ -1747,8 +1747,9 @@ static int make_request(request_queue_t 

That's a fairly complex function..

I wonder about this:

	spin_lock_irq(&conf->device_lock);
	if (--bi->bi_phys_segments == 0) {
		int bytes = bi->bi_size;

		if ( bio_data_dir(bi) == WRITE )
			md_write_end(mddev);
		bi->bi_size = 0;
		bi->bi_end_io(bi, bytes, 0);
	}
	spin_unlock_irq(&conf->device_lock);

bi_end_io() can be somewhat expensive.  Does it need to happen under the lock?
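
One way it could avoid that, sketched under the assumption that nothing else
touches the bio once bi_phys_segments reaches zero: decide under the lock,
call the completion after dropping it.

	struct bio *done = NULL;
	int bytes = 0;

	spin_lock_irq(&conf->device_lock);
	if (--bi->bi_phys_segments == 0) {
		bytes = bi->bi_size;
		if (bio_data_dir(bi) == WRITE)
			md_write_end(mddev);
		bi->bi_size = 0;
		done = bi;		/* complete it outside the lock */
	}
	spin_unlock_irq(&conf->device_lock);
	if (done)
		done->bi_end_io(done, bytes, 0);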
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 013 of 20] Count corrected read errors per drive.

2005-12-12 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 +  errors
  +An approximate count of read errors that have been detected on
  +this device but have not caused the device to be evicted from
  +the array (either because they were corrected or because they
  +happened while the array was read-only).  When using version-1
  +metadata, this value persists across restarts of the array.

Looks funny.  Mixture of tabs and spaces I guess.

<fixes it>
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 002 of 14] Allow raid1 to check consistency

2005-12-05 Thread Andrew Morton
Neil Brown [EMAIL PROTECTED] wrote:

 Would you accept:
  --
  Signed-off-by: Neil Brown [EMAIL PROTECTED]
 
  ### Diffstat output
   ./mm/swap.c |2 ++
   1 file changed, 2 insertions(+)
 
  diff ./mm/swap.c~current~ ./mm/swap.c
  --- ./mm/swap.c~current~ 2005-12-06 10:29:16.0 +1100
  +++ ./mm/swap.c  2005-12-06 10:29:25.0 +1100
  @@ -36,6 +36,8 @@ int page_cluster;
   
   void put_page(struct page *page)
   {
  +if (unlikely(page==NULL))
  +return;
   if (unlikely(PageCompound(page))) {
   page = (struct page *)page_private(page);
   if (put_page_testzero(page)) {
 

eek.  That's an all-over-the-place-fast-path!

 
  Or should I open code this in md ?

Yes please.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 006 of 14] Make /proc/mdstat pollable.

2005-12-01 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 +DECLARE_WAIT_QUEUE_HEAD(md_event_waiters);
  

static scope?
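
i.e. the file-scope form being asked about:

	static DECLARE_WAIT_QUEUE_HEAD(md_event_waiters);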
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 008 of 14] Convert md to use kzalloc throughout

2005-12-01 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 -	conf = kmalloc (sizeof (*conf) + mddev->raid_disks*sizeof(dev_info_t),
 +	conf = kzalloc (sizeof (*conf) + mddev->raid_disks*sizeof(dev_info_t),
 -	new = (mddev_t *) kmalloc(sizeof(*new), GFP_KERNEL);
 +	new = (mddev_t *) kzalloc(sizeof(*new), GFP_KERNEL);
 -	rdev = (mdk_rdev_t *) kmalloc(sizeof(*rdev), GFP_KERNEL);
 +	rdev = (mdk_rdev_t *) kzalloc(sizeof(*rdev), GFP_KERNEL);

It'd be nice to nuke the unneeded cast while we're there.

<edits the diff>

OK, I did that.
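
i.e. since kzalloc() returns void *, the assignments read simply:

	new = kzalloc(sizeof(*new), GFP_KERNEL);
	rdev = kzalloc(sizeof(*rdev), GFP_KERNEL);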
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 010 of 14] Convert various kmap calls to kmap_atomic

2005-12-01 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 +	paddr = kmap_atomic(page, KM_USER0);
 +	memset(paddr + offset, 0xff,
 +	       PAGE_SIZE - offset);

This page which is being altered is a user-visible one, no?  A pagecache
page?

We must always run flush_dcache_page() against a page when the kernel
modifies a user-visible page's contents.  Otherwise, on some architectures,
the modification won't be visible at the different virtual address.
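
The pattern being described, sketched against the quoted snippet:

	paddr = kmap_atomic(page, KM_USER0);
	memset(paddr + offset, 0xff, PAGE_SIZE - offset);
	kunmap_atomic(paddr, KM_USER0);
	flush_dcache_page(page);	/* make the change visible at all aliases */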
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 002 of 2] Make md threads interruptible again.

2005-11-14 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 
 Despite the fact that md threads don't need to be signalled, and won't
 respond to signals anyway, we need to have an 'interruptible' wait,
 else they stay in 'D' state and add to the load average.
 
 ...
 + if (signal_pending(current))
 + flush_signals(current);

Kernel threads don't accept signals by default, so the above is unneeded.

I'll leave it in there now, since that's how things used to be, but please
make a note to nuke that check, then test it to make sure I'm telling the
truth, then fix it up later on?
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: File corruption on LVM2 on top of software RAID1

2005-08-05 Thread Andrew Morton
Simon Matter [EMAIL PROTECTED] wrote:

 While looking at some data corruption vulnerability reports on
  Securityfocus I wonder why this issue does not get any attention from
  distributors. I have an open bugzilla report with RedHat, have an open
 customer service request with RedHat, have mailed people directly. No
 real feedback.
 I'm now in the process of restoring the integrity of my data with the help of
 backups and mirrored data. Maybe I just care too much about other people's
  data, but I know that this bug will corrupt files on hundreds or thousands
  of servers today and most people simply don't know it. Did I miss
  something?

I guess the bug hit really rarely and maybe the reports were lost in the
permanent background noise of dodgy hardware.

We only found and fixed it last week, due to much sleuthing by Matthew
Stapleton.  I assume vendor updates are in the pipeline.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: File corruption on LVM2 on top of software RAID1

2005-08-04 Thread Andrew Morton
Simon Matter [EMAIL PROTECTED] wrote:

 Hi,
 
 Please CC me as I'm not subscribed to the kernel list.
 
 I had a hard time identifying a serious problem in the 2.6 linux kernel.

Yes, it is a serious problem.

 It all started while evaluating RHEL4 for new servers. My data integrity
 tests gave me bad results - which I couldn't believe - and my first idea
 was - of course - bad hardware. I ordered new SCSI disks instead of the
 IDE disks, took another server, spent some money again, tried again and
 again. That's all long ago now...
 
 In my tests I get corrupt files on LVM2 which is on top of software raid1.
 (This is a common setup even mentioned in the software RAID HOWTO and has
 worked for me on RedHat 9 / kernel 2.4 for a long time now and it's my
 favourite configuration). Now, I tested two different distributions, three
 kernels, three different filesystems and three different hardware. I can
 always reproduce it with the following easy scripts:
 
 LOGF=/root/diff.log
 while true; do
   rm -rf /home/XXX2
   rsync -a /home/XXX/ /home/XXX2
   date >> $LOGF
   diff -r /home/XXX /home/XXX2 >> $LOGF
 done
 
 the files in /home/XXX are ~15G of ISO images and rpms.
 
 diff.log looks like this:
 Wed Aug  3 13:45:57 CEST 2005
 Binary files /home/XXX/ES3-U3/rhel-3-U3-i386-es-disc3.iso and
 /home/XXX2/ES3-U3/rhel-3-U3-i386-es-disc3.iso differ
 Wed Aug  3 14:09:14 CEST 2005
 Binary files /home/XXX/8.0/psyche-i386-disc1.iso and
 /home/XXX2/8.0/psyche-i386-disc1.iso differ
 Wed Aug  3 14:44:17 CEST 2005
 Binary files /home/XXX/7.3/valhalla-i386-disc3.iso and
 /home/XXX2/7.3/valhalla-i386-disc3.iso differ
 Wed Aug  3 15:15:05 CEST 2005
 Wed Aug  3 15:45:40 CEST 2005
 
 
 Tested software:
 1) RedHat EL4
 kernel-2.6.9-11.EL
 vanilla 2.6.12.3 kernel
 filesystems: EXT2, EXT3, XFS
 
 2) NOVELL/SUSE 9.3
 kernel-default-2.6.11.4-21.7
 filesystem: EXT3
 
 Tested Hardware:
 1)
 - ASUS P2B-S board
 - CPU PIII 450MHz
 - Intel 440BX/ZX/DX Chipset
 - 4x128M memory (ECC enabled)
 - 2x IDE disks Seagate Barracuda 400G, connected to onboard Intel PIIX4
 Ultra 33
 - Promise Ultra100TX2 adapter for additional tests
 
 2)
 - DELL PowerEdge 1400
 - CPU PIII 800MHz
 - ServerWorks OSB4 Chipset
 - 4x256M memory (ECC enabled)
 - 2x U320 SCSI disks Maxtor Atlas 10K 146G
 - onboard Adaptec aic7899 Ultra160 SCSI adapter
 
 3)
 - DELL PowerEdge 1850
 - CPU P4 XEON 2.8GHz
 - 2G memory
 - 2x U320 SCSI disks Maxtor Atlas 10K 73G SCA
 - onboard LSI53C1030 SCSI adapter
 
 I've put some files together from the last test on the PE1850 server:
 http://www.invoca.ch/bugs/linux-2.6-corruption-on-lvm2-on-raid1/
 
 I've also filed a bug with RedHat:
 https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=164696
 
 I have spent a lot of time on this bug because I consider it very serious.
 I'm not a kernel hacker but if there is anything I can do to fix this, let
 me know.
 

Thanks for doing that.

There's one fix against 2.6.12.3 which is needed, but 2.6.9 didn't have the
bug which this fix addresses.  So unless 2.6.9 has a different problem,
this won't help.

But still, you should include this in testing please.



diff -puN fs/bio.c~bio_clone-fix fs/bio.c
--- devel/fs/bio.c~bio_clone-fix2005-07-28 00:39:40.0 -0700
+++ devel-akpm/fs/bio.c 2005-07-28 01:02:34.0 -0700
@@ -261,6 +261,7 @@ inline void __bio_clone(struct bio *bio,
 */
	bio->bi_vcnt = bio_src->bi_vcnt;
	bio->bi_size = bio_src->bi_size;
+	bio->bi_idx = bio_src->bi_idx;
bio_phys_segments(q, bio);
bio_hw_segments(q, bio);
 }
_

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 000 of 7] Introduction

2005-08-03 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 
 Following are 7 patches for md in 2.6.13-rc4
 They are all fairly well tested, with the possible exception of '4' -
 I haven't actually tried throwing BIO_RW_BARRIER requests at any md
 devices.  However the code is very straightforward.
 
 I'm happy (even keen) for these to go into 2.6.13.
 If it's getting a bit late, then 2 is probably the most important.
 The others we can probably live without.
 

hm, OK.  Merging 1) and 2) seems sane.  I must say that I worry about 4)
and would prefer to defer things if poss.

If there are others there which you really would prefer to see in 2.6.13
then please let me know.


 
 
  [PATCH md 001 of 7] Remove a stray debugging printk.
  [PATCH md 002 of 7] Make 'md' and alias for 'md-mod'
  [PATCH md 003 of 7] Fix minor error in raid10 read-balancing calculation.
  [PATCH md 004 of 7] Fail IO requests to md that require a barrier.
  [PATCH md 005 of 7] Always honour md bitmap being read from disk
  [PATCH md 006 of 7] Yet another attempt to get bitmap-based resync to do the 
 right thing in all cases...
  [PATCH md 007 of 7] Make sure md bitmap updates are flushed when array is 
 stopped.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 000 of 7] Introduction

2005-08-03 Thread Andrew Morton
Neil Brown [EMAIL PROTECTED] wrote:

 On Wednesday August 3, [EMAIL PROTECTED] wrote:
  NeilBrown [EMAIL PROTECTED] wrote:
  
   
   Following are 7 patches for md in 2.6.13-rc4
   They are all fairly well tested, with the possible exception of '4' -
   I haven't actually tried throwing BIO_RW_BARRIER requests at any md
   devices.  However the code is very straightforward.
   
   I'm happy (even keen) for these to go into 2.6.13.
   If it's getting a bit late, then 2 is probably the most important.
   The others we can probably live without.
   
  
  hm, OK.  Merging 1) and 2) seems sane.  I must say that I worry about 4)
  and would prefer to defer things if poss.
 
 Fair enough.  4 can wait until 2.6.14 (only a couple of months away,
 right :-)

Thereabouts..

  
  If there are others there which you really would prefer to see in 2.6.13
  then please let me know.
  
 
 5, 6, and 7 I would really prefer to be in 2.6.13.

OK.

 The intent-bitmap stuff is broken (in small but potentially
 significant ways) without them, and they are complete no-ops if
 bitmaps aren't enabled:
   5 only touches bitmap.c
   6 removes the setting for R1BIO_Degraded which is not used and
  slightly re-arranges an 'if' statement.  All other changes are 
  complete no-ops if bitmaps aren't enabled.
   7: if bitmaps aren't enabled, all this does is call wait_event
  exactly the same way as will very possibly be called a few lines
  later when md_update_sb is called.
 
 So I'm convinced (and hopefully convincing) that they don't have any
 significant effect if bitmaps aren't enabled, and fix genuine problems
 with bitmaps, and so are appropriate for 2.6.13...

We'll see ;)
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 0 of 12] Introduction

2005-03-23 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 
  Andrew: Are you happy to keep collecting these as a list of patches
  (bugs followed by bug-fixes :-), or would it be easier if I merged all
  the bug fixes into earlier patches and just resent a small number of
  add-functionality patches??

What I'll generally do is to topologically sort the patches.  Work out what
patch each of the new patches fix and then rename the patches to
name-of-the-patch-which-is-being-fixed-fix.patch.

So if you had

foo.patch
bar.patch

and then sent me [patch] frob the nozzle

I'd work out that this new patch is fixing foo.patch and I'd name it
foo-fix.patch and the sequence would become:

foo.patch
foo-fix.patch
bar.patch

and then at some time in the future I'll collapse foo-fix.patch,
foo-fix-fix.patch, etc into foo.patch.

Of course it doesn't always work out that simply ;)

In this case it's not clear that a tsort will work out right.  I'll take a
look.

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 0 of 12] Introduction

2005-03-23 Thread Andrew Morton
Andrew Morton [EMAIL PROTECTED] wrote:

 Of course it doesn't always work out that simply ;)
 
  In this case it's not clear that a tsort will work out right.  I'll take a
  look.

Well it wasn't completely clean, and I forgot to rename the patches, but
this is how the tsort came out:


 md-merge-md_enter_safemode-into-md_check_recovery.patch
 md-improve-locking-on-safemode-and-move-superblock-writes.patch
 md-improve-the-interface-to-sync_request.patch
 md-optimised-resync-using-bitmap-based-intent-logging.patch
+md-a-couple-of-tidyups-relating-to-the-bitmap-file.patch
+md-call-bitmap_daemon_work-regularly.patch
+md-print-correct-pid-for-newly-created-bitmap-writeback-daemon.patch
+md-minor-code-rearrangement-in-bitmap_init_from_disk.patch
+md-make-sure-md-bitmap-is-cleared-on-a-clean-start.patch
 md-printk-fix.patch
+md-improve-debug-printing-of-bitmap-superblock.patch
+md-check-return-value-of-write_page-rather-than-ignore-it.patch
+md-enable-the-bitmap-write-back-daemon-and-wait-for-it.patch
+md-dont-skip-bitmap-pages-due-to-lack-of-bit-that-we-just-cleared.patch
 md-optimised-resync-using-bitmap-based-intent-logging-fix.patch
 md-raid1-support-for-bitmap-intent-logging.patch
+md-fix-bug-when-raid1-attempts-a-partial-reconstruct.patch
 md-raid1-support-for-bitmap-intent-logging-fix.patch
 md-optimise-reconstruction-when-re-adding-a-recently-failed-drive.patch
 md-fix-deadlock-due-to-md-thread-processing-delayed-requests.patch
+md-allow-md-intent-bitmap-to-be-stored-near-the-superblock.patch
+md-allow-md-to-update-multiple-superblocks-in-parallel.patch

So we can later work out which of these to fold into which other ones.

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 0 of 4] Introduction

2005-03-07 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 The first two are trivial and should apply equally to 2.6.11
 
  The second two fix bugs that were introduced by the recent 
  bitmap-based-intent-logging patches and so are not relevant
  to 2.6.11 yet. 

The changelog for the Fix typo in super_1_sync patch doesn't actually say
what the patch does.  What are the user-visible consequences of not fixing
this?


Is the bitmap stuff now ready for Linus?
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fw: [Bugme-new] [Bug 4211] New: md configuration destroys disk GPT label

2005-02-14 Thread Andrew Morton


Begin forwarded message:

Date: Mon, 14 Feb 2005 05:29:22 -0800
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: [Bugme-new] [Bug 4211] New: md configuration destroys disk GPT label


http://bugme.osdl.org/show_bug.cgi?id=4211

   Summary: md configuration destroys disk GPT label
Kernel Version: 2.4.18-e.31smp
Status: NEW
  Severity: high
 Owner: [EMAIL PROTECTED]
 Submitter: [EMAIL PROTECTED]


Distribution: RedHat Advanced Server 2.1

Hardware Environment: NEC System, 4 CPU IA-64, 8 GB RAM, 
  direct connected FC disks

Software Environment: OS + Oracle installation

Problem Description: 
md configuration based on complete disks instead of partitions destroys the GPT
label of the used disks

Steps to reproduce:
Setup a Software Raid, raid level 0 or 1, based on complete disks,

raiddev /dev/md0
raid-level  0
nr-raid-disks   2
persistent-superblock 1
chunk-size  16
device  /dev/sdb
raid-disk   0
device  /dev/sdc
raid-disk   1

create the raiddevice, create a free choosable filesystem, I worked with ext2,
and put some I/O load on the filesystem. 
After umounting raiddevice, try to print the partition table using 'parted', you
will get 

[EMAIL PROTECTED] /]# parted /dev/sdb print
Primary GPT is invalid, using alternate GPT.
Error: This disk contains a valid Alternate GUID Partition Table but the Primary
GPT and Protective MBR are invalid.  This generally means that the disk had GPT
partitions on it, but then a legacy partition editing tool was used to change
the partition table stored in the MBR.
Which data is valid,  GPT or MBR?
Yes will assume that the GPT information is correct, and will rewrite the PMBR
and Primary GPT.
No will assume that the MBR is correct, and erase the GPT information.
Ignore will assume that the MBR is correct, but not change the disk.
Yes No Ignore ? 

The same setup works fine, if I am using partitions instead of the whole disk ...

raiddev /dev/md0
raid-level  0
nr-raid-disks   2
persistent-superblock 1
chunk-size  16
device  /dev/sdb1
raid-disk   0
device  /dev/sdc1
raid-disk   1

[EMAIL PROTECTED] /]# parted /dev/sdb print
Disk geometry for /dev/sdb: 0.000-34688.000 megabytes
Disk label type: GPT
MinorStart   End Filesystem  Name  Flags
1  0.017  34687.500  ext2  lba


Regards
 Alex

--- You are receiving this mail because: ---
You are on the CC list for the bug, or are watching someone who is.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 6 of 6] Fix handling of overlapping requests in raid5

2005-02-07 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 +	retry:
 	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
 	if (sh) {
 -
 -		while (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
 -			/* add failed due to overlap.  Flush everything
 +		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
 +			/* Add failed due to overlap.  Flush everything
 			 * and wait a while
 -			 * FIXME - overlapping requests should be handled better
 			 */
 			raid5_unplug_device(mddev->queue);
 -			set_current_state(TASK_UNINTERRUPTIBLE);
 -			schedule_timeout(1);
 +			release_stripe(sh);
 +			schedule();
 +			goto retry;

Worrisome.  If the calling process has SCHED_RR or SCHED_FIFO policy, this
could cause a lockup, perhaps.

Some sort of real synchronisation scheme would be nicer.  Or at the least,
what was wrong with the schedule_timeout(1)?
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 6 of 6] Fix handling of overlapping requests in raid5

2005-02-07 Thread Andrew Morton
Neil Brown [EMAIL PROTECTED] wrote:

 On Monday February 7, [EMAIL PROTECTED] wrote:
  NeilBrown [EMAIL PROTECTED] wrote:
  
    +	retry:
    	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
    	if (sh) {
   -
   -		while (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
   -			/* add failed due to overlap.  Flush everything
   +		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
   +			/* Add failed due to overlap.  Flush everything
    			 * and wait a while
   -			 * FIXME - overlapping requests should be handled better
    			 */
    			raid5_unplug_device(mddev->queue);
   -			set_current_state(TASK_UNINTERRUPTIBLE);
   -			schedule_timeout(1);
   +			release_stripe(sh);
   +			schedule();
   +			goto retry;
  
  Worrisome.  If the calling process has SCHED_RR or SCHED_FIFO policy, this
  could cause a lockup, perhaps.
 
 Is that just that I should have the prepare_to_wait after the retry:
 label rather than before, or is there something more subtle.

umm, yes.  We always exit from schedule() in state TASK_RUNNING, so we need
to run prepare_to_wait() each time around.  See __wait_on_bit() for a
canonical example.
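
The canonical shape, for reference (wq and condition stand in for the real
waitqueue and wakeup condition):

	DEFINE_WAIT(w);

	for (;;) {
		prepare_to_wait(&wq, &w, TASK_UNINTERRUPTIBLE);
		if (condition)
			break;
		schedule();
	}
	finish_wait(&wq, &w);

prepare_to_wait() must run on every pass because schedule() always returns
with the task back in TASK_RUNNING.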

 ### Diffstat output
  ./drivers/md/raid5.c |2 +-
  1 files changed, 1 insertion(+), 1 deletion(-)
 
 diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
 --- ./drivers/md/raid5.c~current~ 2005-02-07 13:34:23.0 +1100
 +++ ./drivers/md/raid5.c  2005-02-08 13:34:38.0 +1100
 @@ -1433,9 +1433,9 @@ static int make_request (request_queue_t
   (unsigned long long)new_sector, 
   (unsigned long long)logical_sector);
  
 +	retry:
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 
 -	retry:
 	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
 	if (sh) {
 		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {

You missed one.

--- 25/drivers/md/raid5.c~md-fix-handling-of-overlapping-requests-in-raid5-fix	2005-02-07 19:04:18.0 -0800
+++ 25-akpm/drivers/md/raid5.c	2005-02-07 19:04:48.0 -0800
@@ -1433,9 +1433,8 @@ static int make_request (request_queue_t
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
 
-	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-
 	retry:
+	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
 	if (sh) {
 		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
diff -puN drivers/md/raid6main.c~md-fix-handling-of-overlapping-requests-in-raid5-fix drivers/md/raid6main.c
--- 25/drivers/md/raid6main.c~md-fix-handling-of-overlapping-requests-in-raid5-fix	2005-02-07 19:04:18.0 -0800
+++ 25-akpm/drivers/md/raid6main.c	2005-02-07 19:04:54.0 -0800
@@ -1593,9 +1593,8 @@ static int make_request (request_queue_t
 			(unsigned long long)new_sector,
 			(unsigned long long)logical_sector);
 
-	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-
 	retry:
+	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
 	if (sh) {
 		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {

_

(That piece of the code looks pretty fugly in an 80-col xterm btw..)

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html