Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet [EMAIL PROTECTED] wrote:

On Mon, 14 Jan 2008, NeilBrown wrote:

raid5's 'make_request' function calls generic_make_request on underlying devices, and if we run out of stripe heads it could end up waiting for one of those requests to complete. This is bad, as recursive calls to generic_make_request go on a queue and are not even attempted until make_request completes.

So: don't make any generic_make_request calls in raid5 make_request until all waiting has been done. We do this by simply setting STRIPE_HANDLE instead of calling handle_stripe(). If we need more stripe_heads, raid5d will get called to process the pending stripe_heads, which will call generic_make_request from a different thread where no deadlock will happen.

This change by itself causes a performance hit, so add a change so that raid5_activate_delayed is only called at unplug time, never in raid5d. This seems to bring back the performance numbers. Calling it in raid5d was sometimes too soon...

Cc: Dan Williams [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

probably doesn't matter, but for the record:

Tested-by: dean gaudet [EMAIL PROTECTED]

this time i tested with internal and external bitmaps and it survived 8h and 14h resp. under the parallel tar workload i used to reproduce the hang.

btw this should probably be a candidate for 2.6.22 and .23 stable.

hm, Neil said:

The first fixes a bug which could make it a candidate for 24-final. However it is a deadlock that seems to occur very rarely, and has been in mainline since 2.6.22. So letting it into one more release shouldn't be a big problem. While the fix is fairly simple, it could have some unexpected consequences, so I'd rather go for the next cycle.

food fight!

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
On Wed, 16 Jan 2008 00:09:31 -0700 Dan Williams [EMAIL PROTECTED] wrote:

heheh. it's really easy to reproduce the hang without the patch -- i could hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB. i'll try with ext3...

Dan's experiences suggest it won't happen with ext3 (or is even more rare), which would explain why this is overall a rare problem.

Hmmm... how rare? http://marc.info/?l=linux-kernel&m=119461747005776&w=2

There is nothing specific that prevents other filesystems from hitting it; perhaps XFS is just better at submitting large i/o's.

-stable should get some kind of treatment. I'll take altered performance over a hung system. We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling wimpy.
Re: [PATCH 001 of 7] md: Support 'external' metadata for md arrays.
On Fri, 14 Dec 2007 17:26:08 +1100 NeilBrown [EMAIL PROTECTED] wrote:

+	if (strncmp(buf, "external:", 9) == 0) {
+		int namelen = len-9;
+		if (namelen >= sizeof(mddev->metadata_type))
+			namelen = sizeof(mddev->metadata_type)-1;
+		strncpy(mddev->metadata_type, buf+9, namelen);
+		mddev->metadata_type[namelen] = 0;
+		if (namelen && mddev->metadata_type[namelen-1] == '\n')
+			mddev->metadata_type[--namelen] = 0;
+		mddev->persistent = 0;
+		mddev->external = 1;

size_t would be a more appropriate type for `namelen'.
Re: [PATCH 004 of 7] md: Allow devices to be shared between md arrays.
On Fri, 14 Dec 2007 17:26:28 +1100 NeilBrown [EMAIL PROTECTED] wrote:

+	mddev_unlock(rdev->mddev);
+	ITERATE_MDDEV(mddev, tmp) {
+		mdk_rdev_t *rdev2;
+
+		mddev_lock(mddev);
+		ITERATE_RDEV(mddev, rdev2, tmp2)
+			if (test_bit(AllReserved, &rdev2->flags) ||
+			    (rdev->bdev == rdev2->bdev &&
+			     rdev != rdev2 &&
+			     overlaps(rdev->data_offset, rdev->size,
+				      rdev2->data_offset, rdev2->size))) {
+				overlap = 1;
+				break;
+			}
+		mddev_unlock(mddev);
+		if (overlap) {
+			mddev_put(mddev);
+			break;
+		}
+	}

eww, ITERATE_MDDEV() and ITERATE_RDEV() are an eyesore. for_each_mddev() and for_each_rdev() would at least mean the reader doesn't need to check the implementation when wondering what that `break' is breaking from.

 #define	In_sync		2	/* device is in_sync with rest of array */
 #define	WriteMostly	4	/* Avoid reading if at all possible */
 #define	BarriersNotsupp	5	/* BIO_RW_BARRIER is not supported */
+#define	AllReserved	6	/* If whole device is reserved for

The naming style here is inconsistent. A task for the keen would be to convert these to an enum and add some namespacing prefix to them.
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Sat, 1 Dec 2007 06:26:08 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] wrote:

I am putting a new machine together and I have dual raptor raid 1 for the root, which works just fine under all stress tests. Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on sale nowadays). I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)

And sometime along the way(?) (i had gone to sleep and let it run), this occurred:

[42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 action 0x2 frozen

Gee we're seeing a lot of these lately.

[42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
[42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
[42880.680292]          res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 (ATA bus error)
[42881.841899] ata3: soft resetting port
[42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42915.919042] ata3.00: qc timeout (cmd 0xec)
[42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
[42915.919149] ata3.00: revalidation failed (errno=-5)
[42915.919206] ata3: failed to recover some devices, retrying in 5 secs
[42920.912458] ata3: hard resetting port
[42926.411363] ata3: port is slow to respond, please be patient (Status 0x80)
[42930.943080] ata3: COMRESET failed (errno=-16)
[42930.943130] ata3: hard resetting port
[42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42931.413523] ata3.00: configured for UDMA/133
[42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
[42931.413655] ata3: EH complete
[42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors (750156 MB)
[42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
[42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Usually when I see this sort of thing with another box I have full of raptors, it was due to a bad raptor, and I never saw it again after I replaced the disk that it happened on; but that was using the Intel P965 chipset. For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge). I am going to do some further testing, but does this indicate a bad drive? Bad cable? Bad connector? As you can see above, /dev/sdc stopped responding for a little bit and then the kernel reset the port. Why is this though? What is the likely root cause? Should I replace the drive? Obviously this is not normal and cannot be good at all; the idea is to put these drives in a RAID5, and if one is going to timeout that is going to cause the array to go degraded and thus be worthless in a raid5 configuration. Can anyone offer any insight here?

It would be interesting to try 2.6.21 or 2.6.22.
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Thu, 6 Dec 2007 17:38:08 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] wrote:

On Thu, 6 Dec 2007, Andrew Morton wrote:

On Sat, 1 Dec 2007 06:26:08 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] wrote:

[snip - full quote of the earlier "reset port" report and reply]

It would be interesting to try 2.6.21 or 2.6.22.

This was due to NCQ issues (disabling it fixed the problem).

I cannot locate any further email discussion on this topic. Disabling NCQ at either compile time or runtime is not a fix, and further work should be done here to make the kernel run acceptably on that hardware.
Re: [PATCH 2.6.23-rc7 0/3] async_tx and md-accel fixes for 2.6.23
On Thu, 20 Sep 2007 18:27:35 -0700 Dan Williams [EMAIL PROTECTED] wrote:

Fix a couple of bugs and provide documentation for the async_tx api. Neil, please 'ack' patch #3.

git://lost.foo-projects.org/~dwillia2/git/iop async-tx-fixes-for-linus

Well it looks like Neil is on vacation or is hiding from us or something. In which case our options would be to go ahead and merge your #3 with our fingers crossed, or wait and do it in 2.6.23.1 or .2.

How serious is this bug which you're fixing? Will it affect users of legacy MD setups, or only those who have enabled fancy dma-engine features?

Either way, we need to get #2 (at least) in to Linus.
Re: md: raid10: fix use-after-free of bio
On Fri, 27 Jul 2007 16:46:23 +0200 Maik Hampel [EMAIL PROTECTED] wrote:

In case of read errors raid10d tries to print a nice error message, unfortunately using data from an already put bio.

Signed-off-by: Maik Hampel [EMAIL PROTECTED]

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index f730a14..ea1b3e3 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1557,7 +1557,6 @@ static void raid10d(mddev_t *mddev)
 			bio = r10_bio->devs[r10_bio->read_slot].bio;
 			r10_bio->devs[r10_bio->read_slot].bio =
 				mddev->ro ? IO_BLOCKED : NULL;
-			bio_put(bio);
 			mirror = read_balance(conf, r10_bio);
 			if (mirror == -1) {
 				printk(KERN_ALERT "raid10: %s: unrecoverable I/O"
@@ -1567,6 +1566,7 @@ static void raid10d(mddev_t *mddev)
 				raid_end_bio_io(r10_bio);
 			} else {
 				const int do_sync = bio_sync(r10_bio->master_bio);
+				bio_put(bio);
 				rdev = conf->mirrors[mirror].rdev;
 				if (printk_ratelimit())
 					printk(KERN_ERR "raid10: %s: redirecting sector %llu to

Surely we just leaked that bio if (mirror == -1)? better:

--- a/drivers/md/raid10.c~md-raid10-fix-use-after-free-of-bio
+++ a/drivers/md/raid10.c
@@ -1534,7 +1534,6 @@ static void raid10d(mddev_t *mddev)
 			bio = r10_bio->devs[r10_bio->read_slot].bio;
 			r10_bio->devs[r10_bio->read_slot].bio =
 				mddev->ro ? IO_BLOCKED : NULL;
-			bio_put(bio);
 			mirror = read_balance(conf, r10_bio);
 			if (mirror == -1) {
 				printk(KERN_ALERT "raid10: %s: unrecoverable I/O"
@@ -1542,8 +1541,10 @@ static void raid10d(mddev_t *mddev)
 					bdevname(bio->bi_bdev,b),
 					(unsigned long long)r10_bio->sector);
 				raid_end_bio_io(r10_bio);
+				bio_put(bio);
 			} else {
 				const int do_sync = bio_sync(r10_bio->master_bio);
+				bio_put(bio);
 				rdev = conf->mirrors[mirror].rdev;
 				if (printk_ratelimit())
 					printk(KERN_ERR "raid10: %s: redirecting sector %llu to
_
Re: [GIT PATCH 0/2] stripe-queue for 2.6.23 consideration
On Sun, 22 Jul 2007 02:44:52 -0700 Dan Williams [EMAIL PROTECTED] wrote:

The stripe-queue patches are showing solid performance improvement.

git://lost.foo-projects.org/~dwillia2/git/iop md-for-linus

It's a slippery little tree, that one. I was using md-accel-linus, only I nuked it because it was empty, or was causing problems or something. Oh well, I shall switch to md-for-linus and shall see what happens.
Re: [-mm PATCH 0/2] 74% decrease in dispatched writes, stripe-queue take3
On Fri, 13 Jul 2007 15:35:42 -0700 Dan Williams [EMAIL PROTECTED] wrote:

The following patches replace the stripe-queue patches currently in -mm.

I have a little practical problem here: am presently unable to compile anything much due to all the git rejects coming out of git-md-accel.patch. It'd be appreciated if you could keep on top of that, please. It's a common problem at this time of the kernel cycle. The quilt trees are much worse - Greg's stuff is an unholy mess. Ho hum.
Re: [-mm PATCH 0/2] 74% decrease in dispatched writes, stripe-queue take3
On Fri, 13 Jul 2007 15:57:26 -0700 Williams, Dan J [EMAIL PROTECTED] wrote:

-Original Message-
From: Andrew Morton [mailto:[EMAIL PROTECTED]

The following patches replace the stripe-queue patches currently in -mm.

I have a little practical problem here: am presently unable to compile anything much due to all the git rejects coming out of git-md-accel.patch. It'd be appreciated if you could keep on top of that, please. It's a common problem at this time of the kernel cycle. The quilt trees are much worse - Greg's stuff is an unholy mess. Ho hum.

Sorry, please drop git-md-accel.patch and git-ioat.patch as they have been merged into Linus' tree.

But your ongoing maintenance activity will continue to be held in those trees, won't it?
Re: [-mm PATCH 0/2] 74% decrease in dispatched writes, stripe-queue take3
On Fri, 13 Jul 2007 16:28:30 -0700 Williams, Dan J [EMAIL PROTECTED] wrote:

-Original Message-
From: Andrew Morton [mailto:[EMAIL PROTECTED]

But your ongoing maintenance activity will continue to be held in those trees, won't it?

For now:

git://lost.foo-projects.org/~dwillia2/git/iop ioat-md-accel-for-linus

is where the latest combined tree is located. However, Shannon Nelson is coming online to own the i/oat driver so we may need to revisit this situation. We want to avoid the git-ioat/git-md-accel collisions that happened in the past. I will talk with Shannon about how we will coordinate this going forward. The code ownership looks like this:

ioat dma driver - Shannon
net dma offload implementation - Shannon
dmaengine core - shared
async_tx api - shared
iop-adma dma driver - Dan
md-accel implementation - Dan

oh my, how scary. I'll go into hiding until the dust has settled. Please send me the git URLs when it's all set up, thanks.
Re: 2.6.21-mm2 boot failure, raid autodetect, bd_set_size+0xb/0x80
On Fri, 11 May 2007 20:03:34 +0200 [EMAIL PROTECTED] wrote:

From: Andrew Morton [EMAIL PROTECTED]
Date: Wed, May 09, 2007 at 01:23:22AM -0700

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21/2.6.21-mm2/

It won't boot here. AMD64 platform, raid6 partition. 2.6.21-rc7-mm2 runs fine, its dmesg says:

md: created md4
md: bind<hda9>
md: bind<hdc9>
md: running: <hdc9><hda9>
raid1: raid set md4 active with 2 out of 2 mirrors
md4: bitmap initialized from disk: read 10/10 pages, set 46 bits, status: 0
created bitmap (156 pages) for device md4
md: considering hdc8 ...
md:  adding hdc8 ...

snip

where 2.6.21-mm2 halts with

md: created md4
md: bind<hda9>
md: bind<hdc9>
md: running: <hdc9><hda9>
raid1: raid set md4 active with 2 out of 2 mirrors
md4: bitmap initialized from disk: read 10/10 pages, set 46 bits, status: 0
created bitmap (156 pages) for device md4
Unable to handle kernel NULL pointer dereference
bd_set_size+0xb/0x80

Yes, Neil had a whoops and a dud patch spent a day in mainline. Hopefully the below revert (from mainline) will fix it.

From: Linux Kernel Mailing List [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Revert "md: improve partition detection in md array"
Date: Thu, 10 May 2007 01:59:03 GMT
Sender: [EMAIL PROTECTED]

Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=44ce6294d07555c3d313757105fd44b78208407f
Commit: 44ce6294d07555c3d313757105fd44b78208407f
Parent: 497f050c42e46a4b1f6a9bcd8827fa5d97fe1feb
Author: Linus Torvalds [EMAIL PROTECTED]
AuthorDate: Wed May 9 18:51:36 2007 -0700
Committer: Linus Torvalds [EMAIL PROTECTED]
CommitDate: Wed May 9 18:51:36 2007 -0700

Revert "md: improve partition detection in md array"

This reverts commit 5b479c91da90eef605f851508744bfe8269591a0.

Quoth Neil Brown:

It causes an oops when auto-detecting raid arrays, and it doesn't seem easy to fix. The array may not be 'open' when do_md_run is called, so bdev->bd_disk might be NULL, so bd_set_size can oops. This whole approach of opening an md device before it has been assembled just seems to get more and more painful. I think I'm going to have to come up with something clever to provide both backward compatibility with usage expectation, and sane integration into the rest of the kernel.

Signed-off-by: Linus Torvalds [EMAIL PROTECTED]
---
 drivers/md/md.c           | 26 ++
 drivers/md/raid1.c        |  1 +
 drivers/md/raid5.c        |  2 ++
 include/linux/raid/md_k.h |  1 +
 4 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2901d0c..65814b0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3104,7 +3104,6 @@ static int do_md_run(mddev_t * mddev)
 	struct gendisk *disk;
 	struct mdk_personality *pers;
 	char b[BDEVNAME_SIZE];
-	struct block_device *bdev;

 	if (list_empty(&mddev->disks))
 		/* cannot run an array with no devices.. */
@@ -3332,13 +3331,7 @@ static int do_md_run(mddev_t * mddev)
 	md_wakeup_thread(mddev->thread);
 	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
-	bdev = bdget_disk(mddev->gendisk, 0);
-	if (bdev) {
-		bd_set_size(bdev, mddev->array_size << 1);
-		blkdev_ioctl(bdev->bd_inode, NULL, BLKRRPART, 0);
-		bdput(bdev);
-	}
-
+	mddev->changed = 1;
 	md_new_event(mddev);
 	kobject_uevent(&mddev->gendisk->kobj, KOBJ_CHANGE);
 	return 0;
@@ -3460,6 +3453,7 @@ static int do_md_stop(mddev_t * mddev, int mode)
 	mddev->pers = NULL;
 	set_capacity(disk, 0);
+	mddev->changed = 1;
 	if (mddev->ro)
 		mddev->ro = 0;
@@ -4599,6 +4593,20 @@ static int md_release(struct inode *inode, struct file * file)
 	return 0;
 }
+static int md_media_changed(struct gendisk *disk)
+{
+	mddev_t *mddev = disk->private_data;
+
+	return mddev->changed;
+}
+
+static int md_revalidate(struct gendisk *disk)
+{
+	mddev_t *mddev = disk->private_data;
+
+	mddev->changed = 0;
+	return 0;
+}
 static struct block_device_operations md_fops =
 {
 	.owner		= THIS_MODULE,
@@ -4606,6 +4614,8 @@ static struct block_device_operations md_fops =
 	.release	= md_release,
 	.ioctl		= md_ioctl,
 	.getgeo		= md_getgeo,
+	.media_changed	= md_media_changed,
+	.revalidate_disk= md_revalidate,
 };

 static int md_thread(void * arg)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1b7130c..97ee870 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2063,6 +2063,7 @@ static int raid1_resize(mddev_t *mddev, sector_t sectors)
 	 */
 	mddev->array_size
Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
On Thu, 10 May 2007 16:22:31 +1000 NeilBrown [EMAIL PROTECTED] wrote:

The test currently looks for any (non-fuzz) difference, either positive or negative. This clearly is not needed. Any non-sync activity will cause the total sectors to grow faster than the sync_io count (never slower), so we only need to look for a positive difference.

...

--- .prev/drivers/md/md.c	2007-05-10 15:51:54.0 +1000
+++ ./drivers/md/md.c	2007-05-10 16:05:10.0 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
 		 *
 		 * Note: the following is an unsigned comparison.
 		 */
-		if ((curr_events - rdev->last_events + 4096) > 8192) {
+		if ((long)curr_events - (long)rdev->last_events > 4096) {
 			rdev->last_events = curr_events;
 			idle = 0;

In which case would unsigned counters be more appropriate?
Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)
On Thu, 10 May 2007 16:51:31 +0200 (MEST) Jan Engelhardt [EMAIL PROTECTED] wrote:

On May 9 2007 18:51, Linus Torvalds wrote:

(But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is probably some strange mixup of Andrew Morton and Andi Kleen in your mind ;)

What do the letters "kp" stand for?

Some say Kernel Programmer. My parents said Keith Paul.
Re: RAID1 out of memory error, was Re: 2.6.21-rc5-mm4
On Fri, 06 Apr 2007 02:33:03 +1000 Reuben Farrelly [EMAIL PROTECTED] wrote:

Hi,

On 3/04/2007 3:47 PM, Andrew Morton wrote:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/

- The oops in git-net.patch has been fixed, so that tree has been restored. It is huge.
- Added the device-mapper development tree to the -mm lineup (Alasdair Kergon). It is a quilt tree, living at ftp://ftp.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/.
- Added davidel's signalfd stuff.

Looks like some damage, or maybe intolerance to on-disk damage, to RAID-1. md1 is the first array on the disk, and it refuses to start up on boot, or after boot.

...

tornado ~ # mdadm --assemble /dev/md1 /dev/sda1 /dev/sdc1
mdadm: device /dev/md1 already active - cannot assemble it
tornado ~ # mdadm --run /dev/md1
mdadm: failed to run array /dev/md1: Cannot allocate memory
tornado ~ #

and looking at a dmesg, this is logged:

md: bind<sdc1>
md: bind<sda1>
raid1: raid set md1 active with 2 out of 2 mirrors
md1: bitmap initialized from disk: read 0/1 pages, set 0 bits, status: -12
md1: failed to create bitmap (-12)
md: pers->run() failed ...

tornado ~ # uname -a
Linux tornado 2.6.21-rc5-mm4 #1 SMP Thu Apr 5 23:47:42 EST 2007 x86_64 Intel(R) Pentium(R) 4 CPU 3.00GHz GenuineIntel GNU/Linux
tornado ~ #

The last known version that worked was 2.6.21-rc3-mm1 - I haven't been testing out the -mm releases so much lately.

OK. I assume that bitmap->chunks in bitmap_init_from_disk() has some unexpectedly large value. I don't _think_ there's anything in -mm which would have triggered this. Does mainline do the same thing? I guess it's possible that the code in git-md-accel.patch accidentally broke things. Perhaps try disabling CONFIG_DMA_ENGINE?

Also, Andrew, can you please restart posting/cc'ing your -mm announcements to the [EMAIL PROTECTED] list? Seems this stopped around about 2.6.20; it was handy.

hm. I always Bcc [EMAIL PROTECTED] I assume that its filters didn't get updated after the s/osdl/linux-foundation/ thing. I'll talk to people, thanks.

.config is up at http://www.reub.net/files/kernel/configs/2.6.21-rc5-mm4
Re: [PATCH] md: Avoid a deadlock when removing a device from an md array via sysfs.
On Mon, 2 Apr 2007 17:44:17 +1000 NeilBrown [EMAIL PROTECTED] wrote:

(This patch should go in 2.6.21 as it fixes a recent regression - NB)

A device can be removed from an md array via e.g.

echo remove > /sys/block/md3/md/dev-sde/state

This will try to remove the 'dev-sde' subtree which will deadlock since commit e7b0d26a86943370c04d6833c6edba2a72a6e240

With this patch we run the kobject_del via schedule_work so as to avoid the deadlock.

Cc: Alan Stern [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c           | 13 -
 ./include/linux/raid/md_k.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-04-02 17:43:03.0 +1000
+++ ./drivers/md/md.c	2007-04-02 17:38:46.0 +1000
@@ -1389,6 +1389,12 @@ static int bind_rdev_to_array(mdk_rdev_t
 	return err;
 }

+static void delayed_delete(struct work_struct *ws)
+{
+	mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
+	kobject_del(&rdev->kobj);
+}
+
 static void unbind_rdev_from_array(mdk_rdev_t * rdev)
 {
 	char b[BDEVNAME_SIZE];
@@ -1401,7 +1407,12 @@ static void unbind_rdev_from_array(mdk_r
 	printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
 	rdev->mddev = NULL;
 	sysfs_remove_link(&rdev->kobj, "block");
-	kobject_del(&rdev->kobj);
+
+	/* We need to delay this, otherwise we can deadlock when
+	 * writing to 'remove' to dev/state
+	 */
+	INIT_WORK(&rdev->del_work, delayed_delete);
+	schedule_work(&rdev->del_work);
 }

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2007-04-02 17:43:03.0 +1000
+++ ./include/linux/raid/md_k.h	2007-04-02 17:36:32.0 +1000
@@ -104,6 +104,7 @@ struct mdk_rdev_s
 					 * for reporting to userspace and storing
 					 * in superblock.
 					 */
+	struct work_struct del_work;	/* used for delayed sysfs removal */
 };

What guarantees that *rdev is still valid when delayed_delete() runs? And what guarantees that the md module hasn't been rmmodded when delayed_delete() tries to run?
Re: [PATCH] md: Fix for raid6 reshape.
On Fri, 2 Mar 2007 15:56:55 +1100 NeilBrown [EMAIL PROTECTED] wrote:

-	conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
+	conf->expand_progress = (sector_nr + i) * new_data_disks;

ahem.
Re: [PATCH 006 of 6] md: Add support for reshape of a raid6
On Tue, 20 Feb 2007 17:35:16 +1100 NeilBrown [EMAIL PROTECTED] wrote:

+	for (i = conf->raid_disks ; i-- ; ) {

That statement should be dragged out, shot, stomped on, then ceremonially incinerated. What's wrong with doing

	for (i = 0; i < conf->raid_disks; i++) {

in a manner which can be understood without alcoholic fortification? ho hum.
Re: [PATCH 006 of 6] md: Add support for reshape of a raid6
On Thu, 22 Feb 2007 00:36:22 +0100 Oleg Verych [EMAIL PROTECTED] wrote:

From: Andrew Morton
Newsgroups: gmane.linux.raid,gmane.linux.kernel
Subject: Re: [PATCH 006 of 6] md: Add support for reshape of a raid6
Date: Wed, 21 Feb 2007 14:48:06 -0800

Hallo.

On Tue, 20 Feb 2007 17:35:16 +1100 NeilBrown [EMAIL PROTECTED] wrote:

+	for (i = conf->raid_disks ; i-- ; ) {

That statement should be dragged out, shot, stomped on then ceremonially incinerated. What's wrong with doing

	for (i = 0; i < conf->raid_disks; i++) {

in a manner which can be understood without alcoholic fortification? ho hum.

In case someone likes to do the job GCC usually ought to do, i would suggest something like this instead:

 	if (expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
 		/* Need to write out all blocks after computing PQ */
-		sh->disks = conf->raid_disks;
+		i = conf->raid_disks;
+		sh->disks = i;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
-					     conf->raid_disks);
+		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, i);
 		compute_parity6(sh, RECONSTRUCT_WRITE);
-		for (i = conf->raid_disks ; i-- ; ) {
+		do {
 			set_bit(R5_LOCKED, &sh->dev[i].flags);
 			locked++;
 			set_bit(R5_Wantwrite, &sh->dev[i].flags);
-		}
+		} while (--i);
 		clear_bit(STRIPE_EXPANDING, &sh->state);
 	} else if (expanded) {

In any case this is subject of scripts/bloat-o-meter.

This:

--- a/drivers/md/raid5.c~a
+++ a/drivers/md/raid5.c
@@ -2364,7 +2364,7 @@ static void handle_stripe6(struct stripe
 		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
 					     conf->raid_disks);
 		compute_parity6(sh, RECONSTRUCT_WRITE);
-		for (i = conf->raid_disks ; i-- ; ) {
+		for (i = 0; i < conf->raid_disks; i++) {
 			set_bit(R5_LOCKED, &sh->dev[i].flags);
 			locked++;
 			set_bit(R5_Wantwrite, &sh->dev[i].flags);
_

reduces the size of drivers/md/raid5.o's .text by two bytes.
Re: [PATCH 006 of 6] md: Add support for reshape of a raid6
On Thu, 22 Feb 2007 13:39:56 +1100 Neil Brown [EMAIL PROTECTED] wrote:

I must right code that Andrew can read.

That's write. But more importantly, things that people can immediately see and understand help reduce the possibility of mistakes, now and in the future. If we did all loops like that, then it'd be the best way to do it in new code, because people's eyes and brains are locked into that idiom and we just don't have to think about it when we see it.
Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba - RAID5)
On Wed, 24 Jan 2007 18:37:15 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] wrote:

Without digging too deeply, I'd say you've hit the same bug Sami Farin and others have reported starting with 2.6.19: pages mapped with kmap_atomic() become unmapped during memcpy() or similar operations. Try disabling preempt -- that seems to be the common factor.

After I run some other tests, I am going to re-run this test and see if it OOPSes again with PREEMPT off.

Strange. The below debug patch might catch it - please run with this applied.

--- a/arch/i386/mm/highmem.c~kmap_atomic-debugging
+++ a/arch/i386/mm/highmem.c
@@ -30,7 +30,43 @@ void *kmap_atomic(struct page *page, enu
 {
 	enum fixed_addresses idx;
 	unsigned long vaddr;
+	static unsigned warn_count = 10;
+
+	if (unlikely(warn_count == 0))
+		goto skip;
+
+	if (unlikely(in_interrupt())) {
+		if (in_irq()) {
+			if (type != KM_IRQ0 && type != KM_IRQ1 &&
+			    type != KM_BIO_SRC_IRQ && type != KM_BIO_DST_IRQ &&
+			    type != KM_BOUNCE_READ) {
+				WARN_ON(1);
+				warn_count--;
+			}
+		} else if (!irqs_disabled()) {	/* softirq */
+			if (type != KM_IRQ0 && type != KM_IRQ1 &&
+			    type != KM_SOFTIRQ0 && type != KM_SOFTIRQ1 &&
+			    type != KM_SKB_SUNRPC_DATA &&
+			    type != KM_SKB_DATA_SOFTIRQ &&
+			    type != KM_BOUNCE_READ) {
+				WARN_ON(1);
+				warn_count--;
+			}
+		}
+	}
+
+	if (type == KM_IRQ0 || type == KM_IRQ1 || type == KM_BOUNCE_READ) {
+		if (!irqs_disabled()) {
+			WARN_ON(1);
+			warn_count--;
+		}
+	} else if (type == KM_SOFTIRQ0 || type == KM_SOFTIRQ1) {
+		if (irq_count() == 0 && !irqs_disabled()) {
+			WARN_ON(1);
+			warn_count--;
+		}
+	}
+skip:
 	/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
 	pagefault_disable();
 	if (!PageHighMem(page))
_
Re: [PATCH 002 of 4] md: Make 'repair' actually work for raid1.
On Tue, 23 Jan 2007 11:26:52 +1100 NeilBrown [EMAIL PROTECTED] wrote:

+	for (j = 0; j < vcnt ; j++)
+		memcpy(page_address(sbio->bi_io_vec[j].bv_page),
+		       page_address(pbio->bi_io_vec[j].bv_page),
+		       PAGE_SIZE);

I trust these BIOs are known to only contain suitably-allocated, MD-private pages? Because if these pages can be user pages then this change is spectacularly buggy ;)
Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.16.19.2
On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] wrote:

Why does copying an 18GB file on a 74GB raptor raid1 cause the kernel to invoke the OOM killer and kill all of my processes?

What's that? Software raid or hardware raid? If the latter, which driver?

Doing this on a single disk with 2.6.19.2 is OK, no issues. However, this happens every time! Anything to try? Any other output needed? Can someone shed some light on this situation? Thanks.

The last lines of vmstat 1 (right before it kill -9'd my shell/ssh):

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 0 7764 50348 12 1269988 0 0 53632 172 1902 4600 1 8 29 62
 0 7764 49420 12 1260004 0 0 53632 34368 1871 6357 2 11 48 40

The wordwrapping is painful :(

The last lines of dmesg:

[ 5947.199985] lowmem_reserve[]: 0 0 0
[ 5947.12] DMA: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3544kB
[ 5947.200010] Normal: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 2740kB
[ 5947.200035] HighMem: 98*4kB 35*8kB 9*16kB 69*32kB 4*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3664kB
[ 5947.200052] Swap cache: add 789, delete 189, find 16/17, race 0+0
[ 5947.200055] Free swap = 2197628kB
[ 5947.200058] Total swap = 2200760kB
[ 5947.200060] Free swap: 2197628kB
[ 5947.205664] 517888 pages of RAM
[ 5947.205671] 288512 pages of HIGHMEM
[ 5947.205673] 5666 reserved pages
[ 5947.205675] 257163 pages shared
[ 5947.205678] 600 pages swap cached
[ 5947.205680] 88876 pages dirty
[ 5947.205682] 115111 pages writeback
[ 5947.205684] 5608 pages mapped
[ 5947.205686] 49367 pages slab
[ 5947.205688] 541 pages pagetables
[ 5947.205795] Out of memory: kill process 1853 (named) score 9937 or a child
[ 5947.205801] Killed process 1853 (named)
[ 5947.206616] bash invoked oom-killer: gfp_mask=0x84d0, order=0, oomkilladj=0
[ 5947.206621] [c013e33b] out_of_memory+0x17b/0x1b0
[ 5947.206631] [c013fcac] __alloc_pages+0x29c/0x2f0
[ 5947.206636] [c01479ad] __pte_alloc+0x1d/0x90
[ 5947.206643] [c0148bf7] copy_page_range+0x357/0x380
[ 5947.206649] [c0119d75] copy_process+0x765/0xfc0
[ 5947.206655] [c012c3f9] alloc_pid+0x1b9/0x280
[ 5947.206662] [c011a839] do_fork+0x79/0x1e0
[ 5947.206674] [c015f91f] do_pipe+0x5f/0xc0
[ 5947.206680] [c0101176] sys_clone+0x36/0x40
[ 5947.206686] [c0103138] syscall_call+0x7/0xb
[ 5947.206691] [c0420033] __sched_text_start+0x853/0x950
[ 5947.206698] =======================

Important information from the oom-killing event is missing. Please send it all. From your earlier reports we have several hundred MB of ZONE_NORMAL memory which has gone awol. Please include /proc/meminfo from after the oom-killing. Please work out what is using all that slab memory, via /proc/slabinfo. After the oom-killing, please see if you can free up the ZONE_NORMAL memory via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can work out what happened to the missing couple-of-hundred MB from ZONE_NORMAL.
Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.16.19.2
On Tue, 23 Jan 2007 11:37:09 +1100 Donald Douwsma [EMAIL PROTECTED] wrote:

Andrew Morton wrote: On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] wrote: Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke the OOM killer and kill all of my processes? What's that? Software raid or hardware raid? If the latter, which driver?

I've hit this using local disk while testing xfs built against 2.6.20-rc4 (SMP x86_64). dmesg follows; I'm not sure if anything in this is useful after the first event as our automated tests continued on after the failure.

This looks different.

...

Mem-info:
Node 0 DMA per-cpu:
CPU0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 53
CPU1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
CPU2: Hot: hi: 186, btch: 31 usd: 20 Cold: hi: 62, btch: 15 usd: 47
CPU3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 56
Active:76 inactive:495856 dirty:0 writeback:0 unstable:0 free:3680 slab:9119 mapped:32 pagetables:637

No dirty pages, no pages under writeback.

Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3296 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:6684kB min:5712kB low:7140kB high:8568kB active:304kB inactive:1981624kB present:2052068kB

Inactive list is filled.

pages_scanned:4343329 all_unreclaimable? yes

We scanned our guts out and decided that nothing was reclaimable.

No available memory (MPOL_BIND): kill process 3492 (hald) score 0 or a child
No available memory (MPOL_BIND): kill process 7914 (top) score 0 or a child
No available memory (MPOL_BIND): kill process 4166 (nscd) score 0 or a child
No available memory (MPOL_BIND): kill process 17869 (xfs_repair) score 0 or a child

But in all cases a constrained memory policy was in use.
Re: [patch] md: pass down BIO_RW_SYNC in raid{1,10}
On Mon, 8 Jan 2007 10:08:34 +0100 Lars Ellenberg [EMAIL PROTECTED] wrote:

md raidX make_request functions strip off the BIO_RW_SYNC flag, thus introducing additional latency. fixing this in raid1 and raid10 seems to be straight forward enough. for our particular usage case in DRBD, passing this flag improved some initialization time from ~5 minutes to ~5 seconds.

That sounds like a significant fix. This patch also applies to 2.6.19 and I have tagged it for a -stable backport. Neil, are you OK with that?

Acked-by: NeilBrown [EMAIL PROTECTED]
Signed-off-by: Lars Ellenberg [EMAIL PROTECTED]
---

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index b30f74b..164b25d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -775,6 +775,7 @@ static int make_request(request_queue_t
 	struct bio_list bl;
 	struct page **behind_pages = NULL;
 	const int rw = bio_data_dir(bio);
+	const int do_sync = bio_sync(bio);
 	int do_barriers;
 	/*
@@ -835,7 +836,7 @@ static int make_request(request_queue_t
 	read_bio->bi_sector = r1_bio->sector + mirror->rdev->data_offset;
 	read_bio->bi_bdev = mirror->rdev->bdev;
 	read_bio->bi_end_io = raid1_end_read_request;
-	read_bio->bi_rw = READ;
+	read_bio->bi_rw = READ | do_sync;
 	read_bio->bi_private = r1_bio;
 	generic_make_request(read_bio);
@@ -906,7 +907,7 @@ #endif
 	mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
 	mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
 	mbio->bi_end_io = raid1_end_write_request;
-	mbio->bi_rw = WRITE | do_barriers;
+	mbio->bi_rw = WRITE | do_barriers | do_sync;
 	mbio->bi_private = r1_bio;
 	if (behind_pages) {
@@ -941,6 +942,8 @@ #endif
 	blk_plug_device(mddev->queue);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
+	if (do_sync)
+		md_wakeup_thread(mddev->thread);
 #if 0
 	while ((bio = bio_list_pop(&bl)) != NULL)
 		generic_make_request(bio);
@@ -1541,6 +1544,7 @@ static void raid1d(mddev_t *mddev)
 	 * We already have a nr_pending reference on these rdevs.
 	 */
 	int i;
+	const int do_sync = bio_sync(r1_bio->master_bio);
 	clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
 	clear_bit(R1BIO_Barrier, &r1_bio->state);
 	for (i=0; i < conf->raid_disks; i++)
@@ -1561,7 +1565,7 @@ static void raid1d(mddev_t *mddev)
 		conf->mirrors[i].rdev->data_offset;
 	bio->bi_bdev = conf->mirrors[i].rdev->bdev;
 	bio->bi_end_io = raid1_end_write_request;
-	bio->bi_rw = WRITE;
+	bio->bi_rw = WRITE | do_sync;
 	bio->bi_private = r1_bio;
 	r1_bio->bios[i] = bio;
 	generic_make_request(bio);
@@ -1593,6 +1597,7 @@ static void raid1d(mddev_t *mddev)
 		(unsigned long long)r1_bio->sector);
 	raid_end_bio_io(r1_bio);
 	} else {
+		const int do_sync = bio_sync(r1_bio->master_bio);
 		r1_bio->bios[r1_bio->read_disk] = mddev->ro ? IO_BLOCKED : NULL;
 		r1_bio->read_disk = disk;
@@ -1608,7 +1613,7 @@ static void raid1d(mddev_t *mddev)
 	bio->bi_sector = r1_bio->sector + rdev->data_offset;
 	bio->bi_bdev = rdev->bdev;
 	bio->bi_end_io = raid1_end_read_request;
-	bio->bi_rw = READ;
+	bio->bi_rw = READ | do_sync;
 	bio->bi_private = r1_bio;
 	unplug = 1;
 	generic_make_request(bio);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index f014191..a9401c0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -782,6 +782,7 @@ static int make_request(request_queue_t
 	int i;
 	int chunk_sects = conf->chunk_mask + 1;
 	const int rw = bio_data_dir(bio);
+	const int do_sync = bio_sync(bio);
 	struct bio_list bl;
 	unsigned long flags;
@@ -863,7 +864,7 @@ static int make_request(request_queue_t
 		mirror->rdev->data_offset;
 	read_bio->bi_bdev = mirror->rdev->bdev;
 	read_bio->bi_end_io = raid10_end_read_request;
-	read_bio->bi_rw = READ;
+	read_bio->bi_rw = READ | do_sync;
 	read_bio->bi_private =
Re: sata badness in 2.6.20-rc1? [Was: Re: md patches in -mm]
On Tue, 19 Dec 2006 15:26:00 -0800 (PST) Luben Tuikov [EMAIL PROTECTED] wrote:

The reason was that my dev tree was tainted by this bug:

 	if (good_bytes
-	    scsi_end_request(cmd, 1, good_bytes, !!result) == NULL)
+	    scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
 		return;

in scsi_io_completion(). I had there !!result which is wrong, and when I diffed against master, it produced a bad patch.

Oh. I thought that got sorted out. It's a shame this wasn't made clear to me.. As James mentioned one of the chunks is good and can go in. Please send a new patch, not referential to any previous patch or email, including full changelogging.
Re: sata badness in 2.6.20-rc1? [Was: Re: md patches in -mm]
On Sun, 17 Dec 2006 12:00:12 +0100 Rafael J. Wysocki [EMAIL PROTECTED] wrote: Okay, I have identified the patch that causes the problem to appear, which is fix-sense-key-medium-error-processing-and-retry.patch With this patch reverted -rc1-mm1 is happily running on my test box. That was rather unexpected. Thanks.
Re: md patches in -mm
On Fri, 15 Dec 2006 20:21:46 +0100 [EMAIL PROTECTED] wrote: From: Neil Brown [EMAIL PROTECTED] Date: Wed, Dec 06, 2006 at 06:20:57PM +1100 i.e. current -mm is good for 2.6.20 (though I have a few other little things I'll be sending in soon, they aren't related to the raid6 problem). 2.6.20-rc1-mm1 doesn't boot on my box, due to the fact that e2fsck gives Buffer I/O error on device /dev/md0, logical block 0 and after that 1,2,3,4,5,6,7,8,9, at which point it complains it can't read the superblock. It seems the raid6 problem hasn't gone away completely, after all. Odd. The only md patch in rc1-mm1 is the truly ancient md-dm-reduce-stack-usage-with-stacked-block-devices.patch Does 2.6.20-rc1 work?
Re: sata badness in 2.6.20-rc1? [Was: Re: md patches in -mm]
On Sat, 16 Dec 2006 07:50:01 +1100 Neil Brown [EMAIL PROTECTED] wrote:

On Friday December 15, [EMAIL PROTECTED] wrote: From: Neil Brown [EMAIL PROTECTED] Date: Wed, Dec 06, 2006 at 06:20:57PM +1100 i.e. current -mm is good for 2.6.20 (though I have a few other little things I'll be sending in soon, they aren't related to the raid6 problem). 2.6.20-rc1-mm1 doesn't boot on my box, due to the fact that e2fsck gives Buffer I/O error on device /dev/md0, logical block 0

But before that

raid5: device sdh1 operational as raid disk 1
raid5: device sdg1 operational as raid disk 0
raid5: device sdf1 operational as raid disk 5
raid5: device sde1 operational as raid disk 6
raid5: device sdd1 operational as raid disk 7
raid5: device sdc1 operational as raid disk 3
raid5: device sdb1 operational as raid disk 2
raid5: device sda1 operational as raid disk 4
raid5: allocated 8462kB for md0
raid5: raid level 6 set md0 active with 8 out of 8 devices, algorithm 2
RAID5 conf printout:
--- rd:8 wd:8
 disk 0, o:1, dev:sdg1
 disk 1, o:1, dev:sdh1
 disk 2, o:1, dev:sdb1
 disk 3, o:1, dev:sdc1
 disk 4, o:1, dev:sda1
 disk 5, o:1, dev:sdf1
 disk 6, o:1, dev:sde1
 disk 7, o:1, dev:sdd1
md0: bitmap initialized from disk: read 15/15 pages, set 1 bits, status: 0
created bitmap (233 pages) for device md0
md: super_written gets error=-5, uptodate=0
raid5: Disk failure on sde1, disabling device. Operation continuing on 7 devices
md: super_written gets error=-5, uptodate=0
raid5: Disk failure on sdg1, disabling device. Operation continuing on 6 devices
md: super_written gets error=-5, uptodate=0
raid5: Disk failure on sdf1, disabling device. Operation continuing on 5 devices
md: super_written gets error=-5, uptodate=0
raid5: Disk failure on sdc1, disabling device. Operation continuing on 4 devices
md: super_written gets error=-5, uptodate=0
raid5: Disk failure on sdb1, disabling device. Operation continuing on 3 devices
md: super_written gets error=-5, uptodate=0
raid5: Disk failure on sdh1, disabling device. Operation continuing on 2 devices
md: super_written gets error=-5, uptodate=0
raid5: Disk failure on sdd1, disabling device. Operation continuing on 1 devices
md: super_written gets error=-5, uptodate=0
raid5: Disk failure on sda1, disabling device. Operation continuing on 0 devices

Oh dear, that array isn't much good any more!

That is the second report I have had of this with sata drives. This was raid456, the other was raid1. Two different sata drivers are involved (sata_nv in this case, sata_uli in the other case). I think something bad happened in sata land just recently. The device driver is returning -EIO for a write without printing any messages.

OK, this is bad. The wheels do appear to have fallen off sata in rc1-mm1. Jeff, I shall send all the sata patches which I have at you one single time and I shall then drop the lot. So please don't flub them. I'll then do a rc1-mm2 without them.
Re: sata badness in 2.6.20-rc1? [Was: Re: md patches in -mm]
On Fri, 15 Dec 2006 13:05:52 -0800 Andrew Morton [EMAIL PROTECTED] wrote: Jeff, I shall send all the sata patches which I have at you one single time and I shall then drop the lot. So please don't flub them. I'll then do a rc1-mm2 without them.

hm, this is looking like a lot of work for not much gain. Rafael, are you able to do a quick chop and tell us whether these:

pci-move-pci_vdevice-from-libata-to-core.patch
pata_cs5530-suspend-resume-support-tweak.patch
ata-fix-platform_device_register_simple-error-check.patch
initializer-entry-defined-twice-in-pata_rz1000.patch
pata_via-suspend-resume-support-fix.patch
sata_nv-add-suspend-resume-support.patch
libata-simulate-report-luns-for-atapi-devices.patch
user-of-the-jiffies-rounding-patch-ata-subsystem.patch
libata-fix-oops-with-sparsemem.patch
sata_nv-fix-kfree-ordering-in-remove.patch
libata-dont-initialize-sg-in-ata_exec_internal-if-dma_none-take-2.patch
pci-quirks-fix-the-festering-mess-that-claims-to-handle-ide-quirks-ide-fix.patch

are innocent? Thanks.
Re: [PATCH 002 of 6] md: Change lifetime rules for 'md' devices.
On Tue, 31 Oct 2006 17:00:51 +1100 NeilBrown [EMAIL PROTECTED] wrote:

Currently md devices are created when first opened and remain in existence until the module is unloaded. This isn't a major problem, but it is somewhat ugly. This patch changes the lifetime rules so that an md device will disappear on the last close if it has no state.

This kills the G5:

EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=4
Modules linked in:
NIP: C01A31B8 LR: C018E5DC CTR: C01A3404
REGS: c17ff4a0 TRAP: 0300 Not tainted (2.6.19-rc4-mm1)
MSR: 90009032 <EE,ME,IR,DR> CR: 8448 XER:
DAR: 6B6B6B6B6B6B6BB3, DSISR: 4000
TASK = cff2b7f0[1899] 'nash' THREAD: c17fc000 CPU: 1
GPR00: 0008 C17FF720 C06B26D0 6B6B6B6B6B6B6B7B
GPR04: 0002 0001 0001 000200D0
GPR08: 00050C00 C01A3404 C01A318C
GPR12: 8444 C0535680 100FE350 100FE7B8
GPR16:
GPR20: 10005CD4 1009 10007E54
GPR24: C0472C80 6B6B6B6B6B6B6B7B C1FD2530
GPR28: C7B8C2F0 6B6B6B6B6B6B6B7B C057DAE8 C79A2550
NIP [C01A31B8] .kobject_uevent+0xac/0x55c
LR [C018E5DC] .__elv_unregister_queue+0x20/0x44
Call Trace:
[C17FF720] [C0562508] read_pipe_fops+0xd0/0xd8 (unreliable)
[C17FF840] [C018E5DC] .__elv_unregister_queue+0x20/0x44
[C17FF8D0] [C0195548] .blk_unregister_queue+0x58/0x9c
[C17FF960] [C019683C] .unlink_gendisk+0x1c/0x50
[C17FF9F0] [C0122840] .del_gendisk+0x98/0x22c
[C17FFA90] [C035B56C] .mddev_put+0xa0/0xe0
[C17FFB20] [C0362178] .md_release+0x84/0x9c
[C17FFBA0] [C00FDDE0] .__blkdev_put+0x204/0x220
[C17FFC50] [C00C765C] .__fput+0x234/0x274
[C17FFD00] [C00C5264] .filp_close+0x6c/0xfc
[C17FFD90] [C00C53B8] .sys_close+0xc4/0x178
[C17FFE30] [C000872C] syscall_exit+0x0/0x40
Instruction dump: 4e800420 0020 0250 0278 0280 0258 0260 0268 0270 3b20 2fb9 419e003c e81a0038 2fa0 7c1d0378 409e0070 7
eth0: no IPv6 routers present

It happens during initscripts. The machine has no MD devices.

config is at http://userweb.kernel.org/~akpm/config-g5.txt

Also, it'd be nice to enable CONFIG_MUST_CHECK and take a look at a few things...

drivers/md/md.c: In function `bind_rdev_to_array':
drivers/md/md.c:1379: warning: ignoring return value of `kobject_add', declared with attribute warn_unused_result
drivers/md/md.c:1385: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result
drivers/md/md.c: In function `md_probe':
drivers/md/md.c:2986: warning: ignoring return value of `kobject_register', declared with attribute warn_unused_result
drivers/md/md.c: In function `do_md_run':
drivers/md/md.c:3135: warning: ignoring return value of `sysfs_create_group', declared with attribute warn_unused_result
drivers/md/md.c:3150: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result
drivers/md/md.c: In function `md_check_recovery':
drivers/md/md.c:5446: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result
Re: [PATCH 001 of 4] md: Define backing_dev_info.congested_fn for raid0 and linear
On Tue, 29 Aug 2006 15:39:24 +1000 NeilBrown [EMAIL PROTECTED] wrote:

Each backing_dev needs to be able to report whether it is congested, either by modulating BDI_*_congested in ->state, or by defining a ->congested_fn. md/raid did neither of these. This patch adds a congested_fn which simply checks all component devices to see if they are congested.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

+static int linear_congested(void *data, int bits)
+{
+	mddev_t *mddev = data;
+	linear_conf_t *conf = mddev_to_conf(mddev);
+	int i, ret = 0;
+
+	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
+		request_queue_t *q = bdev_get_queue(conf->disks[i].rdev->bdev);
+		ret |= bdi_congested(&q->backing_dev_info, bits);

nit: `ret = ' would suffice here.

+static int raid0_congested(void *data, int bits)
+{
+	mddev_t *mddev = data;
+	raid0_conf_t *conf = mddev_to_conf(mddev);
+	mdk_rdev_t **devlist = conf->strip_zone[0].dev;
+	int i, ret = 0;
+
+	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
+		request_queue_t *q = bdev_get_queue(devlist[i]->bdev);
+
+		ret |= bdi_congested(&q->backing_dev_info, bits);

And here.

+	}
+	return ret;
+}
+
 static int create_strip_zones (mddev_t *mddev)
 {
@@ -236,6 +251,8 @@ static int create_strip_zones (mddev_t *
 	mddev->queue->unplug_fn = raid0_unplug;
 	mddev->queue->issue_flush_fn = raid0_issue_flush;
+	mddev->queue->backing_dev_info.congested_fn = raid0_congested;
+	mddev->queue->backing_dev_info.congested_data = mddev;
 	printk("raid0: done.\n");
 	return 0;
Re: [PATCH 000 of 4] md: Introduction
On Thu, 24 Aug 2006 17:40:56 +1000 NeilBrown [EMAIL PROTECTED] wrote:

Following are 4 patches against 2.6.18-rc4-mm2. The first 2 are bug fixes which should go in 2.6.18, and apply equally well to that tree as to -mm. The latter two should stay in -mm until after 2.6.18. The second patch is maybe bigger than it absolutely needs to be as a bugfix. If you like I can strip out all the rcu-extra-carefulness as a separate patch and just leave the important bit which involves moving the atomic_add down twenty-something lines.

Thanks, NeilBrown

[PATCH 001 of 4] md: Fix recent breakage of md/raid1 array checking
[PATCH 002 of 4] md: Fix issues with referencing rdev in md/raid1.
[PATCH 003 of 4] md: new sysfs interface for setting bits in the write-intent-bitmap
[PATCH 004 of 4] md: Remove unnecessary variable x in stripe_to_pdidx().

The second patch is against -mm and doesn't come within a mile of applying to mainline.
Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
Neil Brown [EMAIL PROTECTED] wrote: ... I have a patch which did that, but decided that the possibility of kmalloc failure at awkward times would make that not suitable. submit_bh() can and will allocate memory, although most decent device drivers should be OK. submit_bh (like all decent device drivers) uses a mempool for memory allocation so we can be sure that the delay in getting memory is bounded by the delay for a few IO requests to complete, and we can be sure the allocation won't fail. This is perfectly fine. My point is that some device drivers will apparently call into the memory allocator (should be GFP_NOIO) on the request submission path. I don't recall which drivers - but not mainstream ones. I don't think a_ops really provides an interface that I can use, partly because, as I said in a previous email, it isn't really a public interface to a filesystem. It's publicer than bmap+submit_bh! I don't know how you can say that. bmap() is a userspace API, not really a kernel one. And it assumes a block-backed filesystem. bmap is so public that it is exported to userspace through an IOCTL and is used by lilo (admittedly only for reading, not writing). More significantly it is used by swapfile which is a completely independent subsystem from the filesystem. Contrast this with a_ops. The primary users of a_ops are routines like generic_file_{read,write} and friends. These are tools - library routines - that are used by filesystems to implement their 'file_operations' which are the real public interface. As far as these uses go, it is not a public interface. Where a filesystem doesn't use some library routines, it does not need to implement the matching functionality in the a_op interface. Well yeah. If one wants to do filesystem I/O in-kernel then one should use the filesystem I/O entry points: read() and write() (and get flamed ;)).
If one wants to cut in at a lower level than that then we get into a pickle because different types of filesystems do quite different things to implement read() and write(). The other main user is the 'VM' which might try to flush out or invalidate pages. However the VM isn't using this interface to interact with files, but only to interact with pages, and it doesn't much care what is done with the pages providing they get clean, or get released, or whatever. The way I perceive Linux design/development, active usage is far more significant than documented design. If some feature of an interface isn't being actively used - by in-kernel code - then you cannot be sure that feature will be uniformly implemented, or that it won't change subtly next week. So when I went looking for the best way to get md/bitmap to write to a file, I didn't just look at the interface specs (which are pretty poorly documented anyway), I looked at existing code. I can find 3 different parts of the kernel that write to a file. They are: swap-file, loop, and nfsd. nfsd uses vfs_read/vfs_write, which have too many failure/delay modes for me to safely use. loop uses prepare_write/commit_write (if available) or f_op->write (not vfs_write - I wonder why), which is not much better than what nfsd uses. And as far as I can tell loop never actually flushes data to storage (hope no-one thinks a journalling filesystem on loop is a good idea), so it isn't a good model to follow. [[If loop was a good model to follow, I would have rejected the code for writing to a file in the first place, only accepted code for writing to a device, and insisted on using loop to close the gap]] That leaves swap-file as the only in-kernel user of a filesystem that accesses the file in a way similar to what I need. Is it any surprise that I chose to copy it? swapfile doesn't use vfs_read() for swapin...
But if the pagecache is dirty then invalidate_inode_pages() will leave it dirty and the VM will later write it back, corrupting your bitmap file. You should get i_writecount, fsync the file and then run invalidate_inode_pages(). Well, I am currently reading the file through the pagecache, but yes, I should put an fsync in there (and the invalidate before read is fairly pointless. It is the invalidate after close that is important). umm, the code in its present form _will_ corrupt MD bitmap files. invalidate_inode_pages() will not invalidate dirty pagecache, and that dirty pagecache will later get written back, undoing any of your intervening direct-io writes. Most of the time, MD will do another direct-io write _after_ the VM has done that writeback and things will get fixed up. But there is a window. So that fsync() is required. You might as well use kernel_read() too, if you insist on being oddball ;) I didn't know about that one.. If you would feel more comfortable with kernel_read I would be more than happy to use it. We might as well save some
Re: [PATCH 001 of 3] md: Change md/bitmap file handling to use bmap to file blocks-fix
NeilBrown [EMAIL PROTECTED] wrote:

+	do_sync_file_range(file, 0, LLONG_MAX,
+			   SYNC_FILE_RANGE_WRITE |
+			   SYNC_FILE_RANGE_WAIT_AFTER);

That needs a SYNC_FILE_RANGE_WAIT_BEFORE too. Otherwise any dirty, under-writeback pages will remain dirty.

I'll make that change.
Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
Neil Brown [EMAIL PROTECTED] wrote: On Saturday May 13, [EMAIL PROTECTED] wrote: Paul Clements [EMAIL PROTECTED] wrote: Andrew Morton wrote: The loss of pagecache coherency seems sad. I assume there's never a requirement for userspace to read this file. Actually, there is. mdadm reads the bitmap file, so that would be broken. Also, it's just useful for a user to be able to read the bitmap (od -x, or similar) to figure out approximately how much more he's got to resync to get an array in-sync. Other than reading the bitmap file, I don't know of any way to determine that. Read it with O_DIRECT :( Which is exactly what the next release of mdadm does. As the patch comment said:

: With this approach the pagecache may contain data which is inconsistent with
: what is on disk. To alleviate the problems this can cause, md invalidates
: the pagecache when releasing the file. If the file is to be examined
: while the array is active (a non-critical but occasionally useful function),
: O_DIRECT io must be used. And new version of mdadm will have support for this.

Which doesn't help `od -x' and is going to cause older mdadm userspace to mysteriously and subtly fail. Or does the user-kernel interface have versioning which will prevent this?
Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
Neil Brown [EMAIL PROTECTED] wrote: On Friday May 12, [EMAIL PROTECTED] wrote: NeilBrown [EMAIL PROTECTED] wrote: If md is asked to store a bitmap in a file, it tries to hold onto the page cache pages for that file, manipulate them directly, and call a cocktail of operations to write the file out. I don't believe this is a supportable approach. erk. I think it's better than... This patch changes the approach to use the same approach as swap files. i.e. bmap is used to enumerate all the block addresses of parts of the file and we write directly to those blocks of the device. That's going in at a much lower level. Even swapfiles don't assume buffer_heads. I'm not assuming buffer_heads. I'm creating buffer heads and using them for my own purposes. These are my pages and my buffer heads. None of them belong to the filesystem. Right, so it's incoherent with pagecache and userspace can no longer usefully read this file. The buffer_heads are simply a convenient data-structure to record the several block addresses for each page. I could have equally created an array storing all the addresses, and built the required bios by hand at write time. But buffer_heads did most of the work for me, so I used them. OK. Yes, it is a lower level, but 1/ I am certain that there will be no kmalloc problems and 2/ Because it is exactly the level used by swapfile, I know that it is sufficiently 'well defined' that no-one is going to break it. It would be nicer of course to actually use the mm/page_io.c code. That would involve implementing swap_bmap() and reimplementing the get_swap_bio() stuff in terms of a_ops->bmap(). But the swap code can afford to skip blockruns which aren't page-sized and it uses that capability nicely. You cannot do that. All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me. Operating at the pagecache a_ops level looked better, and more filesystem-independent.
If you really want filesystem independence, you need to use vfs_read and vfs_write to read/write the file. yup. I have a patch which did that, but decided that the possibility of kmalloc failure at awkward times would make that not suitable. submit_bh() can and will allocate memory, although most decent device drivers should be OK. There are tricks we can do with writepage. If the backing filesystem uses buffer_heads and if you hold a ref on the page then we know that there won't be any buffer_head allocations nor any disk reads in the writepage path. It'll go direct into bio_alloc+submit_bio, just as you're doing now. IOW: no gain. So I now use vfs_read to read in the file (just like knfsd) and bmap/submit_bh to write out the file (just like swapfile). I don't think a_ops really provides an interface that I can use, partly because, as I said in a previous email, it isn't really a public interface to a filesystem. It's publicer than bmap+submit_bh! I haven't looked at this patch at all closely yet. Do I really need to? I assume you are asking that because you hope I will retract the patch. Was kinda hoping that would be the outcome. It's rather gruesome, using set_fs()+vfs_read() on one side and submit_bh() on the other. Are you sure the handling at EOF for a non-multiple-of-PAGE_SIZE file is OK? The loss of pagecache coherency seems sad. I assume there's never a requirement for userspace to read this file. invalidate_inode_pages() is best-effort. If someone else has the page locked or if the page is mmapped then the attempt to take down the pagecache will fail. That's relatively-OK, because it'll just lead to userspace seeing wrong stuff, and we've already conceded that. But if the pagecache is dirty then invalidate_inode_pages() will leave it dirty and the VM will later write it back, corrupting your bitmap file. You should get i_writecount, fsync the file and then run invalidate_inode_pages(). 
Or not run invalidate_inode_pages() - it doesn't gain anything and will just reduce the observability of bugs. Better off leaving the pagecache there all the time so that any rarely-occurring bugs become all-the-time bugs. You might as well use kernel_read() too, if you insist on being oddball ;) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
Paul Clements [EMAIL PROTECTED] wrote: Andrew Morton wrote: The loss of pagecache coherency seems sad. I assume there's never a requirement for userspace to read this file. Actually, there is. mdadm reads the bitmap file, so that would be broken. Also, it's just useful for a user to be able to read the bitmap (od -x, or similar) to figure out approximately how much more he's got to resync to get an array in-sync. Other than reading the bitmap file, I don't know of any way to determine that. Read it with O_DIRECT :(
Re: [PATCH 002 of 8] md/bitmap: Remove bitmap writeback daemon.
NeilBrown [EMAIL PROTECTED] wrote: ./drivers/md/bitmap.c | 115 ++ hmm. I hope we're not doing any of that filesystem I/O within the context of submit_bio() or kblockd or anything like that. Looks OK from a quick scan. a_ops->commit_write() already ran set_page_dirty(), so you don't need that in there. I assume this always works in units of a complete page? It's strange to do prepare_write() followed immediately by commit_write(). Normally prepare_write() will do some prereading, but it's smart enough to not do that if the caller is preparing to write the whole page. We normally use PAGE_CACHE_SIZE for these things, not PAGE_SIZE. Same diff. If you have a page and you want to write the whole thing out then there's really no need to run prepare_write or commit_write at all. Just initialise the whole page, run set_page_dirty() then write_one_page(). Perhaps it should check that the backing filesystem actually implements commit_write(), prepare_write(), readpage(), etc. Some might not, and the user will get taught not to do that via an oops.
Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
NeilBrown [EMAIL PROTECTED] wrote: If md is asked to store a bitmap in a file, it tries to hold onto the page cache pages for that file, manipulate them directly, and call a cocktail of operations to write the file out. I don't believe this is a supportable approach. erk. I think it's better than... This patch changes it to use the same approach as swap files. i.e. bmap is used to enumerate all the block addresses of parts of the file and we write directly to those blocks of the device. That's going in at a much lower level. Even swapfiles don't assume buffer_heads. When playing with bmap() one needs to be careful that nobody has truncated the file on you, else you end up writing to disk blocks which aren't part of the file any more. All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me. Operating at the pagecache a_ops level looked better, and more filesystem-independent. I haven't looked at this patch at all closely yet. Do I really need to?
Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
Neil Brown [EMAIL PROTECTED] wrote: On Sunday April 30, [EMAIL PROTECTED] wrote: NeilBrown [EMAIL PROTECTED] wrote: When a md array has been idle (no writes) for 20msecs it is marked as 'clean'. This delay turns out to be too short for some real workloads. So increase it to 200msec (the time to update the metadata should be a tiny fraction of that) and make it sysfs-configurable. ...

+	safe_mode_delay
+	  When an md array has seen no write requests for a certain period
+	  of time, it will be marked as 'clean'. When another write
+	  request arrives, the array is marked as 'dirty' before the write
+	  commences. This is known as 'safe_mode'.
+	  The 'certain period' is controlled by this file which stores the
+	  period as a number of seconds. The default is 200msec (0.200).
+	  Writing a value of 0 disables safemode.

Why not make the units milliseconds? Rename this to safe_mode_delay_msecs to remove any doubt. Because umpteen years ago when I was adding thread-usage statistics to /proc/net/rpc/nfsd I used milliseconds and Linus asked me to make it seconds - a much more obvious unit. See Email below. It seems very sensible to me. That's output. It's easier to do the conversion with output. And I guess one could argue that lots of people read /proc files, but few write to them. Generally I don't think we should be teaching the kernel to accept pretend-floating-point numbers like this, especially when a) delay in milliseconds is such a simple concept and b) it's so easy to go from float to milliseconds in userspace. Do you really expect that humans (really dumb ones ;)) will be echoing numbers into this file? Or will it mainly be a thing for mdadm to fiddle with?
Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
NeilBrown [EMAIL PROTECTED] wrote: When a md array has been idle (no writes) for 20msecs it is marked as 'clean'. This delay turns out to be too short for some real workloads. So increase it to 200msec (the time to update the metadata should be a tiny fraction of that) and make it sysfs-configurable. ...

+	safe_mode_delay
+	  When an md array has seen no write requests for a certain period
+	  of time, it will be marked as 'clean'. When another write
+	  request arrives, the array is marked as 'dirty' before the write
+	  commences. This is known as 'safe_mode'.
+	  The 'certain period' is controlled by this file which stores the
+	  period as a number of seconds. The default is 200msec (0.200).
+	  Writing a value of 0 disables safemode.

Why not make the units milliseconds? Rename this to safe_mode_delay_msecs to remove any doubt.

+static ssize_t
+safe_delay_store(mddev_t *mddev, const char *cbuf, size_t len)
+{
+	int scale=1;
+	int dot=0;
+	int i;
+	unsigned long msec;
+	char buf[30];
+	char *e;
+	/* remove a period, and count digits after it */
+	if (len >= sizeof(buf))
+		return -EINVAL;
+	strlcpy(buf, cbuf, len);
+	buf[len] = 0;
+	for (i=0; i<len; i++) {
+		if (dot) {
+			if (isdigit(buf[i])) {
+				buf[i-1] = buf[i];
+				scale *= 10;
+			}
+			buf[i] = 0;
+		} else if (buf[i] == '.') {
+			dot=1;
+			buf[i] = 0;
+		}
+	}
+	msec = simple_strtoul(buf, &e, 10);
+	if (e == buf || (*e && *e != '\n'))
+		return -EINVAL;
+	msec = (msec * 1000) / scale;
+	if (msec == 0)
+		mddev->safemode_delay = 0;
+	else {
+		mddev->safemode_delay = (msec*HZ)/1000;
+		if (mddev->safemode_delay == 0)
+			mddev->safemode_delay = 1;
+	}
+	return len;

And most of that goes away.
Re: [PATCH 001 of 3] md: Don't clear bits in bitmap when writing to one device fails during recovery.
NeilBrown [EMAIL PROTECTED] wrote:

+	if (!uptodate) {
+		int sync_blocks = 0;
+		sector_t s = r1_bio->sector;
+		long sectors_to_go = r1_bio->sectors;
+		/* make sure these bits don't get cleared. */
+		do {
+			bitmap_end_sync(mddev->bitmap, r1_bio->sector,
+					&sync_blocks, 1);
+			s += sync_blocks;
+			sectors_to_go -= sync_blocks;
+		} while (sectors_to_go > 0);
+		md_error(mddev, conf->mirrors[mirror].rdev);
+	}

Can mddev->bitmap be NULL? If so, will the above loop do the right thing when this:

void bitmap_end_sync(struct bitmap *bitmap, sector_t offset, int *blocks, int aborted)
{
	bitmap_counter_t *bmc;
	unsigned long flags;
/*
	if (offset == 0) printk("bitmap_end_sync 0 (%d)\n", aborted);
*/
	if (bitmap == NULL) {
		*blocks = 1024;
		return;
	}

triggers?
Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.
NeilBrown [EMAIL PROTECTED] wrote:

+static int resize_stripes(raid5_conf_t *conf, int newsize)
+{
+	/* make all the stripes able to hold 'newsize' devices.
+	 * New slots in each stripe get 'page' set to a new page.
+	 * We allocate all the new stripes first, then if that succeeds,
+	 * copy everything across.
+	 * Finally we add new pages. This could fail, but we leave
+	 * the stripe cache at its new size, just with some pages empty.
+	 *
+	 * We use GFP_NOIO allocations as IO to the raid5 is blocked
+	 * at some points in this operation.
+	 */
+	struct stripe_head *osh, *nsh;
+	struct list_head newstripes, oldstripes;

You can use LIST_HEAD() here, avoid the separate INIT_LIST_HEAD().

+	struct disk_info *ndisks;
+	int err = 0;
+	kmem_cache_t *sc;
+	int i;
+
+	if (newsize <= conf->pool_size)
+		return 0; /* never bother to shrink */
+
+	sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
+			       sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
+			       0, 0, NULL, NULL);

kmem_cache_create() internally does a GFP_KERNEL allocation.

+	if (!sc)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&newstripes);
+	for (i = conf->max_nr_stripes; i; i--) {
+		nsh = kmem_cache_alloc(sc, GFP_NOIO);

So either this can use GFP_KERNEL, or we have a problem.
Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.
NeilBrown [EMAIL PROTECTED] wrote:

+	wait_event_lock_irq(conf->wait_for_stripe,
+			    !list_empty(&conf->inactive_list),
+			    conf->device_lock,
+			    unplug_slaves(conf->mddev);
+			    );

Boy, that's an ugly-looking thing, isn't it? __wait_event_lock_irq() already puts a semicolon after `cmd' so I think the one here isn't needed, which would make it a bit less of a surprise to look at.
Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.
NeilBrown [EMAIL PROTECTED] wrote:

+	/* Got them all.
+	 * Return the new ones and free the old ones.
+	 * At this point, we are holding all the stripes so the array
+	 * is completely stalled, so now is a good time to resize
+	 * conf->disks.
+	 */
+	ndisks = kzalloc(newsize * sizeof(struct disk_info), GFP_NOIO);
+	if (ndisks) {
+		for (i=0; i<conf->raid_disks; i++)
+			ndisks[i] = conf->disks[i];
+		kfree(conf->disks);
+		conf->disks = ndisks;
+	} else
+		err = -ENOMEM;
+	while(!list_empty(&newstripes)) {
+		nsh = list_entry(newstripes.next, struct stripe_head, lru);
+		list_del_init(&nsh->lru);
+		for (i=conf->raid_disks; i < newsize; i++)
+			if (nsh->dev[i].page == NULL) {
+				struct page *p = alloc_page(GFP_NOIO);
+				nsh->dev[i].page = p;
+				if (!p)
+					err = -ENOMEM;
+			}
+		release_stripe(nsh);
+	}
+	while(!list_empty(&oldstripes)) {
+		osh = list_entry(oldstripes.next, struct stripe_head, lru);
+		list_del(&osh->lru);
+		kmem_cache_free(conf->slab_cache, osh);
+	}
+	kmem_cache_destroy(conf->slab_cache);
+	conf->slab_cache = sc;
+	conf->active_name = 1-conf->active_name;
+	conf->pool_size = newsize;
+	return err;
+}

Are you sure the -ENOMEM handling here is solid? It looks strange. There are a few more GFP_NOIOs in this function, which can possibly become GFP_KERNEL.
Re: [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding.
NeilBrown [EMAIL PROTECTED] wrote:

 retry:
	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw & RWA_MASK));
+	sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw & RWA_MASK));
	if (sh) {
-		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw & RW_MASK))) {
-			/* Add failed due to overlap. Flush everything
+		if (unlikely(conf->expand_progress != MaxSector)) {
+			/* expansion might have moved on while waiting for a
+			 * stripe, so we must do the range check again.
+			 */
+			int must_retry = 0;
+			spin_lock_irq(&conf->device_lock);
+			if (logical_sector < conf->expand_progress &&
+			    disks == conf->previous_raid_disks)
+				/* mismatch, need to try again */
+				must_retry = 1;
+			spin_unlock_irq(&conf->device_lock);
+			if (must_retry) {
+				release_stripe(sh);
+				goto retry;
+			}
+		}

The locking in here looks strange. We take the lock, do some arithmetic and some tests and then drop the lock again. Is it not possible that the result of those tests now becomes invalid?
Re: [PATCH 007 of 13] md: Core of raid5 resize process
NeilBrown [EMAIL PROTECTED] wrote:

@@ -4539,7 +4543,9 @@ static void md_do_sync(mddev_t *mddev)
 		 */
 		max_sectors = mddev->resync_max_sectors;
 		mddev->resync_mismatches = 0;
-	} else
+	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
+		max_sectors = mddev->size << 1;
+	else
 		/* recovery follows the physical size of devices */
 		max_sectors = mddev->size << 1;

This change is a no-op. Intentional?
Re: [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally.
NeilBrown [EMAIL PROTECTED] wrote:

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-03-17 11:48:58.0 +1100
+++ ./drivers/md/raid5.c	2006-03-17 11:48:58.0 +1100
@@ -1747,8 +1747,9 @@ static int make_request(request_queue_t

That's a fairly complex function.. I wonder about this:

	spin_lock_irq(&conf->device_lock);
	if (--bi->bi_phys_segments == 0) {
		int bytes = bi->bi_size;
		if ( bio_data_dir(bi) == WRITE )
			md_write_end(mddev);
		bi->bi_size = 0;
		bi->bi_end_io(bi, bytes, 0);
	}
	spin_unlock_irq(&conf->device_lock);

bi_end_io() can be somewhat expensive. Does it need to happen under the lock?
Re: [PATCH md 013 of 20] Count corrected read errors per drive.
NeilBrown [EMAIL PROTECTED] wrote:

+	errors
+	  An approximate count of read errors that have been detected on
+	  this device but have not caused the device to be evicted from
+	  the array (either because they were corrected or because they
+	  happened while the array was read-only). When using version-1
+	  metadata, this value persists across restarts of the array.

Looks funny. Mixture of tabs and spaces I guess. fixes it
Re: [PATCH md 002 of 14] Allow raid1 to check consistency
Neil Brown [EMAIL PROTECTED] wrote: Would you accept:

--
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./mm/swap.c |    2 ++
 1 file changed, 2 insertions(+)

diff ./mm/swap.c~current~ ./mm/swap.c
--- ./mm/swap.c~current~	2005-12-06 10:29:16.0 +1100
+++ ./mm/swap.c	2005-12-06 10:29:25.0 +1100
@@ -36,6 +36,8 @@ int page_cluster;
 void put_page(struct page *page)
 {
+	if (unlikely(page==NULL))
+		return;
 	if (unlikely(PageCompound(page))) {
 		page = (struct page *)page_private(page);
 		if (put_page_testzero(page)) {

eek. That's an all-over-the-place-fast-path! Or should I open code this in md ? Yes please.
Re: [PATCH md 006 of 14] Make /proc/mdstat pollable.
NeilBrown [EMAIL PROTECTED] wrote: +DECLARE_WAIT_QUEUE_HEAD(md_event_waiters); static scope?
Re: [PATCH md 008 of 14] Convert md to use kzalloc throughout
NeilBrown [EMAIL PROTECTED] wrote:

-	conf = kmalloc (sizeof (*conf) + mddev->raid_disks*sizeof(dev_info_t),
+	conf = kzalloc (sizeof (*conf) + mddev->raid_disks*sizeof(dev_info_t),

-	new = (mddev_t *) kmalloc(sizeof(*new), GFP_KERNEL);
+	new = (mddev_t *) kzalloc(sizeof(*new), GFP_KERNEL);

-	rdev = (mdk_rdev_t *) kmalloc(sizeof(*rdev), GFP_KERNEL);
+	rdev = (mdk_rdev_t *) kzalloc(sizeof(*rdev), GFP_KERNEL);

It'd be nice to nuke the unneeded cast while we're there. edits the diff OK, I did that.
Re: [PATCH md 010 of 14] Convert various kmap calls to kmap_atomic
NeilBrown [EMAIL PROTECTED] wrote:

+	paddr = kmap_atomic(page, KM_USER0);
+	memset(paddr + offset, 0xff, PAGE_SIZE - offset);

This page which is being altered is a user-visible one, no? A pagecache page? We must always run flush_dcache_page() against a page when the kernel modifies a user-visible page's contents. Otherwise, on some architectures, the modification won't be visible at the different virtual address.
Re: [PATCH md 002 of 2] Make md threads interruptible again.
NeilBrown [EMAIL PROTECTED] wrote: Despite the fact that md threads don't need to be signalled, and won't respond to signals anyway, we need to have an 'interruptible' wait, else they stay in 'D' state and add to the load average. ...

+	if (signal_pending(current))
+		flush_signals(current);

Kernel threads don't accept signals by default, so the above is unneeded. I'll leave it in there now, since that's how things used to be, but please make a note to nuke that check, then test it to make sure I'm telling the truth, then fix it up later on?
Re: File corruption on LVM2 on top of software RAID1
Simon Matter [EMAIL PROTECTED] wrote: While looking at some data corruption vulnerability reports on Securityfocus I wonder why this issue does not get any attention from distributors. I have an open bugzilla report with RedHat, have an open customer service request with RedHat, have mailed people directly. No real feedback. I'm now in the process of restoring integrity of my data with the help of backups and mirrored data. Maybe I just care too much about other people's data, but I know that this bug will corrupt files on hundreds or thousands of servers today and most people simply don't know it. Did I miss something? I guess the bug hit really rarely and maybe the reports were lost in the permanent background noise of dodgy hardware. We only found and fixed it last week, due to much sleuthing by Matthew Stapleton. I assume vendor updates are in the pipeline.
Re: File corruption on LVM2 on top of software RAID1
Simon Matter [EMAIL PROTECTED] wrote: Hi, Please CC me as I'm not subscribed to the kernel list. I had a hard time identifying a serious problem in the 2.6 linux kernel. Yes, it is a serious problem. It all started while evaluating RHEL4 for new servers. My data integrity tests gave me bad results - which I couldn't believe - and my first idea was - of course - bad hardware. I ordered new SCSI disks instead of the IDE disks, took another server, spent some money again, tried again and again. That's all long ago now... In my tests I get corrupt files on LVM2 which is on top of software raid1. (This is a common setup even mentioned in the software RAID HOWTO and has worked for me on RedHat 9 / kernel 2.4 for a long time now and it's my favourite configuration). Now, I tested two different distributions, three kernels, three different filesystems and three different hardware setups. I can always reproduce it with the following easy script:

LOGF=/root/diff.log
while true; do
  rm -rf /home/XXX2
  rsync -a /home/XXX/ /home/XXX2
  date >> $LOGF
  diff -r /home/XXX /home/XXX2 >> $LOGF
done

the files in /home/XXX are ~15G of ISO images and rpms. 
diff.log looks like this:

Wed Aug 3 13:45:57 CEST 2005
Binary files /home/XXX/ES3-U3/rhel-3-U3-i386-es-disc3.iso and /home/XXX2/ES3-U3/rhel-3-U3-i386-es-disc3.iso differ
Wed Aug 3 14:09:14 CEST 2005
Binary files /home/XXX/8.0/psyche-i386-disc1.iso and /home/XXX2/8.0/psyche-i386-disc1.iso differ
Wed Aug 3 14:44:17 CEST 2005
Binary files /home/XXX/7.3/valhalla-i386-disc3.iso and /home/XXX2/7.3/valhalla-i386-disc3.iso differ
Wed Aug 3 15:15:05 CEST 2005
Wed Aug 3 15:45:40 CEST 2005

Tested software:
1) RedHat EL4: kernel-2.6.9-11.EL, vanilla 2.6.12.3 kernel; filesystems: EXT2, EXT3, XFS
2) NOVELL/SUSE 9.3: kernel-default-2.6.11.4-21.7; filesystem: EXT3

Tested hardware:
1) - ASUS P2B-S board
   - CPU PIII 450MHz
   - Intel 440BX/ZX/DX Chipset
   - 4x128M memory (ECC enabled)
   - 2x IDE disks Seagate Barracuda 400G, connected to onboard Intel PIIX4 Ultra 33
   - Promise Ultra100TX2 adapter for additional tests
2) - DELL PowerEdge 1400
   - CPU PIII 800MHz
   - ServerWorks OSB4 Chipset
   - 4x256M memory (ECC enabled)
   - 2x U320 SCSI disks Maxtor Atlas 10K 146G
   - onboard Adaptec aic7899 Ultra160 SCSI adapter
3) - DELL PowerEdge 1850
   - CPU P4 XEON 2.8GHz
   - 2G memory
   - 2x U320 SCSI disks Maxtor Atlas 10K 73G SCA
   - onboard LSI53C1030 SCSI adapter

I've put some files together from the last test on the PE1850 server: http://www.invoca.ch/bugs/linux-2.6-corruption-on-lvm2-on-raid1/ I've also filed a bug with RedHat: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=164696 I have spent a lot of time on this bug because I consider it very serious. I'm not a kernel hacker but if there is anything I can do to fix this, let me know. Thanks for doing that. There's one fix against 2.6.12.3 which is needed, but 2.6.9 didn't have the bug which this fix addresses. So unless 2.6.9 has a different problem, this won't help. But still, you should include this in testing please. 
diff -puN fs/bio.c~bio_clone-fix fs/bio.c
--- devel/fs/bio.c~bio_clone-fix	2005-07-28 00:39:40.0 -0700
+++ devel-akpm/fs/bio.c	2005-07-28 01:02:34.0 -0700
@@ -261,6 +261,7 @@ inline void __bio_clone(struct bio *bio,
 	 */
 	bio->bi_vcnt = bio_src->bi_vcnt;
 	bio->bi_size = bio_src->bi_size;
+	bio->bi_idx = bio_src->bi_idx;
 	bio_phys_segments(q, bio);
 	bio_hw_segments(q, bio);
 }
_
Re: [PATCH md 000 of 7] Introduction
NeilBrown [EMAIL PROTECTED] wrote: Following are 7 patches for md in 2.6.13-rc4 They are all fairly well tested, with the possible exception of '4' - I haven't actually tried throwing BIO_RW_BARRIER requests at any md devices. However the code is very straight forward. I'm happy (even keen) for these to go into 2.6.13. If it's getting a bit late, then 2 is probably the most important. The others we can probably live without. hm, OK. Merging 1) and 2) seems sane. I must say that I worry about 4) and would prefer to defer things if poss. If there are others there which you really would prefer to see in 2.6.13 then please let me know.

[PATCH md 001 of 7] Remove a stray debugging printk.
[PATCH md 002 of 7] Make 'md' an alias for 'md-mod'
[PATCH md 003 of 7] Fix minor error in raid10 read-balancing calculation.
[PATCH md 004 of 7] Fail IO requests to md that require a barrier.
[PATCH md 005 of 7] Always honour md bitmap being read from disk
[PATCH md 006 of 7] Yet another attempt to get bitmap-based resync to do the right thing in all cases...
[PATCH md 007 of 7] Make sure md bitmap updates are flushed when array is stopped.
Re: [PATCH md 000 of 7] Introduction
Neil Brown [EMAIL PROTECTED] wrote: On Wednesday August 3, [EMAIL PROTECTED] wrote: NeilBrown [EMAIL PROTECTED] wrote: Following are 7 patches for md in 2.6.13-rc4 They are all fairly well tested, with the possible exception of '4' - I haven't actually tried throwing BIO_RW_BARRIER requests at any md devices. However the code is very straight forward. I'm happy (even keen) for these to go into 2.6.13. If it's getting a bit late, then 2 is probably the most important. The others we can probably live without. hm, OK. Merging 1) and 2) seems sane. I must say that I worry about 4) and would prefer to defer things if poss. Fair enough. 4 can wait until 2.6.14 (only a couple of months away, right :-) Thereabouts.. If there are others there which you really would prefer to see in 2.6.13 then please let me know. 5, 6, and 7 I would really prefer to be in 2.6.13. OK. The intent-bitmap stuff is broken (in small but potentially significant ways) without them, and they are complete no-ops if bitmaps aren't enabled: 5 only touches bitmap.c 6 removes the setting for R1BIO_Degraded which is not used and slightly re-arranges an 'if' statement. All other changes are complete no-ops if bitmaps aren't enabled. 7: if bitmaps aren't enabled, all this does is call wait_event exactly the same way as will very possibly be called a few lines later when md_update_sb is called. So I'm convinced (and hopefully convincing) that they don't have any significant effect if bitmaps aren't enabled, and fix genuine problems with bitmaps, and so are appropriate for 2.6.13... We'll see ;)
Re: [PATCH md 0 of 12] Introduction
NeilBrown [EMAIL PROTECTED] wrote: Andrew: Are you happy to keep collecting these as a list of patches (bugs followed by bug-fixes :-), or would it be easier if I merged all the bug fixes into earlier patches and just resent a small number of add-functionality patches?? What I'll generally do is to topologically sort the patches. Work out what patch each of the new patches fixes and then rename the patches to name-of-the-patch-which-is-being-fixed-fix.patch. So if you had foo.patch bar.patch and then sent me [patch] frob the nozzle I'd work out that this new patch is fixing foo.patch and I'd name it foo-fix.patch and the sequence would become: foo.patch foo-fix.patch bar.patch and then at some time in the future I'll collapse foo-fix.patch, foo-fix-fix.patch, etc into foo.patch. Of course it doesn't always work out that simply ;) In this case it's not clear that a tsort will work out right. I'll take a look.
Re: [PATCH md 0 of 12] Introduction
Andrew Morton [EMAIL PROTECTED] wrote: Of course it doesn't always work out that simply ;) In this case it's not clear that a tsort will work out right. I'll take a look. Well it wasn't completely clean, and I forgot to rename the patches, but this is how the tsort came out:

md-merge-md_enter_safemode-into-md_check_recovery.patch
md-improve-locking-on-safemode-and-move-superblock-writes.patch
md-improve-the-interface-to-sync_request.patch
md-optimised-resync-using-bitmap-based-intent-logging.patch
+md-a-couple-of-tidyups-relating-to-the-bitmap-file.patch
+md-call-bitmap_daemon_work-regularly.patch
+md-print-correct-pid-for-newly-created-bitmap-writeback-daemon.patch
+md-minor-code-rearrangement-in-bitmap_init_from_disk.patch
+md-make-sure-md-bitmap-is-cleared-on-a-clean-start.patch
md-printk-fix.patch
+md-improve-debug-printing-of-bitmap-superblock.patch
+md-check-return-value-of-write_page-rather-than-ignore-it.patch
+md-enable-the-bitmap-write-back-daemon-and-wait-for-it.patch
+md-dont-skip-bitmap-pages-due-to-lack-of-bit-that-we-just-cleared.patch
md-optimised-resync-using-bitmap-based-intent-logging-fix.patch
md-raid1-support-for-bitmap-intent-logging.patch
+md-fix-bug-when-raid1-attempts-a-partial-reconstruct.patch
md-raid1-support-for-bitmap-intent-logging-fix.patch
md-optimise-reconstruction-when-re-adding-a-recently-failed-drive.patch
md-fix-deadlock-due-to-md-thread-processing-delayed-requests.patch
+md-allow-md-intent-bitmap-to-be-stored-near-the-superblock.patch
+md-allow-md-to-update-multiple-superblocks-in-parallel.patch

So we can later work out which of these to fold into which other ones.
Re: [PATCH md 0 of 4] Introduction
NeilBrown [EMAIL PROTECTED] wrote:

> The first two are trivial and should apply equally to 2.6.11. The second two fix bugs that were introduced by the recent bitmap-based-intent-logging patches and so are not relevant to 2.6.11 yet.

The changelog for the "Fix typo in super_1_sync" patch doesn't actually say what the patch does. What are the user-visible consequences of not fixing this?

Is the bitmap stuff now ready for Linus?
Fw: [Bugme-new] [Bug 4211] New: md configuration destroys disk GPT label
Begin forwarded message:

Date: Mon, 14 Feb 2005 05:29:22 -0800
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: [Bugme-new] [Bug 4211] New: md configuration destroys disk GPT label

http://bugme.osdl.org/show_bug.cgi?id=4211

Summary: md configuration destroys disk GPT label
Kernel Version: 2.4.18-e.31smp
Status: NEW
Severity: high
Owner: [EMAIL PROTECTED]
Submitter: [EMAIL PROTECTED]
Distribution: RedHat Advanced Server 2.1
Hardware Environment: NEC system, 4 CPU IA-64, 8 GB RAM, direct-connected FC disks
Software Environment: OS + Oracle installation

Problem Description: An md configuration based on complete disks instead of partitions destroys the GPT label of the disks used.

Steps to reproduce: Set up a software RAID, raid level 0 or 1, based on complete disks:

raiddev /dev/md0
    raid-level              0
    nr-raid-disks           2
    persistent-superblock   1
    chunk-size              16
    device                  /dev/sdb
    raid-disk               0
    device                  /dev/sdc
    raid-disk               1

Create the raid device, create a filesystem of your choice (I worked with ext2), and put some I/O load on the filesystem. After unmounting the raid device, try to print the partition table using 'parted'; you will get:

[EMAIL PROTECTED] /]# parted /dev/sdb print
Primary GPT is invalid, using alternate GPT.
Error: This disk contains a valid Alternate GUID Partition Table but the Primary GPT and Protective MBR are invalid.  This generally means that the disk had GPT partitions on it, but then a legacy partition editing tool was used to change the partition table stored in the MBR.  Which data is valid, GPT or MBR?  Yes will assume that the GPT information is correct, and will rewrite the PMBR and Primary GPT.  No will assume that the MBR is correct, and erase the GPT information.  Ignore will assume that the MBR is correct, but not change the disk.
Yes No Ignore ?

The same setup works fine if I use partitions instead of the whole disk:
raiddev /dev/md0
    raid-level              0
    nr-raid-disks           2
    persistent-superblock   1
    chunk-size              16
    device                  /dev/sdb1
    raid-disk               0
    device                  /dev/sdc1
    raid-disk               1

[EMAIL PROTECTED] /]# parted /dev/sdb print
Disk geometry for /dev/sdb: 0.000-34688.000 megabytes
Disk label type: GPT
Minor    Start      End          Filesystem    Name    Flags
1        0.017      34687.500    ext2                  lba

Regards
Alex

--- You are receiving this mail because: ---
You are on the CC list for the bug, or are watching someone who is.
Re: [PATCH md 6 of 6] Fix handling of overlapping requests in raid5
NeilBrown [EMAIL PROTECTED] wrote:

> + retry:
> 	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
> 	if (sh) {
> -
> -		while (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
> -			/* add failed due to overlap. Flush everything
> +		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
> +			/* Add failed due to overlap. Flush everything
> 			 * and wait a while
> -			 * FIXME - overlapping requests should be handled better
> 			 */
> 			raid5_unplug_device(mddev->queue);
> -			set_current_state(TASK_UNINTERRUPTIBLE);
> -			schedule_timeout(1);
> +			release_stripe(sh);
> +			schedule();
> +			goto retry;

Worrisome. If the calling process has SCHED_RR or SCHED_FIFO policy, this could cause a lockup, perhaps. Some sort of real synchronisation scheme would be nicer. Or at the least, what was wrong with the schedule_timeout(1)?
Re: [PATCH md 6 of 6] Fix handling of overlapping requests in raid5
Neil Brown [EMAIL PROTECTED] wrote:

> On Monday February 7, [EMAIL PROTECTED] wrote:
> > NeilBrown [EMAIL PROTECTED] wrote:
> > > + retry:
> > > 	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
> > > 	if (sh) {
> > > -
> > > -		while (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
> > > -			/* add failed due to overlap. Flush everything
> > > +		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
> > > +			/* Add failed due to overlap. Flush everything
> > > 			 * and wait a while
> > > -			 * FIXME - overlapping requests should be handled better
> > > 			 */
> > > 			raid5_unplug_device(mddev->queue);
> > > -			set_current_state(TASK_UNINTERRUPTIBLE);
> > > -			schedule_timeout(1);
> > > +			release_stripe(sh);
> > > +			schedule();
> > > +			goto retry;
> >
> > Worrisome. If the calling process has SCHED_RR or SCHED_FIFO policy, this could cause a lockup, perhaps.
>
> Is that just that I should have the prepare_to_wait after the retry: label rather than before, or is there something more subtle?

umm, yes. We always exit from schedule() in state TASK_RUNNING, so we need to run prepare_to_wait() each time around. See __wait_on_bit() for a canonical example.

> ### Diffstat output
>  ./drivers/md/raid5.c |    2 +-
>  1 files changed, 1 insertion(+), 1 deletion(-)
>
> diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
> --- ./drivers/md/raid5.c~current~	2005-02-07 13:34:23.0 +1100
> +++ ./drivers/md/raid5.c	2005-02-08 13:34:38.0 +1100
> @@ -1433,9 +1433,9 @@ static int make_request (request_queue_t
>  			(unsigned long long)new_sector,
>  			(unsigned long long)logical_sector);
>
> + retry:
>  	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
> - retry:
>  	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
>  	if (sh) {
>  		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {

You missed one.
--- 25/drivers/md/raid5.c~md-fix-handling-of-overlapping-requests-in-raid5-fix	2005-02-07 19:04:18.0 -0800
+++ 25-akpm/drivers/md/raid5.c	2005-02-07 19:04:48.0 -0800
@@ -1433,9 +1433,8 @@ static int make_request (request_queue_t
 			(unsigned long long)new_sector,
 			(unsigned long long)logical_sector);
 
-	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-
  retry:
+	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
 	if (sh) {
 		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
diff -puN drivers/md/raid6main.c~md-fix-handling-of-overlapping-requests-in-raid5-fix drivers/md/raid6main.c
--- 25/drivers/md/raid6main.c~md-fix-handling-of-overlapping-requests-in-raid5-fix	2005-02-07 19:04:18.0 -0800
+++ 25-akpm/drivers/md/raid6main.c	2005-02-07 19:04:54.0 -0800
@@ -1593,9 +1593,8 @@ static int make_request (request_queue_t
 			(unsigned long long)new_sector,
 			(unsigned long long)logical_sector);
 
-	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-
  retry:
+	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
 	if (sh) {
 		if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
_

(That piece of the code looks pretty fugly in an 80-col xterm, btw.)
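Andrew's point about running prepare_to_wait() on every pass has a direct userspace analogue: a wakeup by itself proves nothing, so the waiter must re-arm and re-check its predicate under the lock each time around the loop. A minimal pthreads sketch of the same pattern (illustrative only; `slot` and its `busy` flag are invented stand-ins for the stripe-overlap wait, not kernel code):

```c
#include <pthread.h>
#include <stdbool.h>

/* Invented stand-in for the stripe-overlap wait: `busy` plays the role
 * of "an overlapping bio is already queued against this stripe". */
struct slot {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            busy;
};

void slot_acquire(struct slot *s)
{
    pthread_mutex_lock(&s->lock);
    /* Re-check the predicate after EVERY wakeup, just as the kernel
     * loop must call prepare_to_wait() on every pass: schedule()
     * always returns in TASK_RUNNING, and pthread_cond_wait() may
     * wake spuriously. */
    while (s->busy)
        pthread_cond_wait(&s->cond, &s->lock);
    s->busy = true;
    pthread_mutex_unlock(&s->lock);
}

void slot_release(struct slot *s)
{
    pthread_mutex_lock(&s->lock);
    s->busy = false;
    pthread_cond_signal(&s->cond);   /* wake one waiter to retry */
    pthread_mutex_unlock(&s->lock);
}
```

The `goto retry` in the patch corresponds to this `while` loop, and moving the `retry:` label above prepare_to_wait() is exactly what makes the re-arm happen on every iteration rather than only the first.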