Checksums wrong on one disk of mirror
I recently installed a server with mirrored disks using software RAID. Everything was working fine for a few days until a normal reboot (not the first). Now the machine will not boot because the superblock appears to be wrong on some of the RAID devices on the first disk. The rough layout of the disks (sda and sdb):

sdx1 (md0) - /
sdx2 (md1) - /var
sdx3 (md2) - /usr
extended partition with swap
sdx6 (md3) - /opt

The exact error is:

"invalid superblock checksum on sda3
sda3 has invalid sb, not importing!"

Booting into a live CD, mdadm -E /dev/sdaX shows that the checksum is not what would be expected for sda1,2,3 but is fine for sda6. All of the checksums on drive sdb are correct. The state is "clean" for all partitions, with working 2, active 2 and failed 0. The table for sdb1,2,3 shows that the first device has been removed and is no longer an active mirror. What is the best way to proceed here? Can I somehow sync from the second disk, which appears to have the correct checksums? Is there an easy way to fix this that won't involve losing the data? Thanks. - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Checksums wrong on one disk of mirror
Quoting Neil Brown <[EMAIL PROTECTED]>:

On Tuesday November 7, [EMAIL PROTECTED] wrote: Booting into a live CD, mdadm -E /dev/sdaX shows that the checksum is not what would be expected for sda1,2,3 but is fine for sda6. All of the checksums on drive sdb are correct.

I'm surprised it doesn't boot then. How are the arrays being assembled? A more complete kernel log would help.

Neil, thanks for such a quick reply. I will post the kernel logs if the below is not enough information. The old dmesg should also still be on the partition.

The state is "clean" for all partitions, working 2, active 2 and failed 0. The table for sdb1,2,3 shows that the first device has been removed and is no longer an active mirror. What is the best way to proceed here? Can I somehow sync from the second disk, which appears to have the correct checksums? Is there an easy way to fix this that won't involve losing the data?

While booted from the live CD you should be able to

  mdadm -AR /dev/md0 /dev/sdb1
  mdadm /dev/md0 --add /dev/sda1

Fantastic, this works well for two of the partitions. However, the third has a bad sector (as reported by smartmontools) on the disk with the "good" superblock. The disk cannot read the sector, so the syncing fails and starts over at 15.7% each time. Is it safe to mount that partition outside of the md, find the file, and remove it so that the disk can remap that sector (it is shown as Currently_Pending in SMART right now), then resync the array? I guess this will cause problems and break the mirror. Or is the correct way to remove the "bad" superblock drive from the array, mount the md, remove the file, then resync the array? If it is possible to do either of the above, how do I stop the recovery? It now starts automatically at live CD boot, repeating from 15.7% over and over.
My knowledge of the tools is bad, but I tried the following:

  # mdadm /dev/md0 --remove /dev/sda1

and

  # mdadm -f /dev/md0 --remove /dev/sda1

(no idea if the -f even makes sense there)

It is very odd that the checksums are all wrong though. Kernel version? mdadm version? hardware architecture?

Kernel installed from Ubuntu 6.06 sources, 2.6.15. The machine is an x86 Dell with two identical Maxtor DiamondMax drives on an Intel 82801 SATA controller. mdadm is version 1.12. Looking at the most recently available version this seems incredibly out of date, but it seems to be the default installed in Ubuntu. Even Debian stable seems to have 1.9. I can file a bug with them for an update if necessary.

Is it possible that a broken init script has tried to fsck an individual drive instead of the md? /etc/fstab only uses /dev/md* references, but I'll check other scripts when (if? :) I get the system back up and running. Whilst the machine is not critical and is only a new install, I'd like to keep fighting rather than give in if possible.

Thanks, David
Re: Checksums wrong on one disk of mirror
Quoting David <[EMAIL PROTECTED]>:

Or is the correct way to remove the "bad" superblock drive from the array, mount the md, remove the file then resync the array?

Common sense says this is correct.

If it is possible to do either of the above, how do I stop the recovery? It now starts automatically at live CD boot, repeating from 15.7% over and over. My knowledge of the tools is bad but I tried the following: # mdadm /dev/md0 --remove /dev/sda1 and # mdadm -f /dev/md0 --remove /dev/sda1 (no idea if the -f even makes sense there)

Looking at http://smartmontools.sourceforge.net/BadBlockHowTo.txt I tried to figure out what file was in the bad blocks, but it turned out there wasn't one; it was just unused space. My fix, for completeness, was this:

Force failure of the corrupt half of the mirror:

  # mdadm --manage /dev/md0 --fail /dev/sda

Mount the other one and fill the free space with zeros:

  # mount /dev/md0 /mnt/test
  # dd if=/dev/zero of=/mnt/test/bigfile
  # sync

smartctl now showed that the pending sector had been reallocated, so I removed the bigfile and hot-added the other drive:

  # mdadm --manage /dev/md0 --add /dev/sda

The recovery went fine this time and both partitions were shown as correct and active. I had to fsck another md before it would boot correctly, but the machine is now back up and working correctly. Thanks for your help previously; it helped me along the right lines to start fixing this one. David
Swap initialised as an md?
I have two devices mirrored which are partitioned like this:

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *          63    30716279    15358108+  fd  Linux raid autodetect
/dev/sda2        30716280    71682029    20482875   fd  Linux raid autodetect
/dev/sda3        71682030   112647779    20482875   fd  Linux raid autodetect
/dev/sda4       112647780   156248189    21800205    5  Extended
/dev/sda5       112647843   122881184     5116671   82  Linux swap / Solaris
/dev/sda6       122881248   156248189    16683471   fd  Linux raid autodetect

My aim was to have the two swap partitions both mounted, with no RAID (as I didn't see any benefit to that, but if I'm wrong then I'd appreciate being told!). However, sda5 seems to be recognised as an md anyway at boot, so swapon does not work correctly. When the partitions are initialised with mkswap, the RAID array is confused and refuses to boot until the superblocks are fixed. At boot, the kernel says:

[17179589.184000] md: md3 stopped.
[17179589.184000] md: bind<sda5>
[17179589.188000] md: bind<sdb5>
[17179589.188000] raid1: raid set md3 active with 2 out of 2 mirrors

Then /proc/mdstat says:

md3 : active raid1 sda5[0] sdb5[1]
      5116544 blocks [2/2] [UU]

In /etc/mdadm/mdadm.conf, the following is present, which was created by the installer and only lists 4 arrays. In actual fact sdx6 is recognised as the fifth array, md4.

DEVICE partitions
ARRAY /dev/md3 level=raid1 num-devices=2 UUID=75575384:5fbe10ed:a5a46544:209740b3
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=5d133655:1d034197:c1c19528:56cc420a
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=2cda8230:b2fde7b4:97082351:880c918a
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=7f9abf32:c86071fd:3df4db9d:26ddd001

As /etc is on md0 I doubt this configuration file has anything to do with the kernel recognising the arrays and setting them active. However, is there any reason that the swap partitions (which have the correct partition type) are initialised as an md? Can I stop it somehow, or is the correct method to have them as an md, with the md initialised as swap?
Brief details are the same as in my previous mails last week: 2.6.15, mdadm 1.12.0 (on md0, so I can't see that it is at fault).

Thanks, David
Re: Software raid0 will crash the file-system, when each disk is 5TB
On Wed, 16 May 2007, Bill Davidsen wrote: Jeff Zheng wrote: Here is the information of the created raid0. Hope it is enough.

If I read this correctly, the problem is with JFS rather than RAID?

He had the same problem with xfs.

David Lang
RE: Software raid0 will crash the file-system, when each disk is 5TB
On Thu, 17 May 2007, Neil Brown wrote: On Thursday May 17, [EMAIL PROTECTED] wrote:

The only difference of any significance between the working and non-working configurations is that in the non-working, the component devices are larger than 2Gig, and hence have sector offsets greater than 32 bits.

Do you mean 2T here? But in both configurations, the component devices are larger than 2T (2.25T & 5.5T).

Yes, I meant 2T, and yes, the components are always over 2T. 2T decimal or 2T binary? So I'm at a complete loss. The raid0 code follows the same paths and does the same things and uses 64-bit arithmetic where needed. So I have no idea how there could be a difference between these two cases. I'm at a loss...

NeilBrown - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Wed, 30 May 2007, David Chinner wrote: On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote: David Chinner wrote: The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk, and the current barrier behaviour gives us that.

Barrier != synchronous write,

Of course. FYI, XFS only issues barriers on *async* writes. But barrier semantics - as far as they've been described by everyone but you - indicate that the barrier write is guaranteed to be on stable storage when it returns.

This doesn't match what I have seen with barriers. It's perfectly legal to have the following sequence of events:

1. app writes block 10 to OS
2. app writes block 4 to OS
3. app writes barrier to OS
4. app writes block 5 to OS
5. app writes block 20 to OS
6. OS writes block 4 to disk drive
7. OS writes block 10 to disk drive
8. OS writes barrier to disk drive
9. OS writes block 5 to disk drive
10. OS writes block 20 to disk drive
11. disk drive writes block 10 to platter
12. disk drive writes block 4 to platter
13. disk drive writes block 20 to platter
14. disk drive writes block 5 to platter

There is nothing that says that when the app finishes step #3 the OS has even sent the data to the drive, let alone that the drive has flushed it to a platter. If the disk drive doesn't support barriers then step #8 becomes 'issue flush' and steps 11 and 12 take place before steps #9, 13 and 14.

David Lang
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Wed, 30 May 2007, David Chinner wrote: On Tue, May 29, 2007 at 05:01:24PM -0700, [EMAIL PROTECTED] wrote: On Wed, 30 May 2007, David Chinner wrote: On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote: David Chinner wrote: The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk, and the current barrier behaviour gives us that.

Barrier != synchronous write,

Of course. FYI, XFS only issues barriers on *async* writes. But barrier semantics - as far as they've been described by everyone but you - indicate that the barrier write is guaranteed to be on stable storage when it returns.

This doesn't match what I have seen with barriers. It's perfectly legal to have the following sequence of events:

1. app writes block 10 to OS
2. app writes block 4 to OS
3. app writes barrier to OS
4. app writes block 5 to OS
5. app writes block 20 to OS

Hm - applications can't issue barriers to the filesystem. However, if you consider the barrier to be an "fsync()" for example, then it's still the filesystem that is issuing the barrier, and there's a block that needs to be written that is associated with that barrier (either an inode or a transaction commit) that needs to be on stable storage before the filesystem returns to userspace.

6. OS writes block 4 to disk drive
7. OS writes block 10 to disk drive
8. OS writes barrier to disk drive
9. OS writes block 5 to disk drive
10. OS writes block 20 to disk drive

Replace OS with filesystem, and combine 7+8 together - we don't have zero-length barriers, and hence they are *always* associated with a write to a certain block on disk. i.e.:

1. FS writes block 4 to disk drive
2. FS writes block 10 to disk drive
3. FS writes *barrier* block X to disk drive
4. FS writes block 5 to disk drive
5. FS writes block 20 to disk drive

The order in which these are expected by the filesystem to hit stable storage is:

1. blocks 4 and 10 on stable storage, in any order
2. barrier block X on stable storage
3. blocks 5 and 20 on stable storage, in any order

The point I'm trying to make is that in XFS, blocks 5 and 20 cannot be allowed to hit the disk before the barrier block, because they have a strict order dependency on block X being stable before them, just like block X has a strict order dependency that blocks 4 and 10 must be stable before we start the barrier block write.

11. disk drive writes block 10 to platter
12. disk drive writes block 4 to platter
13. disk drive writes block 20 to platter
14. disk drive writes block 5 to platter

if the disk drive doesn't support barriers then step #8 becomes 'issue flush' and steps 11 and 12 take place before steps #9, 13, 14

No, you need a flush on either side of the block X write to maintain the same semantics as barrier writes currently have. We have filesystems that require barriers to prevent reordering of writes in both directions, and to ensure that the block associated with the barrier is on stable storage when I/O completion is signalled. The existing barrier implementation (where it works) provides these requirements. We need barriers to retain these semantics, otherwise we'll still have to do special stuff in the filesystems to get the semantics that we need.

One of us is misunderstanding barriers here. You are understanding barriers to be the same as synchronous writes (and therefore the data is on persistent media before the call returns). I am understanding barriers to only indicate ordering requirements: things before the barrier can be reordered freely, things after the barrier can be reordered freely, but things cannot be reordered across the barrier. If I am understanding it correctly, the big win for barriers is that you do NOT have to stop and wait until the data is on persistent media before you can continue.
In the past barriers have not been fully implemented in most cases, and as a result they have been simulated by forcing a full flush of the buffers to persistent media before any other writes are allowed. This has made them _in practice_ operate the same way as synchronous writes (matching your understanding), but the current thread is talking about fixing the implementation to the official semantics for all hardware that can actually support barriers (and fixing it at the OS level).

David Lang
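For what it's worth, the ordering-only reading argued for above can be written down as a small check: writes may be reordered freely within the segments between barriers, but never across one. This is purely an illustrative Python model of those semantics, not anything taken from the block-layer code.

```python
def legal_order_respects_barriers(submitted, completed):
    """Check that no write crossed a barrier boundary.

    submitted: block numbers in submission order, with 'B' marking a barrier.
    completed: the order in which the blocks actually reached the platter.
    """
    # split the submission stream into segments separated by barriers
    segments, current = [], []
    for op in submitted:
        if op == 'B':
            segments.append(current)
            current = []
        else:
            current.append(op)
    segments.append(current)

    pos = {blk: i for i, blk in enumerate(completed)}
    # every block in a segment must complete before every block in any
    # later segment; within a segment, any order is fine
    for earlier, later in zip(segments, segments[1:]):
        for a in earlier:
            for b in later:
                if pos[a] > pos[b]:
                    return False
    return True

# the sequence from the example: 10, 4, barrier, 5, 20
submitted = [10, 4, 'B', 5, 20]
# reordering within a segment is legal...
print(legal_order_respects_barriers(submitted, [4, 10, 20, 5]))  # True
# ...but completing block 5 before block 10 crosses the barrier
print(legal_order_respects_barriers(submitted, [4, 5, 10, 20]))  # False
```

Under this model nothing is ever required to be on persistent media when the barrier call returns; only the relative order is constrained.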
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, 31 May 2007, Jens Axboe wrote: On Thu, May 31 2007, Phillip Susi wrote: David Chinner wrote: That sounds like a good idea - we can leave the existing WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED behaviour that only guarantees ordering. The filesystem can then choose which to use where appropriate.

So what if you want a synchronous write, but DON'T care about the order? They need to be two completely different flags which you can choose to combine, or use individually.

If you have a use case for that, we can easily support it as well... Depending on the drive capabilities (FUA support or not), it may be nearly as slow as a "real" barrier write.

True, but a "real" barrier write could have significant side effects on other writes that wouldn't happen with a synchronous write (a sync write can have other, unrelated writes reordered around it; a barrier write can't).

David Lang
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, 1 Jun 2007, Tejun Heo wrote: but one thing we should bear in mind is that harddisks don't have humongous caches or a very smart controller / instruction set. No matter how relaxed an interface the block layer provides, in the end it just has to issue a whole-sale FLUSH CACHE on the device to guarantee data ordering on the media.

If you are talking about individual drives you may be right for the moment (though 16M of cache on drives is a _lot_ larger than people imagined would be there a few years ago), but when you consider self-contained disk arrays it's an entirely different story. You can easily have a few gig of cache and a complete OS pretending to be a single drive as far as you are concerned. And the price of such devices is plummeting (in large part thanks to Linux moving into this space); you can now readily buy a 10TB array for $10k that looks like a single drive.

David Lang
limits on raid
What is the limit for the number of devices that can be in a single array? I'm trying to build a 45x750G array and want to experiment with the different configurations. I'm trying to start with raid6, but mdadm is complaining about an invalid number of drives.

David Lang
Re: limits on raid
On Fri, 15 Jun 2007, Neil Brown wrote: On Thursday June 14, [EMAIL PROTECTED] wrote: what is the limit for the number of devices that can be in a single array? I'm trying to build a 45x750G array and want to experiment with the different configurations. I'm trying to start with raid6, but mdadm is complaining about an invalid number of drives. David Lang

"man mdadm", search for "limits". (forgive typos).

Thanks. Why does it still default to the old format after so many new versions? (By the way, the documentation said 28 devices, but I couldn't get it to accept more than 27.) It's now churning away 'rebuilding' the brand new array.

A few questions/thoughts: why does it need to do a rebuild when making a new array? Couldn't it just zero all the drives instead? (Or better still, just record most of the space as 'unused' and initialize it as it starts using it?)

While I consider zfs to be ~80% hype, one advantage it could have (but I don't know if it has) is that since the filesystem and RAID are integrated into one layer, they can optimize the case where files are being written onto unallocated space: instead of reading blocks from disk to calculate the parity, they could just put zeros in the unallocated space, potentially speeding up the system by reducing the amount of disk I/O. This wouldn't work if the filesystem is crowded, but a lot of large arrays are used for storing large files (i.e. sequential writes of large amounts of data), and it would seem that this could be a substantial win in these cases. Is there any way that Linux would be able to do this sort of thing? Or is it impossible due to the layering preventing the necessary knowledge from being in the right place?

David Lang
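The zero-fill idea rests on a property of RAID5-style parity: parity is a byte-wise XOR across the stripe, and XOR with zero is a no-op, so all-zero unallocated chunks contribute nothing. A toy Python sketch of that property (illustrative only, not md's actual code):

```python
def xor_parity(blocks):
    """Byte-wise XOR across equal-length data blocks (RAID5-style parity)."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

# two chunks the filesystem actually wrote...
written = [bytes([7, 1, 255, 0]), bytes([3, 3, 3, 3])]
# ...and the rest of the stripe, known to be zero-filled
zero_fill = [bytes(4), bytes(4)]

# parity over just the written chunks equals parity over the full
# stripe, so the zero chunks would never need to be read from disk
assert xor_parity(written) == xor_parity(written + zero_fill)
print(xor_parity(written).hex())  # 0402fc03
```

This is exactly why a layer that *knows* which blocks are unallocated can skip the read half of a read-modify-write; a plain block-level RAID can't know that.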
Re: limits on raid
On Sat, 16 Jun 2007, Neil Brown wrote: It would be possible to have a 'this is not initialised' flag on the array, and if that is not set, always do a reconstruct-write rather than a read-modify-write. But the first time you have an unclean shutdown you are going to resync all the parity anyway (unless you have a bitmap), so you may as well resync at the start. And why is it such a big deal anyway? The initial resync doesn't stop you from using the array. I guess if you wanted to put an array into production instantly and couldn't afford any slowdown due to resync, then you might want to skip the initial resync, but is that really likely?

In my case it takes 2+ days to resync the array before I can do any performance testing with it. For some reason it's only doing the rebuild at ~5M/sec (even though I've increased the min and max rebuild speeds, and a dd to the array seems to be ~44M/sec, even during the rebuild). I want to test several configurations, from a 45-disk raid6 to a 45-disk raid0. At 2-3 days per test (or longer, depending on the tests) this becomes a very slow process.

Also, when a rebuild is slow enough (and has enough of a performance impact) it's not uncommon to want to operate in degraded mode just long enough to get to a maintenance window, and then recreate the array and reload from backup.

David Lang
Re: limits on raid
On Sat, 16 Jun 2007, David Greaves wrote: [EMAIL PROTECTED] wrote: On Sat, 16 Jun 2007, Neil Brown wrote: I want to test several configurations, from a 45 disk raid6 to a 45 disk raid0. at 2-3 days per test (or longer, depending on the tests) this becomes a very slow process.

Are you suggesting the code that is written to enhance data integrity is optimised (or even touched) to support this kind of test scenario? Seriously? :)

Actually, if it can be done without a huge impact on the maintainability of the code, I think it would be a good idea, for the simple reason that the increased experimentation would result in people finding out what raid level is really appropriate for their needs. There is a _lot_ of confusion around about what the performance implications of the different raid levels are (especially when you consider things like raid 10/50/60, where you have two layers combined), and anything that encourages experimentation would be a good thing.

also, when a rebuild is slow enough (and has enough of a performance impact) it's not uncommon to want to operate in degraded mode just long enough to get to a maintenance window and then recreate the array and reload from backup.

so would mdadm --remove the rebuilding disk help?

No. Let me try again. A drive fails Monday morning.

Scenario 1: replace the failed drive and start the rebuild. The system will be slow (degraded mode + rebuild) for the next three days.

Scenario 2: leave it in degraded mode until Monday night (accepting the speed penalty for degraded mode, but not the rebuild penalty). Monday night, shut down the system, put in the new drive, reinitialize the array, and reload the system from backup. The system is back to full speed Tuesday morning.
Scenario 2 isn't supported with md today, although it sounds as if the skip-rebuild idea could do this, except for raid 5. On my test system the rebuild says it's running at 5M/s; a dd to a file on the array says it's doing 45M/s (even while the rebuild is running), so it seems to me that there may be value in this approach.

David Lang
Re: limits on raid
On Sun, 17 Jun 2007, Wakko Warner wrote: you can also easily move an ext3 journal to an external journal with tune2fs (see man page). I only have 2 ext3 file systems (one of which is mounted R/O since it's full); all my others are reiserfs (v3). What benefit would I gain by using an external journal, and how big would it need to be?

If you have the journal on a drive by itself, you end up doing (almost) sequential reads and writes to the journal, and the disk head doesn't need to move much. This can greatly increase your write speeds since:

1. the journal gets written faster (completing the write as far as your software is concerned)
2. the heads don't need to seek back and forth from the journal to the final location where the data gets written.

As for how large it should be, it all depends on the volume of your writes; once the journal fills up, all writes stall until space is freed in the journal. IIRC ext3 is limited to 128M; with today's drive sizes I don't see any reason to make it any smaller.

David Lang
Re: limits on raid
On Sun, 17 Jun 2007, dean gaudet wrote: On Sun, 17 Jun 2007, Wakko Warner wrote: What benefit would I gain by using an external journal and how big would it need to be?

i don't know how big the journal needs to be... i'm limited by xfs' maximum journal size of 128MiB. i don't have much benchmark data -- but here are some rough notes i took when i was evaluating a umem NVRAM card. since the pata disks in the raid1 have write caching enabled it's somewhat of an unfair comparison, but the important info is the 88 seconds for internal journal vs. 81 seconds for external journal. if you turn on disk write caching the difference will be much larger. -dean

  time sh -c 'tar xf /var/tmp/linux-2.6.20.tar; sync'

I know that sync will force everything to get as far as the journal, but will it force the journal to be flushed?

David Lang

xfs journal   raid5 bitmap   times
internal      none           0.18s user 2.14s system 2% cpu 1:27.95 total
internal      internal       0.16s user 2.16s system 1% cpu 2:01.12 total
raid1         none           0.07s user 2.02s system 2% cpu 1:20.62 total
raid1         internal       0.14s user 2.01s system 1% cpu 1:55.18 total
raid1         raid1          0.14s user 2.03s system 2% cpu 1:20.61 total
umem          none           0.13s user 2.07s system 2% cpu 1:20.77 total
umem          internal       0.15s user 2.16s system 2% cpu 1:51.28 total
umem          umem           0.12s user 2.13s system 2% cpu 1:20.50 total

raid5:
- 4x seagate 7200.10 400GB on marvell MV88SX6081
- mdadm --create --level=5 --raid-devices=4 /dev/md4 /dev/sd[abcd]1

raid1:
- 2x maxtor 6Y200P0 on 3ware 7504
- two 128MiB partitions starting at cyl 1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md1 /dev/sd[fg]1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md2 /dev/sd[fg]2
- md1 is used for external xfs journal
- md2 has an ext3 filesystem for the external md4 bitmap

xfs:
- mkfs.xfs issued before each run using the defaults (aside from -l logdev=/dev/md1)
- mount -o noatime,nodiratime[,logdev=/dev/md1]

umem:
- 512MiB Micro Memory MM-5415CN
- 2 partitions similar to the raid1 setup
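Pulling dean's headline comparison out of the no-bitmap rows (total wall clock, internal journal vs. journal on the separate raid1), a quick Python check of the 88s vs. 81s figures:

```python
def to_seconds(clock):
    """Convert a shell `time` total like '1:27.95' to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)

internal = to_seconds("1:27.95")   # raid5 data, internal xfs journal, no bitmap
external = to_seconds("1:20.62")   # raid5 data, journal on the raid1, no bitmap
speedup_pct = 100 * (internal - external) / internal

print(round(internal), round(external))   # 88 81
print(round(speedup_pct, 1))              # 8.3
```

So the external journal saves roughly 8% on this untar workload, even with the unfair write-cache setup dean mentions.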
Re: limits on raid
On Mon, 18 Jun 2007, Brendan Conoboy wrote: [EMAIL PROTECTED] wrote: in my case it takes 2+ days to resync the array before I can do any performance testing with it. for some reason it's only doing the rebuild at ~5M/sec (even though I've increased the min and max rebuild speeds and a dd to the array seems to be ~44M/sec, even during the rebuild)

With performance like that, it sounds like you're saturating a bus somewhere along the line. If you're using scsi, for instance, it's very easy for a long chain of drives to overwhelm a channel. You might also want to consider some other RAID layouts like 1+0 or 5+0, depending upon your space vs. reliability needs.

I plan to test the different configurations. However, if I were saturating the bus with the reconstruct, how could I fire off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the reconstruct to ~4M/sec? I'm putting 10x as much data through the bus at that point, which would seem to prove that it's not the bus that's saturated.

David Lang
Re: limits on raid
On Mon, 18 Jun 2007, Lennart Sorensen wrote: On Mon, Jun 18, 2007 at 10:28:38AM -0700, [EMAIL PROTECTED] wrote: I plan to test the different configurations. However, if I were saturating the bus with the reconstruct, how could I fire off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the reconstruct to ~4M/sec? I'm putting 10x as much data through the bus at that point, which would seem to prove that it's not the bus that's saturated.

dd at 45MB/s from the raid sounds reasonable. If you have 45 drives, doing a resync of raid5 or raid6 should probably involve reading all the disks and writing new parity data to one drive. So if you are writing 5MB/s, then you are reading 44*5MB/s from the other drives, which is 220MB/s. If your resync drops to 4MB/s when doing dd, then you have 44*4MB/s, which is 176MB/s, or 44MB/s less read capacity -- which surprisingly seems to match the dd speed you are getting. Seems like you are indeed very much saturating a bus somewhere. The numbers certainly agree with that theory. What kind of setup are the drives connected to?

Simple ultra-wide SCSI to a single controller. I didn't realize that the rate reported by /proc/mdstat was the write speed taking place; I thought it was the total data rate (reads + writes). The next time this message gets changed, it would be a good thing to clarify this.

David Lang
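Lennart's back-of-the-envelope numbers, spelled out in a few lines of Python (using the thread's assumption that mdstat reports the per-drive write rate while the other 44 members are read in parallel):

```python
# mdstat shows the write rate to the drive being rebuilt; on a 45-drive
# array that implies parallel reads from the other 44 members.
drives = 45
resync_write_mb_s = 5
read_traffic = (drives - 1) * resync_write_mb_s
print(read_traffic)                    # 220 (MB/s of reads crossing the bus)

# during the dd test the resync drops to 4 MB/s...
read_during_dd = (drives - 1) * 4
print(read_traffic - read_during_dd)   # 44 -- roughly the observed dd rate
```

The freed-up read bandwidth (44 MB/s) matching the ~45 MB/s dd throughput is what makes the saturated-bus theory convincing.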
Re: limits on raid
On Mon, 18 Jun 2007, Brendan Conoboy wrote: [EMAIL PROTECTED] wrote: I plan to test the different configurations. However, if I were saturating the bus with the reconstruct, how could I fire off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the reconstruct to ~4M/sec? I'm putting 10x as much data through the bus at that point, which would seem to prove that it's not the bus that's saturated.

I am unconvinced. If you take ~1MB/s for each active drive and add in SCSI overhead, 45M/sec seems reasonable. Have you looked at a running iostat while all this is going on? Try it out: add up the kb/s from each drive and see how close you are to your maximum theoretical IO.

I didn't try iostat; I did look at vmstat, and there the numbers look even worse: the bo column is ~500 for the resync by itself, but with the dd it's ~50,000. When I get access to the box again I'll try iostat to get more details.

Also, how's your CPU utilization?

~30% of one cpu for the raid6 thread, ~5% of one cpu for the resync thread.

David Lang
Re: limits on raid
On Mon, 18 Jun 2007, Lennart Sorensen wrote:

On Mon, Jun 18, 2007 at 11:12:45AM -0700, [EMAIL PROTECTED] wrote: simple ultra-wide SCSI to a single controller.

Hmm, isn't ultra-wide limited to 40MB/s? Is it Ultra320 wide? That could do a lot more, and 220MB/s sounds plausible for 320 scsi.

yes, sorry, ultra 320 wide.

I didn't realize that the rate reported by /proc/mdstat was the write speed that was taking place; I thought it was the total data rate (reads + writes). The next time this message gets changed it would be a good thing to clarify this.

Well, I suppose it could make sense to show the rate of rebuild, which you can then compare against the total size of the raid, or you can show the rate of write, which you then compare against the size of the drive being synced. Certainly I would expect much higher speeds if it was the overall raid size, while the numbers seem pretty reasonable as a write speed. 4MB/s would take forever if it was the overall raid resync speed. I usually see SATA raid1 resync at 50 to 60MB/s or so, which matches the read and write speeds of the drives in the raid.

as I read it right now, what happens is the worst of the options: you show the total size of the array for the amount of work that needs to be done, but then show only the write speed for the rate of progress being made through the job. total rebuild time was estimated at ~3200 min.

David Lang
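The ~3200 minute figure is consistent with reading the /proc/mdstat rate as a per-member write speed. A back-of-the-envelope check (the ~750 GB member size is an assumption of mine; the thread does not state the drive capacity):

```python
# Rebuild time if the /proc/mdstat rate is the per-member write speed,
# not the aggregate array throughput.
def rebuild_minutes(member_gb, write_mb_s):
    """Minutes to write one member at the given rate (1 GB = 1000 MB here)."""
    return member_gb * 1000 / write_mb_s / 60

print(round(rebuild_minutes(750, 4)))  # -> 3125, in line with the ~3200 min estimate
```

Read as an aggregate array rate instead, 4MB/s over ~30TB would imply months, which is why the per-member interpretation is the plausible one.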
Re: limits on raid
On Mon, 18 Jun 2007, Brendan Conoboy wrote:

[EMAIL PROTECTED] wrote: yes, sorry, ultra 320 wide.

Exactly how many channels and drives?

one channel, 2 OS drives plus the 45 drives in the array. yes, I realize that there will be bottlenecks with this; the large capacity is to handle longer history (it's going to be a 30TB circular buffer being fed by a pair of OC-12 links). it appears that my big mistake was not understanding what /proc/mdstat is telling me.

David Lang
Re: limits on raid
On Mon, 18 Jun 2007, Wakko Warner wrote:

Subject: Re: limits on raid

[EMAIL PROTECTED] wrote: On Mon, 18 Jun 2007, Brendan Conoboy wrote: [EMAIL PROTECTED] wrote: yes, sorry, ultra 320 wide. Exactly how many channels and drives? one channel, 2 OS drives plus the 45 drives in the array.

Given that the drives only have 4 ID bits, how can you have 47 drives on 1 cable? You'd need a minimum of 3 channels for 47 drives. Do you have some sort of external box that holds X number of drives and only uses a single ID?

yes, I'm using promise drive shelves; I have them configured to export the 15 drives as 15 LUNs on a single ID. I'm going to be using this as a huge circular buffer that will just be overwritten eventually 99% of the time, but once in a while I will need to go back into the buffer and extract and process the data.

David Lang
Re: limits on raid
On Tue, 19 Jun 2007, Phillip Susi wrote:

[EMAIL PROTECTED] wrote: one channel, 2 OS drives plus the 45 drives in the array.

Huh? You can only have 16 devices on a scsi bus, counting the host adapter. And I don't think you can even manage that much reliably with the newer higher speed versions, at least not without some very special cables.

6 devices on the bus (2 OS drives, 3 promise drive shelves, controller card).

yes, I realize that there will be bottlenecks with this; the large capacity is to handle longer history (it's going to be a 30TB circular buffer being fed by a pair of OC-12 links).

Building one of those nice packet sniffers for the NSA to install on AT&Ts network eh? ;)

just for going back in time to track hacker actions at a bank. I'm hoping that once I figure out the drives, the rest of the software will basically boil down to tcpdump with the right options to write to a circular buffer of files.

David Lang
Re: limits on raid
On Tue, 19 Jun 2007, Lennart Sorensen wrote:

On Mon, Jun 18, 2007 at 02:56:10PM -0700, [EMAIL PROTECTED] wrote: yes, I'm using promise drive shelves; I have them configured to export the 15 drives as 15 LUNs on a single ID. I'm going to be using this as a huge circular buffer that will just be overwritten eventually 99% of the time, but once in a while I will need to go back into the buffer and extract and process the data.

I would guess that if you ran 15 drives per channel on 3 different channels, you would resync in 1/3 the time. Well, unless you end up saturating the PCI bus instead. hardware raid of course has an advantage there in that it doesn't have to go across the bus to do the work (although if you put 45 drives on one scsi channel on hardware raid, it will still be limited).

I fully realize that the channel will be the bottleneck, I just didn't understand what /proc/mdstat was telling me. I thought that it was telling me that the resync was processing 5M/sec, not that it was writing 5M/sec to each of the two parity locations.

David Lang
Re: limits on raid
On Thu, 21 Jun 2007, David Chinner wrote:

On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote: I have that - apparently naive - idea that drives use strong checksums, and will never return bad data, only good data or an error. If this isn't right, then it would really help to understand what the cause of other failures are before working out how to handle them.

The drive is not the only source of errors, though. You could have a path problem that is corrupting random bits between the drive and the filesystem. So the data on the disk might be fine, and reading it via a redundant path might be all that is needed.

one of the 'killer features' of zfs is that it does checksums of every file on disk, so many people don't consider the disk infallible. several other filesystems also do checksums. both bitkeeper and git do checksums of files to detect disk corruption. as David C points out, there are many points in the path where the data could get corrupted besides on the platter.

David Lang
Re: limits on raid
On Thu, 21 Jun 2007, Mattias Wadenstein wrote:

On Thu, 21 Jun 2007, Neil Brown wrote: I have that - apparently naive - idea that drives use strong checksums, and will never return bad data, only good data or an error. If this isn't right, then it would really help to understand what the cause of other failures are before working out how to handle them.

In theory, that's how storage should work. In practice, silent data corruption does happen. If not from the disks themselves, then somewhere along the path of cables, controllers, drivers, buses, etc. If you add in fcal, you'll get even more sources of failure, but usually you can avoid SANs (if you care about your data).

heh, the pitch I get from the self-proclaimed experts is that if you care about your data you put it on the san (so you can take advantage of the more expensive disk arrays, various backup advantages, and replication features that tend to be focused on the san because it's a big target).

David Lang

Well, here are a couple of the issues that I've seen myself: A hw-raid controller returning every 64th bit as 0, no matter what's on disk, with no error condition at all. (I've also heard from a colleague about this happening on every 64k, but I haven't seen that myself.) An fcal switch occasionally resetting, garbling the blocks in transit with random data. Lost a few TB of user data that way. Add to this the random driver breakage that happens now and then. I've also had a few broken filesystems due to in-memory corruption from bad ram; not sure there is much hope of fixing that, though. Also, this presentation is pretty worrying on the frequency of silent data corruption: https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

/Mattias Wadenstein
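The per-block checksumming that zfs (and, at the file level, git and bitkeeper) use to catch this kind of silent corruption is simple to sketch. This is illustrative only, not any of those tools' actual on-disk formats:

```python
import hashlib

def checksum_blocks(data, block_size=4096):
    """One SHA-256 digest per block of data, computed at write time."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def corrupt_blocks(data, digests, block_size=4096):
    """Indices of blocks whose current contents no longer match the
    stored digests -- i.e. silent corruption the drive never reported."""
    return [i for i, d in enumerate(checksum_blocks(data, block_size))
            if d != digests[i]]

original = b"A" * 8192                               # two 4 KiB blocks
sums = checksum_blocks(original)
damaged = original[:5000] + b"B" + original[5001:]   # flip one byte in block 1
print(corrupt_blocks(damaged, sums))                 # -> [1]
```

With a redundant copy (a mirror or parity), a detected mismatch like this can be repaired by re-reading the block from the other source, which is exactly the zfs self-healing pitch.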
Re: limits on raid
On Fri, 22 Jun 2007, David Greaves wrote:

That's not a bad thing - until you look at the complexity it brings - and then consider the impact and exceptions when you do, eg hardware acceleration? md information fed up to the fs layer for xfs? simple long term maintenance? Often these problems are well worth the benefits of the feature. I _wonder_ if this is one where the right thing is to "just say no" :)

In this case I think the advantage of a higher level system knowing which block sizes are efficient for writes/reads can potentially be a HUGE advantage. if the upper levels know that you have a 6 disk raid 6 array with a 64K chunk size, then reads and writes in 256K chunks (aligned) should be able to be done at basically the speed of a 4 disk raid 0 array. what's even more impressive is that this could be done even if the array is degraded (if you know the drives have failed, you don't even try to read from them, and you only have to reconstruct the missing info once per stripe).

the current approach doesn't give the upper levels any chance to operate in this mode; they just don't have enough information to do so. the part about wanting to know the raid 0 chunk size, so that the upper layers can be sure that data that's supposed to be redundant is on separate drives, is also possible.

storage technology is headed in the direction of having the system do more and more of the layout decisions, and re-stripe the array as conditions change (similar to what md can already do with enlarging raid5/6 arrays), but unless you want to eventually put all that decision logic into the md layer, you should make it possible for other layers to make queries to find out what's what, and then they can give directions for what they want to have happen.
so for several reasons I don't see this as something that's deserving of an automatic 'no'.

David Lang
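The 6-disk raid6 example above works out like this (a sketch of the alignment arithmetic only; the helper is hypothetical, not an md interface):

```python
CHUNK = 64 * 1024      # 64K chunk size, as in the example
N_DISKS = 6            # 6 disk raid 6

data_disks = N_DISKS - 2            # raid6 spends 2 chunks per stripe on parity
stripe_width = data_disks * CHUNK   # 4 * 64K = 256K of data per full stripe

def full_stripe_io(offset, length):
    """True when an I/O covers whole stripes, so no read-modify-write of
    parity is needed and the data disks stream like a 4 disk raid 0."""
    return offset % stripe_width == 0 and length % stripe_width == 0

print(stripe_width // 1024)                   # -> 256
print(full_stripe_io(0, 256 * 1024))          # -> True
print(full_stripe_io(64 * 1024, 256 * 1024))  # -> False (misaligned)
```

This is the information an upper layer would need to query from md to issue aligned, full-stripe I/O; without it, a 256K write at an arbitrary offset degrades into partial-stripe read-modify-write cycles.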
Re: limits on raid
On Fri, 22 Jun 2007, Bill Davidsen wrote:

By delaying parity computation until the first write to a stripe, only the growth of a filesystem is slowed, and all data are protected without waiting for the lengthy check. The rebuild speed can be set very low, because on-demand rebuild will do most of the work.

I'm very much for the fs layer reading the lower block structure so I don't have to fiddle with arcane tuning parameters - yes, *please* help make xfs self-tuning! Keeping life as straightforward as possible low down makes the upwards interface more manageable and that goal more realistic...

Those two paragraphs are mutually exclusive. The fs can be simple because it rests on a simple device, even if the "simple device" is provided by LVM or md. And LVM and md can stay simple because they rest on simple devices, even if they are provided by PATA, SATA, nbd, etc. Independent layers make each layer more robust. If you want to compromise the layer separation, some approach like ZFS with full integration would seem to be promising. Note that layers allow specialized features at each point, trading integration for flexibility. My feeling is that full integration and independent layers each have benefits; as you connect the layers to expose operational details, you need to handle changes in those details, which would seem to make layers more complex. What I'm looking for here is better performance in one particular layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel that the current performance suggests room for improvement.
they both have benefits, but it shouldn't have to be either-or.

if you build the separate layers and provide ways for the upper layers to query the lower layers to find out what's efficient, then you can have some upper layers that don't care about this and treat the lower layer as a simple block device, while other upper layers find out what sort of things are more efficient to do and use the same lower layer in a more complex manner.

the alternative is to duplicate effort (and code) to have two codebases that try to do the same thing, one stand-alone and one as part of an integrated solution (and it gets even worse if there end up being multiple integrated solutions).

David Lang
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
On Sun, 12 Aug 2007, Jan Engelhardt wrote:

On Aug 12 2007 13:35, Al Boldi wrote: Lars Ellenberg wrote: meanwhile, please, anyone interested, the drbd paper for LinuxConf Eu 2007 is finalized. http://www.drbd.org/fileadmin/drbd/publications/drbd8.linux-conf.eu.2007.pdf but it does give a good overview about what DRBD actually is, what exact problems it tries to solve, and what developments to expect in the near future. so you can make up your mind about "Do we need it?", and "Why DRBD? Why not NBD + MD-RAID?"

I may have made a mistake when asking how it compares to NBD+MD. Let me retry: what's the functional difference between GFS2 on a DRBD vs. GFS2 on a DAS SAN?

GFS is a distributed filesystem; DRBD is a replicated block device. you wouldn't do GFS on top of DRBD, you would do ext2/3, XFS, etc. DRBD is much closer to the NBD+MD option.

now, I am not an expert on either option, but there are a couple of things that I would question about the DRBD+MD option:

1. when the remote machine is down, how does MD deal with it for reads and writes?

2. MD over a local drive will alternate reads between mirrors (or so I've been told); doing so over the network is wrong.

3. when writing, will MD wait for the network I/O to get the data saved on the backup before returning from the syscall? or can it sync the data out lazily?

Now, shared remote block access should theoretically be handled, as does DRBD, by a block layer driver, but realistically it may be more appropriate to let it be handled by the combining end user, like OCFS or GFS.

there are times when you want to replicate at the block layer, and there are times when you want to have a filesystem do the work. don't force a filesystem on use-cases where a block device is the right answer.

David Lang
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
per the message below, MD (or DM) would need to be modified to work reasonably well with one of the disk components being over an unreliable link (like a network link).

are the MD/DM maintainers interested in extending their code in this direction? or would they prefer to keep it simpler by being able to continue to assume that the raid components are connected over a highly reliable connection?

if they are interested in adding (and maintaining) this functionality, then there is a real possibility that NBD+MD/DM could eliminate the need for DRBD. however, if they are not interested in adding all the code to deal with the network-type issues, then the argument that DRBD should not be merged because you can do the same thing with MD/DM + NBD is invalid and can be dropped/ignored.

David Lang

On Sun, 12 Aug 2007, Paul Clements wrote: Iustin Pop wrote: On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:

> On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
> > now, I am not an expert on either option, but there are a couple of things that I would question about the DRBD+MD option
> >
> > 1. when the remote machine is down, how does MD deal with it for reads and writes?
> I suppose it kicks the drive and you'd have to re-add it by hand unless done by a cronjob.

Yes, and with a bitmap configured on the raid1, you just resync the blocks that have been written while the connection was down.

From my tests, since NBD doesn't have a timeout option, MD hangs in the write to that mirror indefinitely, somewhat like when dealing with a broken IDE driver/chipset/disk.

Well, if people would like to see a timeout option, I actually coded up a patch a couple of years ago to do just that, but I never got it into mainline because you can do almost as well by doing a check at user-level (I basically ping the nbd connection periodically and if it fails, I kill -9 the nbd-client).

> > 2. 
MD over a local drive will alternate reads between mirrors (or so I've been told); doing so over the network is wrong.
> Certainly. In which case you set "write_mostly" (or even write_only, not sure of its name) on the raid component that is nbd.
>
> > 3. when writing, will MD wait for the network I/O to get the data saved on the backup before returning from the syscall? or can it sync the data out lazily?
> Can't answer this one - ask Neil :)

MD has the write-mostly/write-behind options, which help in this case but only up to a certain amount. You can configure write_behind (aka asynchronous writes) to buffer as much data as you have RAM to hold. At a certain point, presumably, you'd want to just break the mirror and take the hit of doing a resync once your network leg falls too far behind.

-- Paul
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
On Mon, 13 Aug 2007, David Greaves wrote:

[EMAIL PROTECTED] wrote: per the message below, MD (or DM) would need to be modified to work reasonably well with one of the disk components being over an unreliable link (like a network link). are the MD/DM maintainers interested in extending their code in this direction? or would they prefer to keep it simpler by being able to continue to assume that the raid components are connected over a highly reliable connection? if they are interested in adding (and maintaining) this functionality, then there is a real possibility that NBD+MD/DM could eliminate the need for DRBD. however, if they are not interested in adding all the code to deal with the network-type issues, then the argument that DRBD should not be merged because you can do the same thing with MD/DM + NBD is invalid and can be dropped/ignored. David Lang

As a user I'd like to see md/nbd extended to cope with unreliable links. I think md could be better at handling link exceptions. My unreliable memory recalls sporadic issues with hot-plug leaving md hanging, and certain lower level errors (or even very high latency) causing unsatisfactory behaviour in what is supposed to be a fault 'tolerant' subsystem. Would this just be relevant to network devices, or would it improve support for jostled usb and sata hot-plugging, I wonder?

good question. I suspect that some of the error handling would be similar (for devices that are unreachable not hanging the system, for example), but a lot of the rest would be different (do you really want to try to auto-resync to a drive that you _think_ just reappeared? what if it's a different drive? how can you be sure?). the error rate of a network is going to be significantly higher than for USB or SATA drives (although I suppose iscsi would be similar).

David Lang
Re: Raid5 software problems after losing 4 disks for 48 hours
Wilson Wilson wrote:
> Neil great stuff, it's online now!!!

Congratulations :)

> I am still unsure how this raid5 volume was partially readable with 4 disks missing. My understanding is each file is written across all disks apart from one, which is used for CRC. So if 2 disks are offline the whole thing should be unreadable.

I'll try :) md doesn't operate at a file level; it operates on chunks. The chunk could be 64Kb in size. For raid5 each stripe is made of n-1 chunks (raid6 would be n-2). When a stripe is read, if your file is in one of the chunks that's still there, then you're in luck. I guess md knows it's degraded and gives back as much data as possible. This means that you have a certain probability of accessing a given file depending on its size, the filesystem, and the degree to which the array is degraded. FWIW I'd *never* try a r/w operation on such a degraded array.

Speculation: I'm surprised you could mount such a 'sparse' array though. I wonder if some filesystems (like xfs) would just barf as they mounted, because they have more distributed mount-time data structures and would spot the missing chunks. Others (ext3?) may just mount and try to read blocks on demand.

David
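The chunk explanation above can be made concrete with a toy layout. This is deliberately simplified (round-robin data with a rotating parity chunk) and is NOT md's actual left-symmetric algorithm; it only shows why a small file's readability depends on which member its chunk happened to land on:

```python
CHUNK = 64 * 1024   # the 64Kb chunk size from the reply

def data_chunk_member(offset, n_disks, chunk=CHUNK):
    """Member index holding the data chunk at a byte offset.
    Toy raid5 layout: n-1 data chunks per stripe, parity rotating by stripe."""
    stripe, slot = divmod(offset // chunk, n_disks - 1)
    parity = stripe % n_disks                   # member holding parity this stripe
    return slot if slot < parity else slot + 1  # skip over the parity member

failed = {1, 3, 4, 6}   # hypothetical: 4 of 8 members missing
# A file occupying a single chunk is readable iff that chunk's member survives.
print(data_chunk_member(0, 8) in failed)      # -> True  (chunk 0 lands on member 1)
print(data_chunk_member(CHUNK, 8) in failed)  # -> False (chunk 1 lands on member 2)
```

With 4 of 8 members gone, roughly half the data chunks of a large file sit on missing disks, which is why some files come back intact and others are unreadable.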
Re: New FAQ entry? (was IBM xSeries stop responding during RAID1 reconstruction)
OK :)

David

Niccolo Rigacci wrote:
> Thanks to the several guys in this list, I have solved my problem and elaborated this; can it be a new FAQ entry?
>
> Q: Sometimes when a RAID volume is resyncing, the system seems to lock up: every disk activity is blocked until the resync is done.
>
> A: This is not strictly related to Linux RAID; this is a problem related to the Linux kernel and the disk subsystem: under no circumstances should a process get all the disk resources, preventing others from accessing them.
>
> You can control the max speed at which RAID reconstruction is done by setting it, say, to 5 MB/s:
>
> echo 5000 > /proc/sys/dev/raid/speed_limit_max
>
> This is just a workaround: you have to determine by trial and error the max speed that does not lock up your system, and you cannot predict what the disk load will be in the future when the RAID is resyncing for some reason.
>
> Starting from version 2.6, the Linux kernel has several choices for the I/O scheduler to be used. The default is the anticipatory scheduler, which seems to be sub-optimal under high resync load. If your kernel has the CFQ scheduler compiled in, use it during resync.
>
> From the command line you can see which schedulers are supported and change the scheduler on the fly (remember to do it for each RAID disk):
>
> # cat /sys/block/hda/queue/scheduler
> noop [anticipatory] deadline cfq
> # echo cfq > /sys/block/hda/queue/scheduler
>
> Otherwise you can recompile your kernel and set CFQ as the default I/O scheduler (CONFIG_DEFAULT_CFQ=y in Block layer, IO Schedulers, Default I/O scheduler).
Re: Large single raid and XFS or two small ones and EXT3?
Adam Talbot wrote:
> OK, this is a topic I really need to get in on.
> I have spent the last few weeks benchmarking my new 1.2TB, 6 disk, RAID6 array.

Very interesting. Thanks. Did you get around to any 'tuning'? Things like raid chunk size, external logs for xfs, blockdev readahead on the underlying devices and the raid device?

David
Re: Large single raid and XFS or two small ones and EXT3?
On 6/23/06, Nix <[EMAIL PROTECTED]> wrote:

On 23 Jun 2006, PFC suggested tentatively:
> - ext3 is slow if you have many files in one directory, but has more mature tools (resize, recovery etc)

This is much less true if you turn on the dir_index feature.

However, even with dir_index, deleting large files is still much slower with ext2/3 than with xfs or jfs.

-Dave
Re: raid issues after power failure
Francois Barre wrote:
> 2006/7/1, Ákos Maróy <[EMAIL PROTECTED]>:
>> Neil Brown wrote:
>> > Try adding '--force' to the -A line.
>> > That tells mdadm to try really hard to assemble the array.
>>
>> thanks, this seems to have solved the issue...
>>
>> Akos
>
> Well, Neil, I'm wondering,
> It seemed to me that Akos' description of the problem was that re-adding the drive (with mdadm not complaining about anything) would trigger a resync that would not even start.
> But as your '--force' does the trick, it implies that the resync was not really triggered after all without it... Or did I miss a bit of log Akos provided that did say so?
> Could there be a place here for an error message?
>
> More generally, could it be useful to build up a recovery howto, based on the experiences on this list (I guess 90% of the posts are related to recoveries)?
> Not in terms of a standard disk loss, but in terms of a power failure or a major disk problem. You know, re-creating the array, rolling the dice, and *tada!* your data is back again... I could not find a bit of doc about this.

Francois, I have started to put a wiki in place here: http://linux-raid.osdl.org/

My reasoning was *exactly* that - there is reference information for md, but sometimes the incantations need a little explanation, and often the diagnostics are not obvious... I've been subscribed to linux-raid since the middle of last year and I've been going through old messages looking for nuggets to base some docs around. I haven't had a huge amount of time recently so I've just scribbled on it for now - I wanted to present something a little more polished to the community - but since you're asking...

So don't consider this an official announcement of a usable work yet - more a 'Please contact me if you would like to contribute' (just so I can keep track of interested parties) and we can build something up...
David
Re: [PATCH] enable auto=yes by default when using udev
Neil Brown wrote:
> I guess I could test for both, but then udev might change again. I'd really like a more robust check.
>
> Maybe I could test if /dev was a mount point?

IIRC you can have diskless machines with a shared root and an nfs-mounted static /dev/.

David
Re: SWRaid Wiki
Francois Barre wrote:
> Hello David, all,
>
> You pointed to http://linux-raid.osdl.org as a future resource for SwRAID and MD knowledge base.

Yes. it's not ready for public use yet, so I've not announced it formally - I just mention it to people when things pop up.

> In fact, the TODO page on the wiki is empty...

Hmm, yes... maybe it should say "build todo list". One action I am pursuing is "take over the official RAID FAQ". I've made contact with the authors and we're discussing licenses etc... Horrid stuff but important to many. Speaking of which, Neil, if you read this - are the man pages under the GFDL or the GPL?

> But I would like to help on feeding this wiki with all the clues and experiences posted on the ML,

That would be worthwhile.

> and it would first be interesting to build up the TODO list, which could start by:
> - reference various situations where help can be provided: recovery, diagnostics, statistics,
> - create a comprehensive list of success stories & good design/techniques, in order to help people design their own RAID systems. In my opinion, this deals both with software params (raid level, chunk size, fs, ...) and with hardware decisions (sata vs. scsi, the right controller, ...)

Well, I wanted to focus more on refining key information from such stories. After all, a success story is only relevant to a particular situation. I'd rather develop a diagnostic approach which leads people through a diagnostic process and explains when to use certain tools/options. That would also be something we could keep up to date, whereas an actual story loses relevance over time.

> PS: I really like your "RAID Recovery" page:
> "If this happens then first of all: don't panic. Seriously. Don't rush into anything..."

yes, but so true...

David
Re: Mounting array was read write for about 3 minutes, then Read-only file system error
On 7/17/06, Neil Brown <[EMAIL PROTECTED]> wrote: On Thursday July 6, [EMAIL PROTECTED] wrote:
> I created a raid1 array using /dev/disk/by-id with (2) 250GB USB 2.0 drives. It was working for about 2 minutes until I tried to copy a directory tree from one drive to the array and then cancelled it midstream. After cancelling the copy, when I list the contents of the directory it doesn't show anything there.
>
> When I try to create a file, I get the following error msg:
>
> [EMAIL PROTECTED] ~]# cd /mnt/usb250
> [EMAIL PROTECTED] usb250]# ls
> lost+found
> [EMAIL PROTECTED] usb250]# touch test.txt
> touch: cannot touch `test.txt': Read-only file system

Sounds like you got some disk errors, so the filesystem went readonly. Is there anything in /var/log/messages about errors at that time?

What does /proc/mounts show for that filesystem when it's rw? Is it rw,errors=remount-ro?

David
md reports: unknown partition table
Hi

After a powercut I'm trying to mount an array and failing :(

teak:~# mdadm --assemble /dev/media --auto=p /dev/sd[bcdef]1
mdadm: /dev/media has been started with 5 drives.

Good. However:

teak:~# mount /media
mount: /dev/media1 is not a valid block device
teak:~# dd if=/dev/media1 of=/dev/null
dd: opening `/dev/media1': No such device or address
teak:~# dd if=/dev/media of=/dev/null
792442+0 records in
792441+0 records out
405729792 bytes transferred in 4.363571 seconds (92981135 bytes/sec)
(after ^C)

dmesg shows:

raid5: device sdb1 operational as raid disk 0
raid5: device sdf1 operational as raid disk 4
raid5: device sde1 operational as raid disk 3
raid5: device sdd1 operational as raid disk 2
raid5: device sdc1 operational as raid disk 1
raid5: allocated 5235kB for md_d127
raid5: raid level 5 set md_d127 active with 5 out of 5 devices, algorithm 2
RAID5 conf printout:
--- rd:5 wd:5 fd:0
disk 0, o:1, dev:sdb1
disk 1, o:1, dev:sdc1
disk 2, o:1, dev:sdd1
disk 3, o:1, dev:sde1
disk 4, o:1, dev:sdf1
md_d127: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0
created bitmap (5 pages) for device md_d127
md_d127: unknown partition table

That last line looks odd...

It was created like so:

mdadm --create /dev/media --level=5 -n 5 -e1.2 --bitmap=internal --name=media --auto=p /dev/sd[bcdef]1

and the xfs fstab entry is:

/dev/media1 /media xfs rw,noatime,logdev=/dev/media2 0 0

fdisk /dev/media shows:

Device Boot Start End Blocks Id System
/dev/media1 1 312536035 1250144138 83 Linux
/dev/media2 312536036 312560448 97652 da Non-FS data

cfdisk even gets the filesystem right... which is expected.
teak:~# ll /dev/media*
brw-rw---- 1 root disk 254, 192 2006-07-18 17:18 /dev/media
brw-rw---- 1 root disk 254, 193 2006-07-18 17:18 /dev/media1
brw-rw---- 1 root disk 254, 194 2006-07-18 17:18 /dev/media2
brw-rw---- 1 root disk 254, 195 2006-07-18 17:18 /dev/media3
brw-rw---- 1 root disk 254, 196 2006-07-18 17:18 /dev/media4
teak:~# uname -a
Linux teak 2.6.16.19-teak-060602-01 #3 PREEMPT Sat Jun 3 09:20:24 BST 2006 i686 GNU/Linux
teak:~# mdadm -V
mdadm - v2.5.2 - 27 June 2006

David
--
Re: XFS and write barrier
On Tue, Jul 18, 2006 at 06:58:56PM +1000, Neil Brown wrote: > On Tuesday July 18, [EMAIL PROTECTED] wrote: > > On Mon, Jul 17, 2006 at 01:32:38AM +0800, Federico Sevilla III wrote: > > > On Sat, Jul 15, 2006 at 12:48:56PM +0200, Martin Steigerwald wrote: > > > > I am currently gathering information to write an article about journal > > > > filesystems with emphasis on write barrier functionality, how it > > > > works, why journalling filesystems need write barrier and the current > > > > implementation of write barrier support for different filesystems. > > "Journalling filesystems need write barrier" isn't really accurate. > They can make good use of write barrier if it is supported, and where > it isn't supported, they should use blkdev_issue_flush in combination > with regular submit/wait. blkdev_issue_flush() causes a write cache flush - just like a barrier typically causes a write cache flush up to the I/O with the barrier in it. Both of these mechanisms provide the same thing - an I/O barrier that enforces ordering of I/Os to disk. Given that filesystems already indicate to the block layer when they want a barrier, wouldn't it be better to get the block layer to issue this cache flush if the underlying device doesn't support barriers and it receives a barrier request? FWIW, Only XFS and Reiser3 use this function, and only then when issuing a fsync when barriers are disabled to make sure a common test (fsync then power cycle) doesn't result in data loss... > > Noone here seems to know, maybe Neil &| the other folks on linux-raid > > can help us out with details on status of MD and write barriers? > > In 2.6.17, md/raid1 will detect if the underlying devices support > barriers and if they all do, it will accept barrier requests from the > filesystem and pass those requests down to all devices. > > Other raid levels will reject all barrier requests. Any particular reason for not supporting barriers on the other types of RAID? Cheers, Dave. 
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: md reports: unknown partition table - fixed.
David Greaves wrote: > Hi > > After a powercut I'm trying to mount an array and failing :( A reboot after tidying up /dev/ fixed it. The first time through I'd forgotten to update the boot scripts and they were assembling the wrong UUID. That was fine; I realised this and ran the manual assemble: mdadm --assemble /dev/media /dev/sd[bcdef]1 dmesg cat /proc/mdstat All OK (but I'd forgotten that this was a partitioned array). I suspect the device entries for /dev/media[1234] from last time were hanging about. mount /media fdisk /dev/media So I guess this fails because the major-minor are for a non-p md device? mdadm --assemble /dev/media --auto=p /dev/sd[bcdef]1 mdadm --stop /dev/media This fails because I'm on mdadm 2.4.1 mdadm --assemble /dev/media --auto=p /dev/sd[bcdef]1 cat /proc/mdstat mdadm --stop /dev/md_d0 mdadm --stop /dev/md0 cat /proc/mdstat So by now I upgrade to mdadm 2.5.1 in another session. mdadm --stop /dev/media dmesg cat /proc/mdstat and it stops. mdadm --assemble /dev/media --auto=p /dev/sd[bcdef]1 But now it won't create working devices... Much messing about with assemble and I try a kernel upgrade - can't because the driver for my video card won't compile under 2.6.17 yet so WTF, I suspect major/minor numbers so just reboot it under the same kernel. All seems well. I think there's a bug here somewhere. I wonder/suspect that the superblock should contain the fact that it's a partitioned/able md device? David -- - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Serious XFS bug in 2.6.17 kernels - FYI
Just an FYI for my friends here who may be running 2.6.17.x kernels and using XFS and who may not be monitoring lkml :) There is a fairly serious corruption problem that has recently been discussed on lkml and affects all 2.6.17 before -stable .7 (not yet released) Essentially the fs can be corrupted and it's serious because the current xfs_repair tools may make the problem worse, not better. There is a 1-line patch that can be applied : http://marc.theaimsgroup.com/?l=linux-kernel&m=115315508506996&w=2 FAQ message here http://marc.theaimsgroup.com/?l=linux-xfs&m=115338022506482&w=2 FAQ: http://oss.sgi.com/projects/xfs/faq.html#dir2 It appears that efforts are being focused on the repair tools now. It appears to me that the best response is to patch the kernel, reboot, backup the fs, recreate the fs and restore - but please read up before taking any action. David -- - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: host based mirror distance in a fc-based SAN environment
Stefan Majer wrote:
> Hi,
>
> I'm curious if there are some numbers out on up to which distance it's
> possible to mirror (raid1) 2 FC-LUNs. We have 2 datacenters with an
> effective distance of 11km. The fabrics in one datacenter are connected
> to the fabrics in the other datacenter with 5 dark fibres, both about
> 11km in distance.
>
> I want to set up servers which mirror their LUNs across the SAN-boxen
> in both datacenters. On top of this mirrored LUN I put lvm2.
>
> So the question is: does anybody have some numbers up to which distance
> this method works?

No. But have a look at man mdadm in later mdadm:

  -W, --write-mostly
      subsequent devices listed in a --build, --create, or --add command
      will be flagged as 'write-mostly'. This is valid for RAID1 only and
      means that the 'md' driver will avoid reading from these devices if
      at all possible. This can be useful if mirroring over a slow link.

  --write-behind=
      Specify that write-behind mode should be enabled (valid for RAID1
      only). If an argument is specified, it will set the maximum number
      of outstanding writes allowed. The default value is 256. A
      write-intent bitmap is required in order to use write-behind mode,
      and write-behind is only attempted on drives marked as write-mostly.

Which suggests that the WAN/LAN latency shouldn't impact you except on failure.

HTH

David
--
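A hedged sketch of how those two options combine on the command line (device names are invented; --write-mostly flags the devices listed after it, and --write-behind requires the bitmap):

```shell
# Hypothetical devices: sda1 is the local LUN, sdb1 the remote 11km LUN.
# Reads stay on the local half; writes to the remote half may lag by up
# to 256 outstanding requests, tracked in the write-intent bitmap.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --bitmap=internal --write-behind=256 \
      /dev/sda1 --write-mostly /dev/sdb1
```

Check that your mdadm/kernel versions actually support write-behind before relying on it.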
Re: let md auto-detect 128+ raid members, fix potential race condition
Alexandre Oliva wrote:
> On Jul 30, 2006, Neil Brown <[EMAIL PROTECTED]> wrote:
>
>> 1/ It just isn't "right". We don't mount filesystems from partitions
>> just because they have type 'Linux'. We don't enable swap on
>> partitions just because they have type 'Linux swap'. So why do we
>> assemble md/raid from partitions that have type 'Linux raid
>> autodetect'?
>
> Similar reason to why vgscan finds and attempts to use any partitions
> that have the appropriate type/signature (difference being that raid
> auto-detect looks at the actual partition type, whereas vgscan looks
> at the actual data, just like mdadm, IIRC): when you have to bootstrap
> from an initrd, you don't want to be forced to have the correct data
> in the initrd image, since then any reconfiguration requires the info
> to be introduced in the initrd image before the machine goes down.
> Sometimes, especially in case of disk failures, you just can't do
> that.

This debate is not about generic autodetection - a good thing (tm) - but in-kernel vs userspace autodetection. Your example supports Neil's case - the proposal is to use initrd to run mdadm which then (kinda) does what vgscan does.

>> So my preferred solution to the problem is to tell people not to use
>> (in kernel) autodetect. Quite possibly this should be documented in
>> the code, and maybe even have a KERN_INFO message if more than 64
>> devices are autodetected.
>
> I wouldn't have a problem with that, since then distros would probably
> switch to a more recommended mechanism that works just as well, i.e.,
> ideally without requiring initrd-regeneration after reconfigurations
> such as adding one more raid device to the logical volume group
> containing the root filesystem.

That's supported in today's mdadm. Look at --uuid and --name.

>> So: Do you *really* need to *fix* this, or can you just use 'mdadm'
>> to assemble your arrays instead?
>
> I'm not sure.
> I'd expect not to need it, but the limited feature
> currently in place, that initrd uses to bring up the raid1 devices
> containing the physical volumes that form the volume group where the
> logical volume with my root filesystem is also brings up various raid6
> physical volumes that form an unrelated volume group, and it does so
> in such a way that the last of them, containing the 128th fd-type
> partition in the box, ends up being left out, so the raid device it's
> a member of is brought up either degraded or missing the spare member,
> none of which are good.
>
> I don't know that I can easily get initrd to replace nash's
> raidautorun for mdadm unless mdadm has a mode to bring up any arrays
> it can find, as opposed to bringing up a specific array out of a given
> list of members or scanning for members. Either way, this won't fix
> the problem 2) that you mentioned, but requiring initrd-regeneration
> after extending the volume group containing the root device is another
> problem that the current modes of operation of mdadm AFAIK won't
> contemplate, so switching to it will trade one problem for another,
> and the latter is IMHO more common than the former.

I think you should name your raid1 (maybe "hostname-root") and use initrd to bring it up by --name using:

  mdadm --assemble --scan --config partitions --name hostname-root

It could also, later in the boot process, bring up "hostname-raid6" by --name too.

  mdadm --assemble --scan --config partitions --name hostname-raid6

David
--
Re: [PATCH] md: new bitmap sysfs interface
Neil Brown wrote: > write-bits-here-to-dirty-them-in-the-bitmap > > is probably (no, definitely) too verbose. > Any better suggestions? It's not actually a bitmap is it? It takes a number or range and *operates* on a bitmap. so: dirty-chunk-in-bitmap or maybe: dirty-bitmap-chunk David - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid5/lvm setup questions
Shane wrote:
> Hello all,
>
> I'm building a new server which will use a number of disks
> and am not sure of the best way to go about the setup.
> There will be 4 320gb SATA drives installed at first. I'm
> just wondering how to set the system up for upgradability.
> I'll be using raid5 but not sure whether to use lvm over
> the raid array.
>
> By upgradability, I'd like to do several things. Adding
> another drive of the same size to the array. I understand
> reshape can be used here to expand the underlying block
> device.

Yes, it can.

> If the block device is the pv of an lvm array,
> would that also automatically expand in which I would
> create additional lvs in the new space. If this isn't
> automatic, are there ways to do it manually?

Not automatic AFAIK - but doable.

> What about replacing all four drives with larger units.
> Say going from 300gbx4 to 500gbx4. Can one replace them
> one at a time, going through fail/rebuild as appropriate
> and then expand the array into the unused space

Yes.

> or would one have to reinstall at that point.

No.

None of the requirements above drive you to layering lvm over the top. That's not to say don't do it - but you certainly don't *need* to do it.

Pros:
* allows snapshots (for consistent backups)
* allows various lvm block movements etc...
* Can later grow vg to use discrete additional block devices without raid5 grow limitations (eg same-ish size disks etc)

Cons:
* extra complexity -> risk of bugs/admin errors...
* performance impact

As an example of the cons: I've just set up lvm2 over my raid5 and whilst testing snapshots, the first thing that happened was a kernel BUG and an oops...

David
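The manual "grow the array, then grow LVM" step can be sketched like this (device, VG and LV names are invented; wait for the reshape to finish before resizing anything):

```shell
mdadm /dev/md0 --add /dev/sde1            # add the new same-size disk
mdadm --grow /dev/md0 --raid-devices=5    # reshape from 4 to 5 members
# ...watch /proc/mdstat until the reshape completes...

pvresize /dev/md0                         # tell LVM the PV got bigger
lvcreate -L 100G -n newvol vg0            # carve new LVs out of the space,
                                          # or lvextend an existing one
```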
Re: raid5/lvm setup questions
Shane wrote: > On Mon, Aug 07, 2006 at 08:57:13PM +0100, Nix wrote: >> On 5 Aug 2006, David Greaves prattled cheerily: >>> As an example of the cons: I've just set up lvm2 over my raid5 and whilst >>> testing snapshots, the first thing that happened was a kernel BUG and an >>> oops... >> I've been backing up using writable snapshots on LVM2 over RAID-5 for >> some time. No BUGs. > > Just performed some basic throughput tests using 4 SATA > disks in a raid5 array. The read performance on the > /dev/mdx device runs around 180mbps but if lvm is layered > over that, reads on the lv are around 130mbps. Not an > unsubstantial reduction. Check the readahead at various block levels blockdev --setra xxx I think I found the best throughput (for me) was with 0 readahead for /dev/hdX, 0 for /dev/mdX and lots for /dev/vg/lv > > I seem to recall patches to md floating around a couple > years back for partitioning of md devices. Are those still > available somewhere? man mdadm and see --auto... David -- - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
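The readahead experiment above can be sketched as follows (device names and values are illustrative; re-run the measurement after each change):

```shell
# Zero readahead on the raw disks and the md device, large on the LV,
# so only the top layer does speculative reads.
blockdev --setra 0 /dev/sd[abcd]
blockdev --setra 0 /dev/md0
blockdev --setra 8192 /dev/vg0/data

blockdev --getra /dev/vg0/data                      # confirm the setting
dd if=/dev/vg0/data of=/dev/null bs=1M count=2048   # crude sequential read test
```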
Re: raid5/lvm setup questions
Nix wrote:
> On 5 Aug 2006, David Greaves prattled cheerily:

that's me :)

>> As an example of the cons: I've just set up lvm2 over my raid5 and whilst
>> testing snapshots, the first thing that happened was a kernel BUG and an
>> oops...
>
> I've been backing up using writable snapshots on LVM2 over RAID-5 for
> some time. No BUGs.

I tried but it didn't recur. I sent a report to lkml.

> I think the blame here is likely to be layable at the snapshots' door,
> anyway: they're still a little wobbly and the implementation is pretty
> complex: bugs surface on a regular basis.

Hmmm. Bugs in a backup strategy. Hmmm.

I think I can live with a nightly shutdown of the daemons whilst rsync does its stuff across the LAN.

David
--
Re: Resize on dirty array?
James Peverill wrote:
>>> RAID is no excuse for backups.
>
> In this case the raid WAS the backup... however it seems it turned out
> to be less reliable than the single disks it was supporting. In the
> future I think I'll make sure my disks have varying ages so they don't
> fail all at once.
>
> James

No, it wasn't *less* reliable than a single drive; you benefited as soon as a drive failed. At that point you would have been just as toasted as you may well be at the moment. With RAID you then stressed the remaining drives to the point of a second failure (not that you had much choice - you *could* have spent money on enough media to mirror your data whilst you played with your only remaining copy - that's a cost/risk tradeoff you chose not to make. I've made the same choice in the past - I've been lucky - you were not - sorry.)

I can't see where you mention the kernel version you're running? md can perform validation sync's on a periodic basis in later kernels - Debian's mdadm enables this in cron.

- David

PS Reorganise lines from distributed reply as you like :)
--
Re: Resize on dirty array?
On 8/10/06, dean gaudet <[EMAIL PROTECTED]> wrote: - set up smartd to run long self tests once a month. (stagger it every few days so that your disks aren't doing self-tests at the same time) I personally prefer to do a long self-test once a week, a month seems like a lot of time for something to go wrong. - run nightly diffs of smartctl -a output on all your drives so you see when one of them reports problems in the smart self test or otherwise has a Current_Pending_Sectors or Realloc event... then launch a repair sync_action. You can (and probably should) setup smartd to automatically send out email alerts as well. -Dave - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
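The nightly-diff idea can be sketched as a small helper plus the matching smartd.conf lines; everything here (function name, paths, the staggered schedule) is illustrative:

```shell
# smart_diff: compare today's smartctl report against yesterday's saved
# copy, print the changes, and save the new report as the next baseline.
smart_diff() {  # usage: smart_diff <new-report> <state-file>
    if [ -f "$2" ] && ! diff -q "$2" "$1" >/dev/null; then
        diff "$2" "$1"          # in cron, pipe this to mail instead
    fi
    cp "$1" "$2"
}

# Nightly cron would do something like:
#   smartctl -a /dev/sda > /tmp/sda.now && smart_diff /tmp/sda.now /var/lib/smart/sda
#
# And /etc/smartd.conf entries staggering monthly long self-tests
# (1st and 15th of the month, 2am) look like:
#   /dev/sda -a -s L/../01/./02
#   /dev/sdb -a -s L/../15/./02
```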
Re: Resize on dirty array?
On 8/11/06, dean gaudet <[EMAIL PROTECTED]> wrote: On Fri, 11 Aug 2006, David Rees wrote: > On 8/10/06, dean gaudet <[EMAIL PROTECTED]> wrote: > > - set up smartd to run long self tests once a month. (stagger it every > > few days so that your disks aren't doing self-tests at the same time) > > I personally prefer to do a long self-test once a week, a month seems > like a lot of time for something to go wrong. unfortunately i found some drives (seagate 400 pata) had a rather negative effect on performance while doing self-test. Interesting that you noted negative performance, but I typically schedule the tests for off-hours anyway where performance isn't critical. How much of a performance hit did you notice? -Dave - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel RAID support
Richard Scobie wrote:
> Josh Litherland wrote:
>> On Sun, 2006-09-03 at 15:56 +1200, Richard Scobie wrote:
>>
>>> I am building 2.6.18rc5-mm1 and I cannot find the entry under "make
>>> config", to enable the various RAID options.
>>
>> Under "Device Drivers", switch on "Multi-device support".
>
> Thanks. I must be going nuts, as it does not appear as an option. Below
> is the list under "Device Drivers" if I do a "make menuconfig":

Recently reported on lkml; Andrew Morton said:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc5/2.6.18-rc5-mm1/hot-fixes/

contains a fix for this.

HTH

David

PS on kernel mailing lists do a reply-all and don't trim cc lists :)
--
Re: Messed up creating new array...
On 9/8/06, Ruth Ivimey-Cook <[EMAIL PROTECTED]> wrote: I messed up slightly when creating a new 6-disk raid6 array, and am wondering if there is a simple answer. The problem is that I didn't partition the drives, but simply used the whole drive. All drives are of the same type and using the Supermicro SAT2-MV8 controller. This should work: 1. Unmount filesystem. 2. Shrink file system to something a bit smaller. Since it's a big array, 1GB should give you plenty of room. 3. Shrink raid array to something in between the new fs size and old fs size. Make sure you don't shrink it smaller than the filesystem! 4. Remove a disk from the array (fail/remove) 5. Partition disk 6. Add partition back to array 7. Repeat steps 4-6 for all disks in the array. 8. Now the whole array should be on partitions. Grow the raid array back to "max". 9. Grow filesystem to the partition size. -Dave - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
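Steps 1-9 above can be sketched as commands for an ext3 filesystem on a 6-disk array; the sizes and device names are purely illustrative, and every size must be double-checked before shrinking (shrinking the array below the filesystem destroys data):

```shell
umount /mnt/array                         # 1. unmount
e2fsck -f /dev/md0                        #    resize2fs insists on a clean fs
resize2fs /dev/md0 900G                   # 2. shrink the fs well below target
mdadm --grow /dev/md0 --size=240000000    # 3. per-device size in KiB, between
                                          #    the fs size and the old size

# 4-7. one disk at a time, waiting for each resync (/proc/mdstat):
mdadm /dev/md0 --fail /dev/sda
mdadm /dev/md0 --remove /dev/sda
fdisk /dev/sda                            # 5. one partition, type fd
mdadm /dev/md0 --add /dev/sda1            # 6. rebuilds onto the partition
# ...repeat for the remaining five disks...

mdadm --grow /dev/md0 --size=max          # 8. expand to the partition size
resize2fs /dev/md0                        # 9. grow the fs to fill the array
```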
Re: Simulating Drive Failure on Mirrored OS drive
andy liebman wrote:
> I tried simply unplugging one drive from its power and from its SATA
> connector. The OS didn't like that at all. My KDE session kept running,
> but I could no longer open any new terminals. I couldn't become root in
> an existing terminal that was already running. And I couldn't SSH into
> the machine.

That's likely to be because sata hotswap isn't supported (yet). dmesg should give you more info.

> I know that simply unplugging a drive is not the same as a drive failing
> or timing out. But is there a more realistic way to simulate a failure
> so that I can know that the mirror will work when it's needed?

Read up on the md-faulty device.

Also, FWIW, md works just fine :) (Lots of other things can go wrong so testing your setup is a good idea though)

David
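Besides md-faulty, the failure path can be exercised directly from userspace — a sketch, with an invented array and member:

```shell
mdadm /dev/md0 --fail /dev/sdb1      # mark the mirror half as faulty
cat /proc/mdstat                     # the mirror should now show [U_]
mdadm /dev/md0 --remove /dev/sdb1    # take the "failed" member out
mdadm /dev/md0 --add /dev/sdb1       # re-add it and watch the resync
```

This proves the array keeps running degraded and that recovery works, without yanking cables.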
Re: Recipe for Mirrored OS Drives
andy liebman wrote: > A few weeks ago, I promised that I would put my "recipe" here for > creating "mirrored OS drives from an existing OS Drive". This "recipe" > combines what I learned from MANY OTHER sometimes conflicting documents > on the same subject -- documents that were probably developed for > earlier kernels and distributions. Feel free to add it here: http://linux-raid.osdl.org/index.php/Main_Page I haven't been able to do much for a few weeks (typical - I find some time and use it all up just getting the basic setup done - still it's started!) David - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 003 of 6] md: Remove 'experimental' classification from raid5 reshape.
Typo in first line of this patch :) > I have had enough success reports not^H^H^H to believe that this > is safe for 2.6.19. - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mdadm and raidtools - noob
Mark Ryden wrote:
> Hello linux-raid list,
>
> I want to create a Linux Software RAID1 on linux FC5 (x86_64),
> from SATA II disks. I am a noob in this.

No problems.

> I looked for it and saw that as far as I understand,
> raidtools is quite old - from 2003.
> for example, http://people.redhat.com/mingo/raidtools/

correct

> So my question is this:
> is raidtools deprecated ?

Yes

> Is it possible at all to use raidtool to create linux software RAID1,
> running 2.6.17-1.2187_FC5 kernel on x86_64 ?

Maybe - don't

> Is using mdadm the way to create linux software RAID1 ?

Yes

Is it the only way ?

No (eg EVMS)

David
--
Re: Recipe for Mirrored OS Drives
andy liebman wrote: > >> >> Feel free to add it here: >> http://linux-raid.osdl.org/index.php/Main_Page >> >> I haven't been able to do much for a few weeks (typical - I find some >> time and >> use it all up just getting the basic setup done - still it's started!) >> >> David >> > > Any hints on how to add a page? > > Andy > Yep :) First off it would help to read up on Wikis : http://meta.wikimedia.org/wiki/Help:Contents Basically you: * go to the page where you want to link from * edit that page to link to your new (not yet created) page * save your edit * click on the (red) link and you'll be given a page to edit * type... I suggest you link from http://linux-raid.osdl.org/index.php/RAID_Boot David -- - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Recipe for Mirrored OS Drives
Nix wrote: > On 2 Oct 2006, David Greaves spake: >> I suggest you link from http://linux-raid.osdl.org/index.php/RAID_Boot > > The pages don't really have the same purpose. RAID_Boot is `how to boot > your RAID system using initramfs'; this is `how to set up a RAID system > in the first place', i.e., setup. > > I'll give it a bit of a tweak-and-rename in a bit. > Fair :) FYI I've done quite a bit on the Howto section: http://linux-raid.osdl.org/index.php/Overview It still needs a lot of work I think but it's getting there... David -- - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Multiple Disk Failure Recovery
On 10/14/06, Lane Brooks <[EMAIL PROTECTED]> wrote: I am wondering if there is a way to cut my losses with these bad sectors and have it recover what it can so that I can get my raid array back to functioning. Right now I cannot get a spare disk recovery to finish because these bad sectors. Is there a way to force as much recovery as possible so that I can replace this newly faulty drive? One technique is to use ddrescue to create an image of the failing drive(s) (I would image all drives if possible) and use those images to try to retrieve your data. -Dave - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
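The imaging approach can be sketched like so (GNU ddrescue; the device names, image paths and member list are all illustrative):

```shell
# First a fast pass that skips bad areas, then retry the bad areas;
# the log file lets ddrescue resume and track what is still unread.
ddrescue -n /dev/sdc /img/sdc.img /img/sdc.log
ddrescue -r3 /dev/sdc /img/sdc.img /img/sdc.log

# Expose the image as a block device and assemble from it instead of
# the failing disk.
losetup /dev/loop0 /img/sdc.img
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/loop0
```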
Re: Need help recovering a raid5 array
[EMAIL PROTECTED] wrote:
> Hello all,

Hi

First off, don't do anything else without reading up or talking on here :) The list archive has got a lot of good material - 'help' is usually a good search term!!!

> I had a disk fail in a raid 5 array (4 disk array, no spares), and am
> having trouble recovering it. I believe my data is still safe, but I
> cannot tell what is going wrong here.

There's some useful stuff but always include:
* kernel version
* mdadm version
* relevant dmesg or similar output

What went wrong? Did /dev/sdd fail? If so then why are you adding it back to the array? Or is this now a replacement?

You should be OK - I'll reply quickly now and see if I can make some suggestions later (or sooner).

David

> When I try to rebuild the array "mdadm --assemble /dev/md0 /dev/sda2
> /dev/sdb2 /dev/sdc2 /dev/sdd2" I see "failed to RUN_ARRAY /dev/md0:
> Input/output error".
>
> dmesg shows the following:
> md: bind
> md: bind
> md: bind
> md: bind
> md: md0: raid array is not clean -- starting background reconstruction
> raid5: device sda2 operational as raid disk 0
> raid5: device sdc2 operational as raid disk 2
> raid5: device sdb2 operational as raid disk 1
> raid5: cannot start dirty degraded array for md0
> RAID5 conf printout:
>  --- rd:4 wd:3 fd:1
>  disk 0, o:1, dev:sda2
>  disk 1, o:1, dev:sdb2
>  disk 2, o:1, dev:sdc2
> raid5: failed to run raid set md0
> md: pers->run() failed ...
>
> /proc mdstat shows:
> md0 : inactive sda2[0] sdd2[3](S) sdc2[2] sdb2[1]
>
> This seems wrong, as sdd2 should not be a spare - I want it to be the
> fourth disk.
>
> The output of mdadm -E for each disk is as follows:
>
> sda2:
> /dev/sda2:
>           Magic : a92b4efc
>         Version : 00.90.00
>            UUID : c50a81fc:ef4323e6:438a7cb1:25ae35e5
>   Creation Time : Thu Jun 1 21:13:58 2006
>      Raid Level : raid5
>     Device Size : 390555904 (372.46 GiB 399.93 GB)
>      Array Size : 1171667712 (1117.39 GiB 1199.79 GB)
>    Raid Devices : 4
>   Total Devices : 4
> Preferred Minor : 0
>
>     Update Time : Sun Oct 22 23:39:06 2006
>           State : active
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 1
>        Checksum : 683f2f5c - correct
>          Events : 0.8831997
>
>          Layout : left-symmetric
>      Chunk Size : 256K
>
>     Number   Major   Minor   RaidDevice State
> this     0       8        2        0      active sync   /dev/sda2
>
>    0     0       8        2        0      active sync   /dev/sda2
>    1     1       8       18        1      active sync   /dev/sdb2
>    2     2       8       34        2      active sync   /dev/sdc2
>    3     3       0        0        3      faulty removed
>    4     4       8       50        4      spare   /dev/sdd2
>
> sdb2:
> /dev/sdb2:
>           Magic : a92b4efc
>         Version : 00.90.00
>            UUID : c50a81fc:ef4323e6:438a7cb1:25ae35e5
>   Creation Time : Thu Jun 1 21:13:58 2006
>      Raid Level : raid5
>     Device Size : 390555904 (372.46 GiB 399.93 GB)
>      Array Size : 1171667712 (1117.39 GiB 1199.79 GB)
>    Raid Devices : 4
>   Total Devices : 4
> Preferred Minor : 0
>
>     Update Time : Sun Oct 22 23:39:06 2006
>           State : active
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 1
>        Checksum : 683f2f6e - correct
>          Events : 0.8831997
>
>          Layout : left-symmetric
>      Chunk Size : 256K
>
>     Number   Major   Minor   RaidDevice State
> this     1       8       18        1      active sync   /dev/sdb2
>
>    0     0       8        2        0      active sync   /dev/sda2
>    1     1       8       18        1      active sync   /dev/sdb2
>    2     2       8       34        2      active sync   /dev/sdc2
>    3     3       0        0        3      faulty removed
>    4     4       8       50        4      spare   /dev/sdd2
>
> sdc2:
> /dev/sdc2:
>           Magic : a92b4efc
>         Version : 00.90.00
>            UUID : c50a81fc:ef4323e6:438a7cb1:25ae35e5
>   Creation Time : Thu Jun 1 21:13:58 2006
>      Raid Level : raid5
>     Device Size : 390555904 (372.46 GiB 399.93 GB)
>      Array Size : 1171667712 (1117.39 GiB 1199.79 GB)
>    Raid Devices : 4
>   Total Devices : 4
> Preferred Minor : 0
>
>     Update Time : Sun Oct 22 23:39:06 2006
>           State : active
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 1
>        Checksum : 683f2f80 - correct
>          Events : 0.8831997
>
>
Re: Raid5 or 6 here... ?
Gordon Henderson wrote:
>  1747 ?        S<   724:25 [md9_raid5]
>
> It's kernel 2.6.18 and

Wasn't the module merged to raid456 in 2.6.18?

Are your mdX_raid6's on earlier kernels? My raid 6 is on 2.6.17 and says _raid6.

Could it be that the combined kernel thread is called mdX_raid5?

David
Re: Raid5 or 6 here... ?
David Greaves wrote:
> Gordon Henderson wrote:
>>  1747 ?        S<   724:25 [md9_raid5]
>>
>> It's kernel 2.6.18 and
>
> Wasn't the module merged to raid456 in 2.6.18?
>
> Are your mdX_raid6's on earlier kernels? My raid 6 is on 2.6.17 and says _raid6
>
> Could it be that the combined kernel thread is called mdX_raid5

Yup, raid5.c now handles 4, 5 and 6 and says:

  mddev->thread = md_register_thread(raid5d, mddev, "%s_raid5");

I think I may actually be able to patch that...

David
Re: Relabeling UUID
Neil Brown wrote: > Patches to the man page to add useful examples are always welcome. And if people would like to be more verbose, the wiki is available at http://linux-raid.osdl.org/ It's now kinda useful but definitely not fully migrated from the old RAID FAQ. David - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Frequent SATA errors / port timeouts in 2.6.18.3?
Patrik Jonsson wrote:
> Hi all,
> this may not be the best list for this question, but I figure that the
> number of disks connected to users here should be pretty big...
>
> I upgraded from 2.6.17-rc4 to 2.6.18.3 about a week ago, and I've since
> had 3 drives kicked out of my 10-drive RAID5 array. Previously, I had no
> kicks over almost a year. The kernel message is:
>
> ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> ata7.00: (BMDMA stat 0x20)
> ata7.00: tag 0 cmd 0xc8 Emask 0x1 stat 0x41 err 0x4 (device error)
> ata7: EH complete
>
> Any ideas or thought would be appreciated,

SMART? Read the manpage and then try running:

  smartctl -d ata -S on /dev/...

and

  smartctl -d ata -s on /dev/...

Then look at your smartd timing and see if it's related; possibly just do a manual smartd poll.

I've had smart/libata problems (well, no, glitches) for about 2 years now but as the irq handler occasionally says "no one cared" ;) It may well not be your problem but...

David
Re: mdadm: what if - crashed OS
Assuming you can allow some downtime, get yourself a rescue CD such as 'RIP'. This will let you boot into the machine and run mdadm commands. You don't mention kernel/mdadm versions, so you may want to check they're close on the rescue CD. Then try looking at the manpage around --assemble. In particular you may want to try --scan and --uuid (if your RIP/live kernel/mdadm support it). Also check out the examples... Assuming this is a sane machine and you're not in real disaster recovery mode with drives pulled in from random boxes, then look at using the literal string "--config=partitions" (see the manpage) to avoid creating an mdadm.conf with the "DEVICE partitions" line - a PITA on live CDs where you just want a command line ;) If you can manage it, this will give you a nice warm feeling about recovering from a problem, and it's pretty safe - just common sense like making sure the live CD kernel/mdadm are either up-to-date or match your production system. HTH Also: > I have thought about this, and I can't understand how 'mdadm' decides the > health of an array. Each disk/partition used by md has a superblock which contains a unique UUID and other info, like the number of devices and the raid level. mdadm --scan looks into each partition for a superblock and notes this data. It can then group all the superblocks with the same UUID together and, for each group - knowing how many devices it should have, how many it has, and how many it needs - it can decide if the device can safely be assembled. David PS Yes, I've done this (too many times!)
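The health decision described above can be sketched in miniature. This is a toy Python model, not mdadm's actual code; the record fields and the per-level tolerance table are assumptions for illustration:

```python
# Toy model of assembling arrays from per-device superblocks:
# group records by array UUID, then check whether enough members
# are present for the RAID level. Field names are invented; real
# data would come from `mdadm --examine` on each partition.

def group_by_uuid(superblocks):
    """Group per-device superblock records by array UUID."""
    groups = {}
    for sb in superblocks:
        groups.setdefault(sb["uuid"], []).append(sb)
    return groups

def can_assemble(group, level):
    """True if this group of members can be safely started."""
    raid_disks = group[0]["raid_disks"]   # devices the array should have
    missing = raid_disks - len(group)     # devices it actually lacks
    tolerated = {0: 0, 1: raid_disks - 1, 5: 1, 6: 2}[level]
    return missing <= tolerated

sbs = [{"uuid": "aaaa", "raid_disks": 4, "device": d}
       for d in ("sda1", "sdb1", "sdc1")]
print(can_assemble(group_by_uuid(sbs)["aaaa"], 5))  # → True (1 of 4 missing)
```

A RAID6 group would pass with two members missing; a RAID0 group with none.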
Re: mdadm --grow failed
Marc Marais wrote: [snip] > Unfortunately one of the drives timed out during the operation (not a read > error - just a timeout - which I would've thought would be retried but > anyway...): > Help appreciated. (I do have a full backup of course but that's a last > resort with my luck I'd get a read error from the tape drive) Hi Marc It looks like you've since recreated the array and restored your data - good :) It doesn't appear that you mentioned the kernel and distro you are using and the software versions. I'm sure this is something people will need. David
Re: Linux Software RAID a bit of a weakness?
On 2/25/07, Richard Scobie <[EMAIL PROTECTED]> wrote: Colin Simpson wrote: > They therefore do not have the "check" option in the kernel. Is there > anything else I can do? Would forcing a resync achieve the same result > (or is that down right dangerous as the array is not considered > consistent for a while). Any thoughts apart from my one being to upgrade > them to RH5 when that appears with a probably 2.6.18 kernel (which will > presumably have "check")? Any thoughts? You could configure smartd to do regular long selftests, which would notify you on failures and allow you to take the drive offline and dd, replace etc. So what do you do when your drives in your array don't support SMART self tests for some reason? The best solution I have thought of so far is to do a `dd if=/dev/mdX of=/dev/null` periodically, but this isn't as nice as running a check in the later kernels as it's not guaranteed to read blocks from all disks. I guess you could instead do the same thing but with the underlying disks instead of the raid device, then make sure you watch the logs for disk read errors. -Dave
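The per-disk read pass suggested above can be sketched without dd. This is a hedged Python sketch: on a real system you would point it at each member disk in turn (which needs root); here it is demonstrated on an ordinary file standing in for a device:

```python
import os
import tempfile

def scrub(path, chunk=1 << 20):
    """Read a device (or file) end to end; return offsets that failed.

    A media error on a real disk surfaces as OSError (EIO); we record
    the offset and skip past it, much like `dd conv=noerror` would."""
    bad = []
    with open(path, "rb", buffering=0) as f:
        while True:
            offset = f.tell()
            try:
                data = f.read(chunk)
            except OSError:
                bad.append(offset)
                f.seek(offset + chunk)   # step over the unreadable region
                continue
            if not data:
                break
    return bad

# Demo on a throwaway file standing in for /dev/sdX:
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(3 << 20))
print("bad offsets:", scrub(tmp.name))   # → bad offsets: []
os.unlink(tmp.name)
```

Run against each member disk this reads every block, like the dd loop, and unlike reading /dev/mdX it cannot skip a device.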
Re: Linux Software RAID a bit of a weakness?
On 2/26/07, Colin Simpson <[EMAIL PROTECTED]> wrote: If I say, dd if=/dev/sda2 of=/dev/null where /dev/sda2 is a component of an active md device. Will the RAID subsystem get upset that someone else is fiddling with the disk (even in just a read only way)? And will a read error on this dd (caused by a bad block) cause md to knock out that device? The MD subsystem doesn't care if someone else is reading the disk, and I'm pretty sure that rear errors will be noticed by the MD system, either. -Dave
Re: Linux Software RAID a bit of a weakness?
On 2/26/07, Neil Brown <[EMAIL PROTECTED]> wrote: On Monday February 26, [EMAIL PROTECTED] wrote: > I'm pretty sure that rear errors will be noticed by the MD system, > either. :-) Your typing is nearly as bad as mine often is, but your intent is correct. If you independently read from a device in an MD array and get an error, MD won't notice. MD only notices errors for requests that it makes of the devices itself. Doh, 2 errors in one line! Should have read: I'm pretty sure that read errors will _not_ be noticed by the MD system, either. Good thing at least Neil understood me. :-) -Dave
Re: BLK_DEV_MD with CONFIG_NET
From: Randy Dunlap <[EMAIL PROTECTED]> Date: Tue, 20 Mar 2007 20:05:38 -0700 > Build a kernel with CONFIG_NET=n and CONFIG_BLK_DEV_MD=m. > Unless csum_partial() is built and kept by some arch Makefile, > the result is: > ERROR: "csum_partial" [drivers/md/md-mod.ko] undefined! > make[1]: *** [__modpost] Error 1 > make: *** [modules] Error 2 > > > Any suggested solutions? Anything which is ever exported to modules, which ought to be the situation in this case, should be obj-y not lib-y, right?
Re: raid10 kernel panic on sparc64
From: Jan Engelhardt <[EMAIL PROTECTED]> Date: Mon, 2 Apr 2007 02:15:57 +0200 (MEST) > just when I did > # mdadm -C /dev/md2 -b internal -e 1.0 -l 10 -n 4 /dev/sd[cdef]4 > (created) > # mdadm -D /dev/md2 > Killed > > dmesg filled up with a kernel oops. A few seconds later, the box > locked solid. Since I was only in by ssh and there is not (yet) any > possibility to reset it remotely, this is all I can give right now, > the last 80x25 screen: Unfortunately the beginning of the OOPS is the most important part - that says where exactly the kernel died; the rest of the log you showed only gives half the registers and the rest of the call trace. Please try to capture the whole thing. Please also provide hardware type information, which you should give in any bug report like this.
Re: raid10 kernel panic on sparc64
From: Jan Engelhardt <[EMAIL PROTECTED]> Date: Mon, 2 Apr 2007 02:15:57 +0200 (MEST) > Kernel is kernel-smp-2.6.16-1.2128sp4.sparc64.rpm from Aurora Corona. > Perhaps it helps, otherwise hold your breath until I reproduce it. Jan, if you can reproduce this with the current 2.6.20 vanilla kernel I'd be very interested in a full trace so that I can try to fix this. With the combination of an old kernel and only part of the crash trace, there isn't much I can do with this report.
Re: Manually hacking superblocks
Lasse Kärkkäinen wrote: > I managed to mess up a RAID-5 array by mdadm -adding a few failed disks > back, trying to get the array running again. Unfortunately, -add didn't > do what I expected, but instead made spares out of the failed disks. The > disks failed due to loose SATA cabling and the data inside should be > fairly consistent. sdh failed a bit earlier than sdd and sde, so I > expect to be able to recover by building a degraded array without sdh > and then syncing. > > The current situation looks like this:
> Number Major Minor RaidDevice State
>    0      0     8     33       0     active sync /dev/sdc1
>    1      1     0      0       1     faulty removed
>    2      2     8     97       2     active sync /dev/sdg1
>    3      3     8    129       3     active sync /dev/sdi1
>    4      4     0      0       4     faulty removed
>    5      5     8     81       5     active sync /dev/sdf1
>    6      6     0      0       6     faulty removed
>    7      7     8    177       7     spare
>    8      8     8    161       8     spare
>    9      9     8    145       9     spare
>
> ... and before any of this happened, the configuration was:
> disk 0, o:1, dev:sdc1
> disk 1, o:1, dev:sde1
> disk 2, o:1, dev:sdg1
> disk 3, o:1, dev:sdi1
> disk 4, o:1, dev:sdh1
> disk 5, o:1, dev:sdf1
> disk 6, o:1, dev:sdd1
>
> I gather that I need a way to alter the superblocks of sde and sdd so > that they seem to be clean up-to-date disks, with their original disk > numbers 1 and 6. A hex editor comes to mind, but are there any better > tools for that? You don't need a tool. mdadm --force will do what you want. Read the archives and the man page. You are correct to assemble the array with a missing disk (or 2 missing disks for RAID6) - this prevents the kernel from trying to sync. Not syncing is good because if you do make a slight error in the order, you can end up syncing bad data over good. I *THINK* you should try something like (untested): mdadm --create /dev/md0 --force -l5 -n7 /dev/sdc1 /dev/sde1 /dev/sdg1 /dev/sdi1 missing /dev/sdf1 /dev/sdd1 The order is important and should match the original order.
There's more you could do by looking at device event counts (--examine). Also, you must do a READ-ONLY mount the first time you mount the array - this will check the consistency and avoid corruption if you get the order wrong. I really must get around to setting up a test environment so I can check this out and update the wiki... I have to go out for a couple of hours. Let me know how it goes if you can't wait for me to get back. David
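The event-count check mentioned above can be sketched as follows. This is a toy Python illustration with invented field names; the counts themselves would come from `mdadm --examine` output:

```python
# Each md member superblock carries an event count that is bumped on
# array state changes; a member kicked out earlier stops counting and
# falls behind. Comparing counts identifies stale members before a
# forced assemble. Records here are invented for illustration.

def split_fresh_stale(members):
    """Partition member records into (fresh, stale) by event count."""
    newest = max(m["events"] for m in members)
    fresh = [m for m in members if m["events"] == newest]
    stale = [m for m in members if m["events"] < newest]
    return fresh, stale

members = [
    {"device": "sdc1", "events": 115909230},
    {"device": "sde1", "events": 115909230},
    {"device": "sdd1", "events": 115909104},   # failed earlier: lower count
]
fresh, stale = split_fresh_stale(members)
print([m["device"] for m in stale])   # → ['sdd1']
```

The furthest-behind member is the one to leave out (marked "missing") when forcing a degraded assemble.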
Re: Partitioned arrays initially missing from /proc/partitions
Hi Neil I think this is a bug. Essentially if I create an auto=part md device then I get md_d0p? partitions. If I stop the array and just re-assemble, I don't. It looks like the same (?) problem as Mike (see below - Mike do you have a patch?) but I'm on 2.6.20.7 with mdadm v2.5.6. FWIW I upgraded from 2.6.16 where it worked (but used in-kernel detection, which isn't working in 2.6.20 for some reason, but I don't mind). Here's a simple sequence of commands:

teak:~# mdadm --stop /dev/md_d0
mdadm: stopped /dev/md_d0
teak:~# mdadm --create /dev/md_d0 -l5 -n5 --bitmap=internal -e1.2 --auto=part --name media --force /dev/sde1 /dev/sdc1 /dev/sdd1 missing /dev/sdf1
mdadm: /dev/sde1 appears to be part of a raid array: level=raid5 devices=5 ctime=Mon Apr 23 15:02:13 2007
mdadm: /dev/sdc1 appears to be part of a raid array: level=raid5 devices=5 ctime=Mon Apr 23 15:02:13 2007
mdadm: /dev/sdd1 appears to be part of a raid array: level=raid5 devices=5 ctime=Mon Apr 23 15:02:13 2007
mdadm: /dev/sdf1 appears to be part of a raid array: level=raid5 devices=5 ctime=Mon Apr 23 15:02:13 2007
Continue creating array? y
mdadm: array /dev/md_d0 started.
teak:~# grep md /proc/partitions
 254     0 1250241792 md_d0
 254     1 1250144138 md_d0p1
 254     2      97652 md_d0p2
teak:~# mdadm --stop /dev/md_d0
mdadm: stopped /dev/md_d0
teak:~# mdadm --assemble /dev/md_d0 --auto=part /dev/sde1 /dev/sdc1 /dev/sdd1 /dev/sdf1
mdadm: /dev/md_d0 has been started with 4 drives (out of 5).
teak:~# grep md /proc/partitions
 254     0 1250241792 md_d0

If I then run cfdisk it finds the partition table. I write this and get:

teak:~# cfdisk /dev/md_d0
Disk has been changed.
WARNING: If you have created or modified any DOS 6.x partitions, please see the cfdisk manual page for additional information.
teak:~# grep md /proc/partitions
 254     0 1250241792 md_d0
 254     1 1250144138 md_d0p1
 254     2      97652 md_d0p2

and the syslog:

Apr 23 15:13:13 localhost kernel: md: md_d0 stopped.
Apr 23 15:13:13 localhost kernel: md: unbind<sde1>
Apr 23 15:13:13 localhost kernel: md: export_rdev(sde1)
Apr 23 15:13:13 localhost kernel: md: unbind<sdf1>
Apr 23 15:13:13 localhost kernel: md: export_rdev(sdf1)
Apr 23 15:13:13 localhost kernel: md: unbind<sdd1>
Apr 23 15:13:13 localhost kernel: md: export_rdev(sdd1)
Apr 23 15:13:13 localhost kernel: md: unbind<sdc1>
Apr 23 15:13:13 localhost kernel: md: export_rdev(sdc1)
Apr 23 15:13:13 localhost mdadm: DeviceDisappeared event detected on md device /dev/md_d0
Apr 23 15:13:36 localhost kernel: md: bind
Apr 23 15:13:36 localhost kernel: md: bind
Apr 23 15:13:36 localhost kernel: md: bind
Apr 23 15:13:36 localhost kernel: md: bind
Apr 23 15:13:36 localhost kernel: raid5: device sdf1 operational as raid disk 4
Apr 23 15:13:36 localhost kernel: raid5: device sdd1 operational as raid disk 2
Apr 23 15:13:36 localhost kernel: raid5: device sdc1 operational as raid disk 1
Apr 23 15:13:36 localhost kernel: raid5: device sde1 operational as raid disk 0
Apr 23 15:13:36 localhost kernel: raid5: allocated 5236kB for md_d0
Apr 23 15:13:36 localhost kernel: raid5: raid level 5 set md_d0 active with 4 out of 5 devices, algorithm 2
Apr 23 15:13:36 localhost kernel: RAID5 conf printout:
Apr 23 15:13:36 localhost kernel: --- rd:5 wd:4
Apr 23 15:13:36 localhost kernel: disk 0, o:1, dev:sde1
Apr 23 15:13:36 localhost kernel: disk 1, o:1, dev:sdc1
Apr 23 15:13:36 localhost kernel: disk 2, o:1, dev:sdd1
Apr 23 15:13:36 localhost kernel: disk 4, o:1, dev:sdf1
Apr 23 15:13:36 localhost kernel: md_d0: bitmap initialized from disk: read 1/1 pages, set 19078 bits, status: 0
Apr 23 15:13:36 localhost kernel: created bitmap (10 pages) for device md_d0
Apr 23 15:13:36 localhost kernel: md_d0: p1 p2
Apr 23 15:13:54 localhost kernel: md: md_d0 stopped.
Apr 23 15:13:54 localhost kernel: md: unbind<sdf1>
Apr 23 15:13:54 localhost kernel: md: export_rdev(sdf1)
Apr 23 15:13:54 localhost kernel: md: unbind<sdd1>
Apr 23 15:13:54 localhost kernel: md: export_rdev(sdd1)
Apr 23 15:13:54 localhost kernel: md: unbind<sdc1>
Apr 23 15:13:54 localhost kernel: md: export_rdev(sdc1)
Apr 23 15:13:54 localhost kernel: md: unbind<sde1>
Apr 23 15:13:54 localhost kernel: md: export_rdev(sde1)
Apr 23 15:13:54 localhost mdadm: DeviceDisappeared event detected on md device /dev/md_d0
Apr 23 15:14:04 localhost kernel: md: md_d0 stopped.
Apr 23 15:14:04 localhost kernel: md: bind
Apr 23 15:14:04 localhost kernel: md: bind
Apr 23 15:14:04 localhost kernel: md: bind
Apr 23 15:14:04 localhost kernel: md: bind
Apr 23 15:14:04 localhost kernel: raid5: device sde1 operational as raid disk 0
Apr 23 15:14:04 localhost kernel: raid5: device sdf1 operational as raid disk 4
Apr 23 15:14:04 localhost kernel: raid5: device sdd1 operational as raid disk 2
Apr 23 15:14:04 localhost kernel: raid5: device sdc1 operational as raid disk 1
Apr 23 15:14:04 localhost kernel: raid5: allocated 5236kB for md_d0
Apr 23 15:14:04 localhost kernel: raid5: raid level 5 set md_d0 active with 4 out of 5 devices, a
Re: Multiple disk failure, but slot numbers are corrupt and preventing assembly.
There is some odd stuff in there:

/dev/sda1:  Active Devices : 4   Working Devices : 4   Failed Devices : 0   Events : 0.115909229
/dev/sdb1:  Active Devices : 5   Working Devices : 4   Failed Devices : 1   Events : 0.115909230
/dev/sdc1:  Active Devices : 8   Working Devices : 8   Failed Devices : 1   Events : 0.115909230
/dev/sdd1:  Active Devices : 4   Working Devices : 4   Failed Devices : 0   Events : 0.115909230

but your event counts are consistent. It looks like corruption on 2 disks :( Or did you try some things? I think you'll need to recreate the array since assemble can't figure things out. Since you mention SMART errors on /dev/sdb you are taking a big chance by trying to start up the array with a known faulty disk - especially if you resync, as that is a very IO-intensive operation that will read every sector of the bad disk and is likely to trigger errors that will kick it again, leaving you back where you started (or worse). If you are desperate for data recovery and you have the space then you should take disk images using ddrescue *before* trying anything. Next best, if you are buying new disks and can wait for them to arrive, do so. You can then use ddrescue to copy the old disks to the new ones and work with non-broken hardware. If you have no choice, then be careful: from this point forward it will be very easy to mess up. Once you have disks to work on you can try to recreate the array. You were using 0.9 superblocks, 64k chunks, left-symmetric, which are the defaults. You should re-create in degraded mode to prevent the sync from starting (if you got the order wrong then it would get the parity calc wrong). So: mdadm --create /dev/md0 --force -l5 -n4 /dev/sda1 /dev/sdb1 missing /dev/sdd1 Then do a *readonly* fsck on /dev/md0. If it works you can try a backup or a full fsck. Ask if anything isn't clear. David PS I recovered from a 2-disk failure last night. Seems to be back up and re-syncing :) Glad I had a spare disk around! Leon Woestenberg wrote: > Hello, > > it's recovery time again.
Problem at hand: raid5 consisting of four > partitions, each on a drive. Two disks have failed. Assembly fails > because the slot numbers of the array components seem to be corrupt. > > /dev/md0 consisting of /dev/sd[abcd]1, of which b,c failed and of > which c seems really bad in SMART, b looks reasonably OK judging from > SMART. > > Checksum of the failed component superblocks was bad. > > Using mdadm.conf we have already tried updating the superblocks. This > partly succeeded in the sense that checksums came up ok, the slot > numbers did not. > > mdadm refuses to assemble, even with --force. > > Could you guys peek over the array configuration (mdadm --examine) and > see if there is a non-destructive way to try and mount the array. If > not, what is the least intrusive way to do a non-syncing (re)create? > > Data recovery is our prime concern here. > > Below the uname -a, --examine output of all four drives, mdadm.conf of > what we think the array should look like and finally, the mdadm > --assemble command and output. > > Note the slot numbers on /dev/sd[bc]. 
> > Thanks for any help, > > with kind regards, > > Leon Woestenberg > > > >
> Linux localhost 2.6.16.14-axon1 #1 SMP PREEMPT Mon May 8 17:01:33 CEST 2006 i486 pentium4 i386 GNU/Linux
>
> [EMAIL PROTECTED] ~]# mdadm --examine /dev/sda1
> /dev/sda1:
>           Magic : a92b4efc
>         Version : 00.90.00
>            UUID : 51a95144:00af4c77:c1cd173b:94cb1446
>   Creation Time : Mon Sep  5 13:16:42 2005
>      Raid Level : raid5
>     Device Size : 390620352 (372.52 GiB 400.00 GB)
>    Raid Devices : 4
>   Total Devices : 4
> Preferred Minor : 0
>
>     Update Time : Tue Apr 17 07:03:46 2007
>           State : active
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
>        Checksum : f98ed71b - correct
>          Events : 0.115909229
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>     Number   Major   Minor   RaidDevice State
> this   0       8        1        0      active sync /dev/sda1
>
>        0       8        1        0      active sync /dev/sda1
>        1       8       17        1      active sync /dev/sdb1
>        2       8       33        2      active sync /dev/sdc1
>        3       8       49        3      active sync /dev/sdd1
> [EMAIL PROTECTED] ~]# mdadm --examine /dev/sdb1
> /dev/sdb1:
>           Magic : a92b4efc
>         Version : 00.90.00
>            UUID : 51a95144:00af4c77:c1cd173b:94cb1446
>   Creation Time : Mon Sep  5 13:16:42 2005
>      Raid Level : raid5
>     Device Size : 390620352 (372.52 GiB 400.00 GB)
>    Raid Devices : 4
>   Total Devices : 5
> Preferred Minor : 0
>
>     Update T
Re: Multiple disk failure, but slot numbers are corrupt and preventing assembly.
Leon Woestenberg wrote: > On 4/24/07, Leon Woestenberg <[EMAIL PROTECTED]> wrote: >> Hello, >> >> On 4/23/07, David Greaves <[EMAIL PROTECTED]> wrote: >> > There is some odd stuff in there: >> > >> [EMAIL PROTECTED] ~]# mdadm -v --assemble --scan >> --config=/tmp/mdadm.conf --force >> [...] >> mdadm: no uptodate device for slot 1 of /dev/md0 >> mdadm: no uptodate device for slot 2 of /dev/md0 >> [...] >> > So, the problem I am facing is that the slot number (as seen with > --examine) is invalid on two and therefore they won't be recognized as > valid drives for the array. > > Is there any way to override the slot number? I could not find > anything in mdadm or mdadm.conf to override them. Yes --create, see my original reply. Essentially all --create does is create superblocks with the data you want (eg slot numbers). It does not touch other 'on disk data'. It is safe to run the *exact same* create command on a dormant array at any time after initial creation - the main side effect is a new UUID. (Neil - yell if I'm wrong). The most 'dangerous' part is to create a superblock with a different version. If you wanted to experiment (maybe with loopback devices) you could try --create'ing an array with 4 devices to simulate where you were. Then do the --create again with 2 devices missing. This should end up with 2 devices with one UUID, 2 with another. Then do an --assemble using --force and --update=uuid. Report back if you do this... Cheers David
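The suggested loopback experiment can also be dry-run as a toy model. The Python below only mimics the superblock bookkeeping described above (--create stamps a fresh UUID and slot numbers; --assemble --force --update=uuid rewrites every member to one UUID); it is a sketch, not mdadm:

```python
import uuid

def create(devices):
    """Simulate --create: stamp a fresh array UUID and slot numbers."""
    u = uuid.uuid4().hex
    for slot, dev in enumerate(devices):
        dev.update(uuid=u, slot=slot)

def assemble_update_uuid(devices):
    """Simulate --assemble --force --update=uuid: unify on one new UUID."""
    u = uuid.uuid4().hex
    for dev in devices:
        dev["uuid"] = u

devs = [{} for _ in range(4)]
create(devs)             # original 4-device array: one UUID everywhere
create(devs[:2])         # re-create with two devices "missing"
print(len({d["uuid"] for d in devs}))   # → 2 (two members per UUID)
assemble_update_uuid(devs)
print(len({d["uuid"] for d in devs}))   # → 1
```

This matches the predicted end state: two UUID groups after the partial re-create, one again after the --update=uuid assemble.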
Re: Partitioned arrays initially missing from /proc/partitions
Neil Brown wrote: > This problem is very hard to solve inside the kernel. > The partitions will not be visible until the array is opened *after* > it has been created. Making the partitions visible before that would > be possible, but would not be easy. > > I think the best solution is Mike's solution which is to simply > open/close the array after it has been assembled. I will make sure > this is in the next release of mdadm. > > Note that you can still access the partitions even though they do not > appear in /proc/partitions. Any attempt to access any of them will > make them all appear in /proc/partitions. But I understand there is > sometimes value in seeing them before accessing them. > > NeilBrown Um. Are you sure? The reason I noticed is that I couldn't mount them until they appeared; see these cut'n'pastes from my terminal history:

teak:~# mount /media/
mount: /dev/md_d0p1 is not a valid block device
teak:~# mount /dev/md_d0p1 /media
mount: you must specify the filesystem type
teak:~# xfs_repair -ln /dev/md_d0p2 /dev/md_d0p1
Usage: xfs_repair [-nLvV] [-o subopt[=value]] [-l logdev] [-r rtdev] devname
teak:~# ll /dev/md*
brw-rw---- 1 root disk 254, 0 2007-04-23 15:44 /dev/md_d0
brw-rw---- 1 root disk 254, 1 2007-04-23 14:46 /dev/md_d0p1
brw-rw---- 1 root disk 254, 2 2007-04-23 14:46 /dev/md_d0p2
brw-rw---- 1 root disk 254, 3 2007-04-23 15:44 /dev/md_d0p3
brw-rw---- 1 root disk 254, 4 2007-04-23 15:44 /dev/md_d0p4
/dev/md:
total 0
teak:~# /etc/init.d/mdadm-raid stop
Stopping MD array md_d0...done (stopped).
teak:~# /etc/init.d/mdadm-raid start
Assembling MD array md_d0...done (degraded [4/5]).
Generating udev events for MD arrays...done.
teak:~# cfdisk /dev/md_d0
teak:~# mount /dev/md_d0p1
mount: /dev/md_d0p1 is not a valid block device

and so on... Notice the cfdisk command above. I did this to check the on-array table (it was good). I assume cfdisk opens the array - but the partitions were still not there afterwards. I did not do a 'Write' from in cfdisk this time.
I wouldn't be so concerned at a cosmetic thing in /proc/partitions - the problem is that I can't mount my array after doing an assemble and I have to --create each time - not the nicest solution. Oh, I'm using udev FWIW. David
Re: Partitioned arrays initially missing from /proc/partitions
Mike Accetta wrote: > David Greaves writes: > > ... >> It looks like the same (?) problem as Mike (see below - Mike do you have a >> patch?) but I'm on 2.6.20.7 with mdadm v2.5.6 > ... > > We have since started assembling the array from the initrd using > --homehost and --auto-update-homehost which takes a different path through > the code, and in this path the kernel figures out there are partitions > on the array before mdadm exits. Just tried that - doesn't work :) > For the previous code path, we had been running with the patch I described > in my original post which I've included below. I'd guess that the bug > is actually in the kernel code and I looked at it briefly but couldn't > figure out how things all fit together well enough to come up with a > patch there. The user level patch is a bit of a hack and there may be > other code paths that also need a similar patch. I only made this patch > in the assembly code path we were executing at the time.
>
> BUILD/mdadm/mdadm.c#2 (text) - BUILD/mdadm/mdadm.c#3 (text) content
> @@ -983,6 +983,10 @@
>                                 NULL,
>                                 readonly, runstop, NULL, verbose-quiet, force);
>                         close(mdfd);
> +                       mdfd = open(array_list->devname, O_RDONLY);
> +                       if (mdfd >= 0) {
> +                               close(mdfd);
> +                       }
>                 }
>         }
>         break;

Thanks Mike But this doesn't work for me either :( I changed array_list to devlist in line with 2.6.9 and it compiles and runs OK.

teak:~# mdadm --stop /dev/md_d0
mdadm: stopped /dev/md_d0
teak:~# /everything/devel/mdadm/mdadm-2.5.6/mdadm --assemble /dev/md_d0 /dev/sd[bcdef]1
mdadm: With Fudge.
mdadm: /dev/md_d0 has been started with 5 drives.
mdadm: Fudging partition creation.
teak:~# mount /media
mount: /dev/md_d0p1 is not a valid block device
teak:~#

I also wrote a small c program to call the RAID_AUTORUN ioctl - that didn't work either because I'd compiled RAID as a module so the ioctl isn't defined. currently recompiling the kernel to allow autorun...
David
Re: Partitioned arrays initially missing from /proc/partitions
David Greaves wrote: > currently recompiling the kernel to allow autorun... Which of course won't work because I'm on 1.2 superblocks:

md: Autodetecting RAID arrays.
md: invalid raid superblock magic on sdb1
md: sdb1 has invalid sb, not importing!
md: invalid raid superblock magic on sdc1
md: sdc1 has invalid sb, not importing!
md: invalid raid superblock magic on sdd1
md: sdd1 has invalid sb, not importing!
md: invalid raid superblock magic on sde1
md: sde1 has invalid sb, not importing!
md: invalid raid superblock magic on sdf1
md: sdf1 has invalid sb, not importing!
md: autorun ...
md: ... autorun DONE.

David PS Dropped Mike from cc since I doubt he's too interested :)
Re: Partitioned arrays initially missing from /proc/partitions
Neil Brown wrote: > This problem is very hard to solve inside the kernel. > The partitions will not be visible until the array is opened *after* > it has been created. Making the partitions visible before that would > be possible, but would not be easy. > > I think the best solution is Mike's solution which is to simply > open/close the array after it has been assembled. I will make sure > this is in the next release of mdadm. > > Note that you can still access the partitions even though they do not > appear in /proc/partitions. Any attempt to access any of them will > make them all appear in /proc/partitions. But I understand there is > sometimes value in seeing them before accessing them. > > NeilBrown For anyone else who is in this boat and doesn't fancy finding somewhere in mdadm to hack, here's a simple program that issues the BLKRRPART ioctl. This re-reads the block device partition table and 'works for me'. I think partx -a would do the same job but for some reason partx isn't in util-linux for Debian... Neil, isn't it easy to just do this after an assemble? David

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char *argv[])
{
	int fd;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <device>\n", argv[0]);
		return -1;
	}
	if ((fd = open(argv[1], O_RDONLY)) == -1) {
		fprintf(stderr, "Can't open md device %s\n", argv[1]);
		return -1;
	}
	if (ioctl(fd, BLKRRPART, NULL) != 0) {
		fprintf(stderr, "ioctl failed\n");
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}
Re: Partitioned arrays initially missing from /proc/partitions
Neil Brown wrote: > On Tuesday April 24, [EMAIL PROTECTED] wrote: >> Neil Brown wrote: >>> This problem is very hard to solve inside the kernel. >>> The partitions will not be visible until the array is opened *after* >>> it has been created. Making the partitions visible before that would >>> be possible, but would not be easy. >>> >>> I think the best solution is Mike's solution which is to simply >>> open/close the array after it has been assembled. I will make sure >>> this is in the next release of mdadm. >>> >>> Note that you can still access the partitions even though they do not >>> appear in /proc/partitions. Any attempt to access any of them will >>> make them all appear in /proc/partitions. But I understand there is >>> sometimes value in seeing them before accessing them. >>> >>> NeilBrown >> Um. Are you sure? > > "Works for me". Lucky you ;) > What happens if you > blockdev --rereadpt /dev/md_d0 > ?? It probably works then. Well, that's probably the same as my BLKRRPART ioctl so I guess yes. [confirmed - yes, but blockdev seems to do it twice - I get 2 kernel messages] > It sounds like someone is deliberately removing all the partition > info. Gremlins? > Can you try this patch and see if it reports anyone calling > '2' on md_d0 ?? Nope, not being called at all.

teak:~# mdadm --assemble /dev/md_d0 --auto=parts /dev/sd[bcdef]1
mdadm: /dev/md_d0 has been started with 5 drives.
dmesg:

md: bind
md: bind
md: bind
md: bind
md: bind
raid5: device sde1 operational as raid disk 0
raid5: device sdf1 operational as raid disk 4
raid5: device sdb1 operational as raid disk 3
raid5: device sdd1 operational as raid disk 2
raid5: device sdc1 operational as raid disk 1
raid5: allocated 5236kB for md_d0
raid5: raid level 5 set md_d0 active with 5 out of 5 devices, algorithm 2
RAID5 conf printout:
--- rd:5 wd:5
disk 0, o:1, dev:sde1
disk 1, o:1, dev:sdc1
disk 2, o:1, dev:sdd1
disk 3, o:1, dev:sdb1
disk 4, o:1, dev:sdf1
md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0
created bitmap (10 pages) for device md_d0

teak:~# mount /media
mount: special device /dev/md_d0p1 does not exist

no dmesg

teak:~# blockdev --rereadpt /dev/md_d0

dmesg:
md_d0: p1 p2
md_d0: p1 p2

did I mention 2.6.20.7 and mdadm v2.5.6 and udev? I'd be happy if I've done something wrong... anyway, more config data...

teak:~# mdadm --detail /dev/md_d0
/dev/md_d0:
        Version : 01.02.03
  Creation Time : Mon Apr 23 15:13:35 2007
     Raid Level : raid5
     Array Size : 1250241792 (1192.32 GiB 1280.25 GB)
    Device Size : 625120896 (298.08 GiB 320.06 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Tue Apr 24 12:49:26 2007
          State : active
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
           Name : media
           UUID : f7835ba6:e38b6feb:c0cd2e2d:3079db59
         Events : 25292

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync /dev/sde1
       1       8       33        1      active sync /dev/sdc1
       2       8       49        2      active sync /dev/sdd1
       5       8       17        3      active sync /dev/sdb1
       4       8       81        4      active sync /dev/sdf1

teak:~# cat /etc/mdadm/mdadm.conf
DEVICE partitions
ARRAY /dev/md_d0 auto=part level=raid5 num-devices=5 UUID=f7835ba6:e38b6feb:c0cd2e2d:3079db59
MAILADDR [EMAIL PROTECTED]

David
majordomo info at http://vger.kernel.org/majordomo-info.html
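[Editorial note: `blockdev --rereadpt` and the BLKRRPART ioctl mentioned above are the same kernel request. A minimal sketch in Python, assuming the BLKRRPART value 0x125f (i.e. _IO(0x12, 95) from linux/fs.h); `reread_partitions` is a hypothetical helper name, not part of any tool discussed in the thread:]

```python
import fcntl
import os

# Assumption: BLKRRPART = _IO(0x12, 95) = 0x125f, per <linux/fs.h>.
BLKRRPART = 0x125f

def reread_partitions(device):
    """Ask the kernel to re-read a block device's partition table -
    roughly what one invocation of `blockdev --rereadpt` does."""
    fd = os.open(device, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKRRPART)  # no argument; kernel rescans partitions
    finally:
        os.close(fd)
```

Running `reread_partitions('/dev/md_d0')` needs root and a real block device; on a non-block device the ioctl is rejected with ENOTTY, which is a cheap way to see the call path working.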
Re: Partitioned arrays initially missing from /proc/partitions
Neil Brown wrote:
> On Tuesday April 24, [EMAIL PROTECTED] wrote:
>> Neil, isn't it easy to just do this after an assemble?
>
> Yes, but it should not be needed, and I'd like to understand why it
> is.
> One of the last things do_md_run does is
>    mddev->changed = 1;
>
> When you next open /dev/md_d0, md_open is called which calls
> check_disk_change().
> This will call into md_fops->md_media_changed which will return the
> value of mddev->changed, which will be '1'.
> So check_disk_change will then call md_fops->revalidate_disk which
> will set mddev->changed to 0, and will then set bd_invalidated to 1
> (as bd_disk->minors > 1 (being 64)).
>
> md_open will then return into do_open (in fs/block_dev.c) and because
> bd_invalidated is true, it will call rescan_partitions and the
> partitions will appear.
>
> Hmmm... there is room for a race there. If some other process opens
> /dev/md_d0 before mdadm gets to close it, it will call
> rescan_partitions before first calling bd_set_size to update the size
> of the bdev. So when we try to read the partition table, it will
> appear to be reading past the EOF, and will not actually read
> anything..
>
> I guess udev must be opening the block device at exactly the wrong
> time.
>
> I can simulate this by holding /dev/md_d0 open while assembling the
> array. If I do that, the partitions don't get created.
> Yuck.
>
> Maybe I could call bd_set_size in md_open before calling
> check_disk_change..
>
> Yep, this patch seems to fix it. Could you confirm?

almost...

teak:~# mdadm --assemble /dev/md_d0 --auto=parts /dev/sd[bcdef]1
mdadm: /dev/md_d0 has been started with 5 drives.
teak:~# mount /media
teak:~# umount /media
teak:~# mdadm --stop /dev/md_d0
mdadm: stopped /dev/md_d0
teak:~# mdadm --assemble /dev/md_d0 --auto=parts /dev/sd[bcdef]1
mdadm: /dev/md_d0 has been started with 5 drives.
teak:~# mount /media
mount: No such file or directory
teak:~# mount /media
teak:~#
(second mount succeeds second time around)

md: md_d0 stopped.
md: bind
md: bind
md: bind
md: bind
md: bind
raid5: device sde1 operational as raid disk 0
raid5: device sdf1 operational as raid disk 4
raid5: device sdb1 operational as raid disk 3
raid5: device sdd1 operational as raid disk 2
raid5: device sdc1 operational as raid disk 1
raid5: allocated 5236kB for md_d0
raid5: raid level 5 set md_d0 active with 5 out of 5 devices, algorithm 2
RAID5 conf printout:
 --- rd:5 wd:5
 disk 0, o:1, dev:sde1
 disk 1, o:1, dev:sdc1
 disk 2, o:1, dev:sdd1
 disk 3, o:1, dev:sdb1
 disk 4, o:1, dev:sdf1
md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0
created bitmap (10 pages) for device md_d0
md_d0: p1 p2
Filesystem "md_d0p1": Disabling barriers, not supported with external log device
XFS mounting filesystem md_d0p1
Ending clean XFS mount for filesystem: md_d0p1
md: md_d0 stopped.
md: unbind
md: export_rdev(sde1)
md: unbind
md: export_rdev(sdf1)
md: unbind
md: export_rdev(sdb1)
md: unbind
md: export_rdev(sdd1)
md: unbind
md: export_rdev(sdc1)
md: md_d0 stopped.
md: bind
md: bind
md: bind
md: bind
md: bind
raid5: device sde1 operational as raid disk 0
raid5: device sdf1 operational as raid disk 4
raid5: device sdb1 operational as raid disk 3
raid5: device sdd1 operational as raid disk 2
raid5: device sdc1 operational as raid disk 1
raid5: allocated 5236kB for md_d0
raid5: raid level 5 set md_d0 active with 5 out of 5 devices, algorithm 2
RAID5 conf printout:
 --- rd:5 wd:5
 disk 0, o:1, dev:sde1
 disk 1, o:1, dev:sdc1
 disk 2, o:1, dev:sdd1
 disk 3, o:1, dev:sdb1
 disk 4, o:1, dev:sdf1
md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0
created bitmap (10 pages) for device md_d0
md_d0: p1 p2
XFS: Invalid device [/dev/md_d0p2], error=-2
Filesystem "md_d0p1": Disabling barriers, not supported with external log device
XFS mounting filesystem md_d0p1
Ending clean XFS mount for filesystem: md_d0p1
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Multiple disk failure, but slot numbers are corrupt and preventing assembly.
Leon Woestenberg wrote: > David, > > thanks for all the advice so far. No problem :) > In first instance we were searching for ways to tell mdadm what we > know about the array (through mdadm.conf) but from all advice we got > we have to take the 'usual' non-syncing-recreate approach. > > We will try to make disk clones first. Will dd suffice or do I need > something more fancy that maybe copes with source drive read errors in > a better fashion? ddrescue and dd_rescue are *much* better. I favour the gnu ddrescue - it's much easier. But sometimes, on some kernels with some hardware I've had kernel locks that dd_rescue (eventually, after many minutes) times out from. The RIP iso is a good place to start. http://www.tux.org/pub/people/kent-robotti/looplinux/rip/ David - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Multiple disk failure, but slot numbers are corrupt and preventing assembly.
Bill Davidsen wrote:
> Leon Woestenberg wrote:
>> We will try to make disk clones first. Will dd suffice or do I need
>> something more fancy that maybe copes with source drive read errors in
>> a better fashion?
>
> Yes to both. dd will be fine in most cases, and I suggest using noerror
> to continue after errors, and oflag=direct just for performance. You
> could use ddrescue, it supposedly copes better with errors, although I
> don't know details.

Hi Bill

IIRC dd will continue to operate on error, but it just retries a set
number of times and then continues to the next sector.

ddrescue does clever things like jumping a few sectors ahead when it
hits an error; then, after it has retrieved as much data off the disk as
possible, it starts to bisect the error areas.

This type of algorithm is good because disks often die a few minutes
after being powered up, and deteriorate rapidly as data is recovered.
Hammering a drive with retries on an error at sector 312, 313, 314 when
you have millions of (currently) readable sectors elsewhere is a bad
idea: it eats into what little time you have left to read your data.

It's also very fast, continuously reports its position and its current
and average I/O rates on screen, and writes a log of what it has done to
a local file so an interrupted run can be restarted. And so on.

Try it - IMHO it's the right tool for the (data-recovery) job :)

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
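[Editorial note: the skip-ahead-then-revisit behaviour described above can be modelled in a few lines. This is a toy sketch, not ddrescue's actual algorithm - real ddrescue bisects bad regions, persists a mapfile, and works on byte ranges - and all names here are made up:]

```python
def rescue(read_sector, total_sectors, skip=64):
    """Toy model of ddrescue's strategy: sweep the disk fast, jumping
    `skip` sectors past any read error instead of hammering the dying
    area, then come back and probe the skipped regions sector by sector."""
    good, bad, gaps = {}, set(), []
    pos = 0
    while pos < total_sectors:          # phase 1: fast sweep
        try:
            good[pos] = read_sector(pos)
            pos += 1
        except IOError:
            end = min(pos + skip, total_sectors)
            gaps.append((pos, end))     # remember the region we skipped
            pos = end                   # jump ahead; retry later
    for start, end in gaps:             # phase 2: trim the skipped regions
        for sector in range(start, end):
            if sector in good:
                continue
            try:
                good[sector] = read_sector(sector)
            except IOError:
                bad.add(sector)         # genuinely unreadable
    return good, bad
```

The key property is the one argued for in the post: most of the readable surface is copied before any time is spent retrying near the defect.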
Re: RAID rebuild on Create
Jan Engelhardt wrote:
> Hi list,
>
> when a user does `mdadm -C /dev/md0 -l -n `, the array gets
> rebuilt for at least RAID1 and RAID5, even if the disk contents are
> most likely not of importance (otherwise we would not be creating a
> raid array right now). Could not this needless resync be skipped -
> what do you think?
>
> Jan

This is an FAQ - and I'll put it there RSN :)

Here's one answer from Neil from the archives (google "avoiding the
initial resync on --create"):

  Otherwise I agree. There is no real need to perform the sync of a
  raid1 at creation. However it seems to be a good idea to regularly
  'check' an array to make sure that all blocks on all disks get read
  to find sleeping bad blocks early. If you didn't sync first, then
  every check will find lots of errors. Of course you could 'repair'
  instead of 'check'. Or do that once. Or something.

  For raid6 it is also safe to not sync first, though with the same
  caveat as raid1. Raid6 always updates parity by reading all blocks in
  the stripe that aren't known and calculating P and Q. So the first
  write to a stripe will make P and Q correct for that stripe. This is
  current behaviour. I don't think I can guarantee it will never
  change.

  For raid5 it is NOT safe to skip the initial sync. It is possible for
  all updates to be "read-modify-write" updates which assume the parity
  is correct. If it is wrong, it stays wrong. Then when you lose a
  drive, the parity blocks are wrong, so the data you recover using
  them is wrong.

  In summary, it is safe to use --assume-clean on a raid1 or raid10,
  though I would recommend a "repair" before too long. For other raid
  levels it is best avoided.
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
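[Editorial note: Neil's raid5 point is easy to demonstrate numerically. A read-modify-write parity update only XORs the old data out and the new data in, so parity that started out wrong stays wrong. A toy sketch with integers standing in for data blocks (function names are made up for illustration):]

```python
def full_parity(blocks):
    """Parity as a full-stripe write computes it: XOR of all data blocks."""
    p = 0
    for b in blocks:
        p ^= b
    return p

def rmw_parity(old_parity, old_block, new_block):
    """Read-modify-write update: XOR the old data out, XOR the new data in.
    Only correct if old_parity was already correct."""
    return old_parity ^ old_block ^ new_block

stripe = [3, 5, 9]              # three toy data blocks
good = full_parity(stripe)      # what an initial sync would have written
bogus = 0                       # unsynced parity left over from --create

# Update block 0 from 3 to 7 via read-modify-write in both cases:
print(rmw_parity(good, 3, 7) == full_parity([7, 5, 9]))   # True: stays right
print(rmw_parity(bogus, 3, 7) == full_parity([7, 5, 9]))  # False: stays wrong
```

Reconstructing a lost block from the second parity would return garbage, which is exactly why --assume-clean is unsafe on raid5.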
Re: raid10 on centos 5
Ruslan Sivak wrote: > So a custom kernel is needed? Is there a way to do a kickstart install > with the new kernel? Or better yet, put it on the install cd? have you tried: modprobe raid10 ? David - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Partitioned arrays initially missing from /proc/partitions
Hi Neil Just wondering what the status is here - do you need any more from me or is it on your stack? The patch helped but didn't cure. After a clean boot it mounted correctly first try. Then I unmounted, stopped and re-assembled the array. The next mount failed. The subsequent mount succeeded. How do other block devices initialise their partitions on 'discovery'? David David Greaves wrote: > Neil Brown wrote: >> On Tuesday April 24, [EMAIL PROTECTED] wrote: >>> Neil, isn't it easy to just do this after an assemble? >> Yes, but it should not be needed, and I'd like to understand why it >> is. >> One of the last things do_md_run does is >>mddev->changed = 1; >> >> When you next open /dev/md_d0, md_open is called which calls >> check_disk_change(). >> This will call into md_fops->md_media_changed which will return the >> value of mddev->changed, which will be '1'. >> So check_disk_change will then call md_fops->revalidate_disk which >> will set mddev->changed to 0, and will then set bd_invalidated to 1 >> (as bd_disk->minors > 1 (being 64)). >> >> md_open will then return into do_open (in fs/block_dev.c) and because >> bd_invalidated is true, it will call rescan_partitions and the >> partitions will appear. >> >> Hmmm... there is room for a race there. If some other process opens >> /dev/md_d0 before mdadm gets to close it, it will call >> rescan_partitions before first calling bd_set_size to update the size >> of the bdev. So when we try to read the partition table, it will >> appear to be reading past the EOF, and will not actually read >> anything.. >> >> I guess udev must be opening the block device at exactly the wrong >> time. >> >> I can simulate this by holding /dev/md_d0 open while assembling the >> array. If I do that, the partitions don't get created. >> Yuck. >> >> Maybe I could call bd_set_size in md_open before calling >> check_disk_change.. >> >> Yep, this patch seems to fix it. Could you confirm? > almost... 
> > teak:~# mdadm --assemble /dev/md_d0 --auto=parts /dev/sd[bcdef]1 > mdadm: /dev/md_d0 has been started with 5 drives. > teak:~# mount /media > teak:~# umount /media > teak:~# mdadm --stop /dev/md_d0 > mdadm: stopped /dev/md_d0 > teak:~# mdadm --assemble /dev/md_d0 --auto=parts /dev/sd[bcdef]1 > mdadm: /dev/md_d0 has been started with 5 drives. > teak:~# mount /media > mount: No such file or directory > teak:~# mount /media > teak:~# > (second mount succeeds second time around) > > > > md: md_d0 stopped. > md: bind > md: bind > md: bind > md: bind > md: bind > raid5: device sde1 operational as raid disk 0 > raid5: device sdf1 operational as raid disk 4 > raid5: device sdb1 operational as raid disk 3 > raid5: device sdd1 operational as raid disk 2 > raid5: device sdc1 operational as raid disk 1 > raid5: allocated 5236kB for md_d0 > raid5: raid level 5 set md_d0 active with 5 out of 5 devices, algorithm 2 > RAID5 conf printout: > --- rd:5 wd:5 > disk 0, o:1, dev:sde1 > disk 1, o:1, dev:sdc1 > disk 2, o:1, dev:sdd1 > disk 3, o:1, dev:sdb1 > disk 4, o:1, dev:sdf1 > md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0 > created bitmap (10 pages) for device md_d0 > md_d0: p1 p2 > Filesystem "md_d0p1": Disabling barriers, not supported with external log > device > XFS mounting filesystem md_d0p1 > Ending clean XFS mount for filesystem: md_d0p1 > md: md_d0 stopped. > md: unbind > md: export_rdev(sde1) > md: unbind > md: export_rdev(sdf1) > md: unbind > md: export_rdev(sdb1) > md: unbind > md: export_rdev(sdd1) > md: unbind > md: export_rdev(sdc1) > md: md_d0 stopped. 
> md: bind > md: bind > md: bind > md: bind > md: bind > raid5: device sde1 operational as raid disk 0 > raid5: device sdf1 operational as raid disk 4 > raid5: device sdb1 operational as raid disk 3 > raid5: device sdd1 operational as raid disk 2 > raid5: device sdc1 operational as raid disk 1 > raid5: allocated 5236kB for md_d0 > raid5: raid level 5 set md_d0 active with 5 out of 5 devices, algorithm 2 > RAID5 conf printout: > --- rd:5 wd:5 > disk 0, o:1, dev:sde1 > disk 1, o:1, dev:sdc1 > disk 2, o:1, dev:sdd1 > disk 3, o:1, dev:sdb1 > disk 4, o:1, dev:sdf1 > md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0 > created bitmap (10 pages) for device md_d0 > md_d0: p1 p2 > XFS: Invalid device [/dev/md_d0p2], error=-2 > Filesystem "md_d0p1": Disabling barriers, not supported with external log > device > XFS mounting filesystem md_d0p1 > Ending clean XFS mount for filesystem: md_d0p1 > > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Swapping out for larger disks
Brad Campbell wrote: > G'day all, > > I've got 3 arrays here. A 3 drive raid-5, a 10 drive raid-5 and a 15 > drive raid-6. They are all currently 250GB SATA drives. > > I'm contemplating an upgrade to 500GB drives on one or more of the > arrays and wondering the best way to do the physical swap. > > The slow and steady way would be to degrade the array, remove a disk, > add the new disk, lather, rinse, repeat. After which I could use mdadm > --grow. There is the concern of a degraded array here though (and one of > the reasons I'm looking to swap is some of the disks have about 30,000 > hours on the clock and are growing the odd defect). Assuming hotswap and for maximum uptime/minimal exposure to risk... a while back there was a discussion of a fiddly way that involved adding a disk, making a mirror, removing the old disk, breaking the mirror. ( See archive for details) > > I was more wondering about the feasibility of using dd to copy the drive > contents to the larger drives (then I could do 5 at a time) and working > it from there. Err, if you can dd the drives, why can't you create a new array and use xfsdump or equivalent? Is downtime due to copying that bad? David - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: removed disk && md-device
Neil Brown wrote:
> On Wednesday May 9, [EMAIL PROTECTED] wrote:
>> Neil Brown <[EMAIL PROTECTED]> [2007.04.02.0953 +0200]:
>>> Hmmm... this is somewhat awkward. You could argue that udev should be
>>> taught to remove the device from the array before removing the device
>>> from /dev. But I'm not convinced that you always want to 'fail' the
>>> device. It is possible in this case that the array is quiescent and
>>> you might like to shut it down without registering a device failure...
>>
>> Hmm, the kernel advised hotplug to remove the device from /dev, but
>> you don't want to remove it from md? Do you have an example for that
>> case?
>
> Until there is known to be an inconsistency among the devices in an
> array, you don't want to record that there is.
>
> Suppose I have two USB drives with a mounted but quiescent filesystem
> on a raid1 across them.
> I pull them both out, one after the other, to take them to my friend's
> place.
>
> I plug them both in and find that the array is degraded, because as
> soon as I unplugged one, the other was told that it was now the only
> one.

And, in truth, so it was. Who updated the event count though?

> Not good. Best to wait for an IO request that actually returns an
> error.

Ah, now would that be a good time to update the event count?

Maybe you should allow drives to be removed even if they aren't faulty
or spare? A write to a removed device would mark it faulty in the other
devices without waiting for a timeout. But joggling a usb stick (similar
to your use case) would probably be OK since it would be hot-removed and
then hot-added.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: removed disk && md-device
> back as something different (sdp?) - though I'm not completely familiar
> with how USB storage works.

Yes, so, assuming my proposal, in the case where you hot remove sdb (not
fail) then hot add sdp (same drive slot, different drive identifier,
maybe different usb controller), the on-disk superblock can reliably
ensure that the array just continues (also assuming quiescence)?

> In any case, it should really be a user-space decision what happens
> then. A hot re-add may well be appropriate, but I wouldn't want to
> have the kernel make that decision.

udev is userspace though - you could have a conservative no-add policy
ruleset.

My proposal is simply to allow a hot-remove of a drive without marking
it faulty. This remove event would not update the event counts in other
drives. This allows transient (stupid human in the OP report) drive
removal to be properly communicated via udev to md. You don't end up in
the situation of "the drive formerly known as..."

Just out of interest: currently, if I unplug /dev/sdp (which is md0
slot3), wait, then plug in a random non-md usb drive which appears as
/dev/sdp, what does md do? Just write to the new /dev/sdp assuming it's
the old one?

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to synchronize two devices (RAID-1, but not really?)
Tomasz Chmielewski wrote:
> Peter Rabbitson schrieb:
>> Tomasz Chmielewski wrote:
>>> I have a RAID-10 setup of four 400 GB HDDs. As the data grows by
>>> several GBs a day, I want to migrate it somehow to RAID-5 on
>>> separate disks in a separate machine.
>>>
>>> Which would be easy, if I didn't have to do it online, without
>>> stopping any services.
>>
>> Your /dev/md10 - what is directly on top of it? LVM? XFS? EXT3?
>
> Good point. I don't want to copy the whole RAID-10.
> I want to copy only one LVM-2 volume (which is like 90% of that
> RAID-10, anyway).
>
> So I want to synchronize /dev/LVM2/my-volume (ext3) with /dev/sdr (now
> empty; bigger than /dev/LVM2/my-volume).
>
> (sda2, sdb2, sdc2, sdd2) -> RAID-10 -> LVM-2 -> my volume -> ext3

I've not used iSCSI, but I wonder about using nbd, the network block
device:

- Use nbd to export /dev/md5 from machine 2.
- Import it as /dev/nbd0 on machine 1.
- Add nbd0 to the VG on machine 1.
- pvmove the data from /dev/md10 to /dev/nbd0 (ie the md5 on machine 2,
  via nbd).
- Remove /dev/md10 from the VG. The VG should now exist only on
  /dev/nbd0, ie on machine 2.
- Stop the services and LVM on machine 1.
- Start the LVM and services on machine 2.

I'd suggest testing this first.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote:
> Including XFS mailing list on this one.

Thanks Justin.

> On Thu, 24 May 2007, Pallai Roland wrote:
>>
>> Hi,
>>
>> I'm wondering why md raid5 accepts writes after 2 disks have failed.
>> I've an array built from 7 drives, filesystem is XFS. Yesterday, an
>> IDE cable failed (my friend kicked it off from the box on the floor:)
>> and 2 disks have been kicked, but my download (yafc) didn't stop; it
>> tried and could write the file system for the whole night!
>> Now I changed the cable and tried to reassemble the array (mdadm -f
>> --run); the event counter increased from 4908158 up to 4929612 on the
>> failed disks, but I cannot mount the file system and 'xfs_repair -n'
>> shows lots of errors there. This is explainable by the partially
>> succeeded writes. Ext3 and JFS have an "error=" mount option to
>> switch the filesystem read-only on any error, but XFS hasn't: why?

"-o ro,norecovery" will allow you to mount the filesystem and get any
uncorrupted data off it. You still may get shutdowns if you trip across
corrupted metadata in the filesystem, though.

>> It's a good question too, but I think the md layer could save dumb
>> filesystems like XFS if it denies writes after 2 disks have failed,
>> and I cannot see a good reason why it doesn't behave this way.

How is *any* filesystem supposed to know that the underlying block
device has gone bad if it is not returning errors?

I did mention this exact scenario in the filesystems workshop back in
February - we'd *really* like to know if a RAID block device has gone
into degraded mode (i.e. lost a disk) so we can throttle new writes
until the rebuild has been completed. Stopping writes completely on a
fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
would also be possible, if only we could get the information out of the
block layer.

>> Do you have a better idea how I can avoid such filesystem corruptions
>> in the future? No, I don't want to use ext3 on this box. :)

Well, the problem is a bug in MD - it should have detected drives going
away and stopped access to the device until it was repaired. You would
have had the same problem with ext3, or JFS, or reiser, or any other
filesystem, too.

>> my mount error:
>> XFS: Log inconsistent (didn't find previous header)
>> XFS: failed to find log head
>> XFS: log mount/recovery failed: error 5
>> XFS: log mount failed

Your MD device is still hosed - error 5 = EIO; the md device is
reporting errors back to the filesystem now. You need to fix that before
trying to recover any data...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, May 25, 2007 at 03:35:48AM +0200, Pallai Roland wrote:
> On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
>>>> It's a good question too, but I think the md layer could save dumb
>>>> filesystems like XFS if it denies writes after 2 disks have failed,
>>>> and I cannot see a good reason why it doesn't behave this way.
>>
>> How is *any* filesystem supposed to know that the underlying block
>> device has gone bad if it is not returning errors?
>
> It is returning errors, I think so. If I try to write the raid5 with 2
> failed disks with dd, I get errors on the missing chunks.

Oh, did you look at your logs and find that XFS had spammed them about
writes that were failing?

> The difference between ext3 and XFS is that ext3 will remount
> read-only on the first write error but XFS won't; XFS only fails the
> current operation, IMHO. The method of ext3 isn't perfect, but in
> practice it's working well.

XFS will shut down the filesystem if metadata corruption would occur due
to a failed write. We don't immediately fail the filesystem on data
write errors because on large systems you can get *transient* I/O errors
(e.g. FC path failover), and so retrying failed data writes is useful
for preventing unnecessary shutdowns of the filesystem.

Different design criteria, different solutions...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote: > > > The difference between ext3 and XFS is that ext3 will remount to > > > read-only on the first write error but the XFS won't, XFS only fails > > > only the current operation, IMHO. The method of ext3 isn't perfect, but > > > in practice, it's working well. > > > > XFS will shutdown the filesystem if metadata corruption will occur > > due to a failed write. We don't immediately fail the filesystem on > > data write errors because on large systems you can get *transient* > > I/O errors (e.g. FC path failover) and so retrying failed data > > writes is useful for preventing unnecessary shutdowns of the > > filesystem. > > > > Different design criteria, different solutions... > > I think his point was that going into a read only mode causes a > less catastrophic situation (ie. a web server can still serve > pages). Sure - but once you've detected one corruption or had metadata I/O errors, can you trust the rest of the filesystem? > I think that is a valid point, rather than shutting down > the file system completely, an automatic switch to where the least > disruption of service can occur is always desired. I consider the possibility of serving out bad data (i.e after a remount to readonly) to be the worst possible disruption of service that can happen ;) > Maybe the automatic failure mode could be something that is > configurable via the mount options. If only it were that simple. Have you looked to see how many hooks there are in XFS to shutdown without causing further damage? % grep FORCED_SHUTDOWN fs/xfs/*.[ch] fs/xfs/*/*.[ch] | wc -l 116 Changing the way we handle shutdowns would take a lot of time, effort and testing. When can I expect a patch? ;) > I personally have found the XFS file system to be great for > my needs (except issues with NFS interaction, where the bug report > never got answered), but that doesn't mean it can not be improved. Got a pointer? Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, May 25, 2007 at 05:58:25PM +1000, Neil Brown wrote:
> We can think of there being three types of devices:
>
> 1/ SAFE. With a SAFE device, there is no write-behind cache, or if
>    there is it is non-volatile. Once a write completes it is
>    completely safe. Such a device does not require barriers
>    or ->issue_flush_fn, and can respond to them either by a
>    no-op or with -EOPNOTSUPP (the former is preferred).
>
> 2/ FLUSHABLE.
>    A FLUSHABLE device may have a volatile write-behind cache.
>    This cache can be flushed with a call to blkdev_issue_flush.
>    It may not support barrier requests.

So it returns -EOPNOTSUPP to any barrier request?

> 3/ BARRIER.
>    A BARRIER device supports both blkdev_issue_flush and
>    BIO_RW_BARRIER. Either may be used to synchronise any
>    write-behind cache to non-volatile storage (media).
>
> Handling of SAFE and FLUSHABLE devices is essentially the same and can
> work on a BARRIER device. The BARRIER device has the option of more
> efficient handling.
>
> How does a filesystem use this?
> ===
>
> The filesystem will want to ensure that all preceding writes are safe
> before writing the barrier block. There are two ways to achieve this.

Three, actually.

> 1/ Issue all 'preceding writes', wait for them to complete (bi_endio
>    called), call blkdev_issue_flush, issue the commit write, wait
>    for it to complete, call blkdev_issue_flush a second time.
>    (This is needed for FLUSHABLE)

*nod*

> 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
>    block.
>    (This is more efficient on BARRIER).

*nod*

3/ Use a SAFE device.

> The second, while much easier, can fail.

So we do a test I/O to see if the device supports them before enabling
that mode. But, as we've recently discovered, this is not sufficient to
detect *correctly functioning* barrier support.

> So a filesystem should be prepared to deal with that failure by
> falling back to the first option.

I don't buy that argument.
> Thus the general sequence might be:
>
> a/ issue all "preceding writes".
> b/ issue the commit write with BIO_RW_BARRIER

At this point, the filesystem has done everything it needs to ensure
that the block layer has been informed of the I/O ordering requirements.
Why should the filesystem now have to detect block layer breakage, and
then use a different block layer API to issue the same I/O under the
same constraints?

> c/ wait for the commit to complete.
>    If it was successful - done.
>    If it failed other than with EOPNOTSUPP, abort
>    else continue
> d/ wait for all 'preceding writes' to complete
> e/ call blkdev_issue_flush
> f/ issue commit write without BIO_RW_BARRIER
> g/ wait for commit write to complete
>    if it failed, abort
> h/ call blkdev_issue_flush?
> DONE
>
> steps b and c can be left out if it is known that the device does not
> support barriers. The only way to discover this is to try and see if
> it fails.

That's a very linear, single-threaded way of looking at it... ;)

> I don't think any filesystem follows all these steps.
>
> ext3 has the right structure, but it doesn't include steps e and h.
> reiserfs is similar. It does have a call to blkdev_issue_flush, but
> that is only on the fsync path, so it isn't really protecting
> general journal commits.
> XFS - I'm less sure. I think it does 'a' then 'd', then 'b' or 'f'
> depending on whether it thinks the device handles barriers,
> and finally 'g'.

That's right, except for the "g" (or "c") bit - commit writes are async
and nothing waits for them; the io completion wakes anything waiting on
its completion. (Yes, all XFS barrier I/Os are issued async, which is
why having to handle an -EOPNOTSUPP error is a real pain. The fix I
currently have is to reissue the I/O from the completion handler, which
is ugly, ugly, ugly.)
> So for devices that support BIO_RW_BARRIER, and for devices that don't > need any flush, they work OK, but for device that need flushing, but > don't support BIO_RW_BARRIER, none of them work. This should be easy > to fix. Right - XFS as it stands was designed to work on SAFE devices, and we've modified it to work on BARRIER devices. We don't support FLUSHABLE devices at all. But if the filesystem supports BARRIER devices, I don't see any reason why a filesystem needs to be modified to support FLUSHABLE devices - the key point being that by the time the filesystem has issued the "commit write" it has already waited for all it's dependent I/O, and so all the block device needs to do is issue flushes either side of the commit write > HOW DO MD or DM USE THIS > > > 1/ striping devices. > This includes md/raid0 md/linear dm-linear dm-stripe and probably > others. > >These devices can easily supp
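[Editorial note: Neil's a/-h/ fallback sequence reads naturally as straight-line code. Below is a toy sketch against an imaginary FLUSHABLE device - all class and method names are invented for illustration, and the real kernel path is asynchronous rather than linear, exactly as Dave points out:]

```python
class BarrierNotSupported(Exception):
    """Stands in for a BIO_RW_BARRIER bio failing with -EOPNOTSUPP."""

class FlushableDevice:
    """Toy FLUSHABLE device: volatile write cache, flush works, barriers don't."""
    def __init__(self):
        self.cache, self.media, self.flushes = [], [], 0
    def submit(self, block, barrier=False):
        if barrier:
            raise BarrierNotSupported   # device can't order writes itself
        self.cache.append(block)        # lands in the volatile cache
    def wait(self, blocks):
        pass                            # toy: writes "complete" instantly
    def flush(self):
        self.media.extend(self.cache)   # blkdev_issue_flush: cache -> media
        self.cache.clear()
        self.flushes += 1

def journal_commit(dev, preceding, commit_block):
    """The a/-h/ sequence from the post, as straight-line code."""
    for w in preceding:                          # a: issue all preceding writes
        dev.submit(w)
    try:
        dev.submit(commit_block, barrier=True)   # b: barrier commit write
        dev.wait([commit_block])                 # c: done if the barrier worked
        return "barrier"
    except BarrierNotSupported:
        pass                                     # fall back for FLUSHABLE
    dev.wait(preceding)                          # d: wait for preceding writes
    dev.flush()                                  # e: push them to media
    dev.submit(commit_block)                     # f: plain commit write
    dev.wait([commit_block])                     # g: wait for it
    dev.flush()                                  # h: push the commit to media
    return "flush"

dev = FlushableDevice()
mode = journal_commit(dev, ["w1", "w2"], "commit")
print(mode, dev.media, dev.flushes)  # -> flush ['w1', 'w2', 'commit'] 2
```

The invariant worth noticing: on the fallback path the commit block only reaches media after everything it depends on, at the cost of two cache flushes per commit.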
Re: raid10 kernel panic on sparc64
From: Jan Engelhardt <[EMAIL PROTECTED]> Date: Sat, 26 May 2007 17:10:30 +0200 (MEST) > > On Apr 12 2007 14:26, David Miller wrote: > > > >> Kernel is kernel-smp-2.6.16-1.2128sp4.sparc64.rpm from Aurora Corona. > >> Perhaps it helps, otherwise hold your breath until I reproduce it. > > > >Jan, if you can reproduce this with the current 2.6.20 vanilla > >kernel I'd be very interested in a full trace so that I can > >try to fix this. > > > >With the combination of an old kernel and only part of the > >crash trace, there isn't much I can do with this report. > > Does not seem to happen under 2.6.21-1.3149.al3.2smp anymore. Thanks for following up on this Jan. I'd personally really appreciate reports against upstream instead of dist kernels in the future, and I'm sure the linux-raid maintainers feel similarly :-) - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html