Re: BUGS: internal bitmap during array create
On Wednesday October 18, [EMAIL PROTECTED] wrote:
> I've provided the requested info, attached as two files (typescript
> output):

Thanks for persisting with this.

There is one bug in mdadm that is causing all of these problems. It
only affects the 'offset' layout with raid10. The fix is

  http://neil.brown.name/git?p=mdadm;a=commitdiff;h=702b557b1c9

and is included below. You might like to grab the latest source from

  git://neil.brown.name/mdadm

and compile that, or just apply the patch.

Thanks again,
NeilBrown

-
Fix bugs related to raid10 and the new offset layout.

Need to mask off the bits above the bottom 16 when calculating the
number of copies.

### Diffstat output
 ./ChangeLog |    1 +
 ./Create.c  |    2 +-
 ./util.c    |    2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff .prev/ChangeLog ./ChangeLog
--- .prev/ChangeLog     2006-10-19 16:38:07.0 +1000
+++ ./ChangeLog 2006-10-19 16:38:24.0 +1000
@@ -13,6 +13,7 @@ Changes Prior to this release
          initramfs, but device doesn't yet exist in /dev.
        - When --assemble --scan is run, if all arrays that could be found
          have already been started, don't report an error.
+       - Fix a couple of bugs related to raid10 and the new 'offset' layout.
 
 Changes Prior to 2.5.4 release
        - When creating devices in /dev/md/ create matching symlinks

diff .prev/Create.c ./Create.c
--- .prev/Create.c      2006-10-19 16:38:07.0 +1000
+++ ./Create.c  2006-10-19 16:38:24.0 +1000
@@ -363,7 +363,7 @@ int Create(struct supertype *st, char *m
                 * which is array.size * raid_disks / ncopies;
                 * .. but convert to sectors.
                 */
-               int ncopies = (layout>>8) * (layout & 255);
+               int ncopies = ((layout>>8) & 255) * (layout & 255);
                bitmapsize = (unsigned long long)size * raiddisks / ncopies * 2;
                /* printf("bms=%llu as=%d rd=%d nc=%d\n", bitmapsize, size, raiddisks, ncopies);*/
        } else

diff .prev/util.c ./util.c
--- .prev/util.c        2006-10-19 16:38:07.0 +1000
+++ ./util.c    2006-10-19 16:38:24.0 +1000
@@ -179,7 +179,7 @@ int enough(int level, int raid_disks, in
                /* This is the tricky one - we need to check
                 * which actual disks are present.
                 */
-               copies = (layout&255)* (layout>>8);
+               copies = (layout&255)* ((layout>>8) & 255);
                first=0;
                do {
                        /* there must be one of the 'copies' form 'first' */
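For anyone wondering why only the offset layout was bitten: the raid10
layout word packs the near-copy count into bits 0-7, the far-copy count
into bits 8-15, and uses bit 16 to flag the offset variant, so an
unmasked layout>>8 picks up the offset bit. Below is a minimal sketch
of the decoding, consistent with the patch above (the example layout
value for 'o2' is an illustration, not lifted from the mdadm source):

#include <stdio.h>

/* Decode an md raid10 layout word: bits 0-7 are the near-copy count,
 * bits 8-15 the far-copy count, bit 16 flags the 'offset' variant.
 * Without the "& 255" mask the offset bit leaks into the copy count. */
int main(void)
{
        int layout = (1 << 16) | (2 << 8) | 1; /* illustrative 'o2' layout */

        int near    = layout & 255;
        int far_bad = layout >> 8;             /* 0x102 = 258: wrong */
        int far_ok  = (layout >> 8) & 255;     /* 2: correct */

        printf("ncopies buggy=%d, fixed=%d\n", near * far_bad, near * far_ok);
        return 0;
}

With 258 'copies' instead of 2, the bitmap size computed at --create
comes out wildly wrong, while near and far layouts (bit 16 clear) are
untouched - matching Neil's note that only 'offset' arrays hit this.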
Re: Propose of enhancement of raid1 driver
On Tuesday October 17, [EMAIL PROTECTED] wrote:
> I would like to propose an enhancement of the raid1 driver in the
> linux kernel. The enhancement would be a speedup of data reading on
> mirrored partitions. The idea is easy.
> If we have a mirrored partition over 2 disks, and these disks are in
> sync, there is the possibility of reading the data from both disks
> simultaneously, in the same way as in raid0. So it would be chunk1
> read from the master, chunk2 read from the slave at the same time.
> As a result it would give a significant speedup of read operations
> (comparable with the speed of raid0 disks).

This is not as easy as it sounds. Skipping over blocks within a track
is no faster than reading the blocks in the track, so you would need to
make sure that your chunk size is larger than one track - probably it
would need to be several tracks.

Raid1 already does some read-balancing, though it is possible (even
likely) that it doesn't balance very effectively. Working out how best
to do the balancing in general is a non-trivial task, but it would be
worth spending time on.

The raid10 module in linux supports a layout described as 'far=2'. In
this layout, with two drives, the first half of each drive is used for
a raid0, and the second half is used for a mirrored copy of that raid0,
arranged so that each block's second copy is on the other disk. In this
layout reads should certainly go at raid0 speeds, though there is a
cost in write speed.

Maybe you would like to experiment. Write a program that reads from two
drives in parallel, reading all the 'odd' chunks from one drive and the
'even' chunks from the other, and find out how fast it is. Maybe you
could get it to try lots of different chunk sizes and see which is the
fastest. That might be quite helpful in understanding how to get
read-balancing working well.

NeilBrown
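For what it's worth, the suggested experiment could start from
something like the sketch below. It forks two readers that fetch
alternating chunks from two devices in parallel and reports the
combined throughput. The device paths, chunk size, and total size are
placeholders to adjust; for honest numbers you would also want
O_DIRECT (with aligned buffers) or a cache drop, so the page cache
doesn't flatter the result.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)           /* try several chunk sizes */
#define TOTAL (1024LL * 1024 * 1024)  /* read 1 GiB in total */

/* Read every second chunk from dev, starting at chunk 'first'. */
static void reader(const char *dev, long long first)
{
        char *buf = malloc(CHUNK);
        int fd = open(dev, O_RDONLY);
        if (fd < 0 || !buf) { perror(dev); _exit(1); }
        for (long long c = first; c * CHUNK < TOTAL; c += 2)
                if (pread(fd, buf, CHUNK, c * CHUNK) != CHUNK) {
                        perror("pread"); _exit(1);
                }
        _exit(0);
}

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s /dev/XXX /dev/YYY\n", argv[0]);
                return 1;
        }
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        if (fork() == 0) reader(argv[1], 0);  /* 'even' chunks, disk 1 */
        if (fork() == 0) reader(argv[2], 1);  /* 'odd' chunks, disk 2 */
        wait(NULL); wait(NULL);
        gettimeofday(&t1, NULL);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%lld MB in %.1fs: %.1f MB/s\n", TOTAL >> 20, s, TOTAL / s / 1e6);
        return 0;
}

Sweeping CHUNK over a few values should show the track-size effect
described above, if it holds: below some threshold the two-disk read
is no faster than a single disk.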
Re: new features time-line
On Tuesday October 17, [EMAIL PROTECTED] wrote:
> We talked about RAID5E a while ago, is there any thought that this
> would actually happen, or is it one of the "would be nice" features?
> With larger drives I suspect the number of drives in arrays is going
> down, and anything which offers performance benefits for smaller
> arrays would be useful.

So ... RAID5E is RAID5 using (N-1)/N of each drive (or close to that)
and not having a hot spare. On a drive failure, the data is restriped
across N-1 drives so that it becomes plain RAID5. This means that
instead of having an idle spare, you have spare space at the end of
each drive.

To implement this you would need kernel code to restripe an array to
reduce the number of devices (currently we only increase the number of
devices). Probably not too hard - it just needs code and motivation.

I don't know if/when it will happen, but it probably will, especially
if someone tries writing some code (hint hint to any potential
developers out there...).

NeilBrown
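To put rough numbers on the space trade-off (my own back-of-the-envelope
arithmetic, not from the mail above): if the restriped, degraded array
has to fit on N-1 full drives, the original N-drive array can use at
most (N-2)/(N-1) of each drive, a precise version of the "(N-1)/N or
close to that" figure. A small sketch of the calculation:

#include <stdio.h>

/* Back-of-the-envelope RAID5E capacity check. RAID5 over N drives,
 * each contributing f*S of space, yields (N-1)*f*S usable; for the
 * post-failure restripe onto N-1 full drives to fit, that must not
 * exceed (N-2)*S, so f = (N-2)/(N-1). Usable space then equals RAID5
 * over N-1 drives plus a hot spare, but I/O is spread over all N
 * spindles. S = 400 GB is just an example drive size. */
int main(void)
{
        double S = 400.0;
        for (int n = 4; n <= 8; n++) {
                double f = (double)(n - 2) / (n - 1);
                printf("N=%d: use %4.1f%% of each drive, %4.0f GB usable\n",
                       n, f * 100, (n - 2) * S);
        }
        return 0;
}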
Re: [PATCH] md: Fix bug where new drives added to an md array sometimes don't sync properly.
FYI, I'm testing 2.6.18.1 and noticed that this mis-numbering of RAID10
members is still extant. Even with this fix applied to raid10.c, I am
still seeing repeatable issues with devices assuming a "Number" greater
than the one they had when removed from a running array.

Issue 1) I'm seeing inconsistencies in the way a drive is marked (and
in its behaviour) during rebuild after it is removed and added. In this
instance, the re-added drive is picked up and marked as "spare
rebuilding".

 Rebuild Status : 20% complete
           Name : 0
           UUID : ab764369:7cf80f2b:cf61b6df:0b13cd3a
         Events : 1

    Number   Major   Minor   RaidDevice State
       0      253        0        0      active sync   /dev/dm-0
       1      253        1        1      active sync   /dev/dm-1
       2      253       10        2      active sync   /dev/dm-10
       3      253       11        3      active sync   /dev/dm-11
       4      253       12        4      active sync   /dev/dm-12
       5      253       13        5      active sync   /dev/dm-13
       6      253        2        6      active sync   /dev/dm-2
       7      253        3        7      active sync   /dev/dm-3
       8      253        4        8      active sync   /dev/dm-4
       9      253        5        9      active sync   /dev/dm-5
      10      253        6       10      active sync   /dev/dm-6
      11      253        7       11      active sync   /dev/dm-7
      12      253        8       12      active sync   /dev/dm-8
      13      253        9       13      active sync   /dev/dm-9

[EMAIL PROTECTED] ~]# cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 dm-9[13] dm-8[12] dm-7[11] dm-6[10] dm-5[9] dm-4[8] dm-3[7] dm-2[6] dm-13[5] dm-12[4] dm-11[3] dm-10[2] dm-1[1] dm-0[0]
      1003620352 blocks super 1.2 512K chunks 2 offset-copies [14/14] [UUUUUUUUUUUUUU]
      [====>................]  resync = 21.7% (218664064/1003620352) finish=114.1min speed=114596K/sec

However, on the same configuration, the drive occasionally is pulled
right back in with a state of "active sync", without any indication
that it is dirty.

Issue 2) When a device is removed and subsequently added again (after
setting it failed and removing it from the array), it SHOULD be set
back to the "Number" it originally had in the array, correct? In the
cases when the drive is NOT automatically marked as "active sync" and
all members show up fine, it is picked up as a spare and a rebuild is
started, during which time it is marked down "_" in the /proc/mdstat
data, and "spare rebuilding" in mdadm -D output:

When device "Number" 10

// STATE WHEN CLEAN:
           UUID : 6ccd7974:1b23f5b2:047d1560:b5922692

    Number   Major   Minor   RaidDevice State
       0      253        0        0      active sync   /dev/dm-0
       1      253        1        1      active sync   /dev/dm-1
       2      253       10        2      active sync   /dev/dm-10
       3      253       11        3      active sync   /dev/dm-11
       4      253       12        4      active sync   /dev/dm-12
       5      253       13        5      active sync   /dev/dm-13
       6      253        2        6      active sync   /dev/dm-2
       7      253        3        7      active sync   /dev/dm-3
       8      253        4        8      active sync   /dev/dm-4
       9      253        5        9      active sync   /dev/dm-5
      10      253        6       10      active sync   /dev/dm-6
      11      253        7       11      active sync   /dev/dm-7
      12      253        8       12      active sync   /dev/dm-8
      13      253        9       13      active sync   /dev/dm-9

// STATE AFTER FAILURE:

    Number   Major   Minor   RaidDevice State
       0      253        0        0      active sync   /dev/dm-0
       1      253        1        1      active sync   /dev/dm-1
       2        0        0        2      removed
       3      253       11        3      active sync   /dev/dm-11
       4      253       12        4      active sync   /dev/dm-12
       5      253       13        5      active sync   /dev/dm-13
       6      253        2        6      active sync   /dev/dm-2
       7      253        3        7      active sync   /dev/dm-3
       8      253        4        8      active sync   /dev/dm-4
       9      253        5        9      active sync   /dev/dm-5
      10      253        6       10      active sync   /dev/dm-6
      11      253        7       11      active sync   /dev/dm-7
      12      253        8       12      active sync   /dev/dm-8
      13      253        9       13      active sync   /dev/dm-9

       2      253       10        -      faulty spare   /dev/dm-10

// STATE AFTER REMOVAL:

    Number   Major   Minor   RaidDevice State
       0      253        0        0      active sync   /dev/dm-0
       1      253        1        1      active sync   /dev/dm-1
       2        0        0        2      removed
       3      253       11        3      active sync   /dev/dm-11
       4      253       12
Re: why partition arrays?
On Wed, 2006-10-18 at 15:43 +0200, martin f krafft wrote:
> also sprach Doug Ledford <[EMAIL PROTECTED]> [2006.10.18.1526 +0200]:
> > There are a couple reasons I can think.
>
> Thanks for your elaborate response. If you don't mind, I shall link
> to it from the FAQ.

Sure.

> I have one other question: do partitionable and traditional arrays
> actually differ in format? Put differently: can I assemble
> a traditional array as a partitionable one simply by specifying:
>
>   mdadm --create ... /dev/md0 ...
>   mdadm --stop /dev/md0
>   mdadm --assemble --auto=part ... /dev/md0 ...
>
> ? Or do the superblocks actually differ?

Neil would be more authoritative about what would differ in the
superblocks, but yes, it is possible to do as you listed above. In
fact, if you create a partitioned array and your mkinitrd doesn't
restart it as a partitioned array, you'll wonder how to mount your
filesystems, since the system will happily start that originally
partitioned array as non-partitioned.

--
Doug Ledford <[EMAIL PROTECTED]>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
Re: why partition arrays?
also sprach Doug Ledford <[EMAIL PROTECTED]> [2006.10.18.1526 +0200]:
> There are a couple reasons I can think.

Thanks for your elaborate response. If you don't mind, I shall link
to it from the FAQ.

I have one other question: do partitionable and traditional arrays
actually differ in format? Put differently: can I assemble a
traditional array as a partitionable one simply by specifying:

  mdadm --create ... /dev/md0 ...
  mdadm --stop /dev/md0
  mdadm --assemble --auto=part ... /dev/md0 ...

? Or do the superblocks actually differ?

Thanks,

--
martin;              (greetings from the heart of the sun.)
  \ echo mailto: !#^."<*>"|tr "<*> mailto:"; [EMAIL PROTECTED]

spamtraps: [EMAIL PROTECTED]

the images rushed around his mind and tried to find somewhere to
settle down and make sense.
        -- douglas adams, "the hitchhiker's guide to the galaxy"
Re: why partition arrays?
On Wed, 2006-10-18 at 14:42 +0200, martin f krafft wrote:
> Why would anyone want to create a partitionable array and put
> partitions in it, rather than creating separate arrays for each
> filesystem? Intuitively, this makes way more sense as then the
> partitions are independent of each other; one array can fail and the
> rest still works -- part of the reason why you partition in the
> first place.
>
> Would anyone help me answer this FAQ?

There are a couple reasons I can think.

First, not all md types make sense to be split up, aka multipath. For
those types, when a disk fails, the *entire* disk is considered to be
failed, but with different arrays you won't fail over to the next path
until each md array has attempted to access the bad path. This can have
obvious bad consequences for certain array types that do automatic
failover from one port to another (you can end up getting the array in
a loop of switching ports repeatedly to satisfy the fact that one array
failed over during a path down, then the path came back up, and another
array stayed on the old path because it didn't send any commands during
the path-down time period).

Second, convenience. Assume you have a 6 disk raid5 array. If a disk
fails and you are using a partitioned md array, then all the partitions
on the disk will already be handled without using that disk. No need to
manually fail any still-active array members from other arrays.

Third, safety. Again with the raid5 array. If you use multiple arrays
on a single disk, and that disk fails, but it only failed on one array,
then you now need to manually fail that disk from the other arrays
before shutting down or hot swapping the disk. Generally speaking,
that's not a big deal, but people do occasionally have fat finger
syndrome, and this is a good opportunity for someone to accidentally
fail the wrong disk; when you then go to remove the disk you create a
two disk failure instead of one, and now you are in real trouble.

Fourth, to respond to what you wrote about the partitions being
independent of each other -- part of the reason why you partition: I
would argue that's not true. If your goal is to salvage as much use
from a failing disk as possible, then OK. But, generally speaking,
people that have something of value on their disks don't want to
salvage any part of a failing disk; they want that disk gone and
replaced immediately. There simply is little to no value in an already
malfunctioning disk. They're too cheap, and the data stored on them too
valuable, to risk losing something in an effort to further utilize
broken hardware. This of course is written with the understanding that
the latest md raid code will do read error rewrites to compensate for
minor disk issues, so anything that will throw a disk out of an array
is more than just a minor sector glitch.

> (btw: [0] and [1] are obviously for public consumption; they are
> available under the terms of the artistic licence 2.0)
>
> 0. http://svn.debian.org/wsvn/pkg-mdadm/mdadm/trunk/debian/FAQ?op=file&rev=0&sc=0
> 1. http://svn.debian.org/wsvn/pkg-mdadm/mdadm/trunk/debian/README.recipes?op=file&rev=0&sc=0

--
Doug Ledford <[EMAIL PROTECTED]>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
why partition arrays?
As the Debian mdadm maintainer, I am often subjected to questions about
partitionable arrays; people seem to want to use them in favour of
normal arrays. I don't understand why.

There's possibly an argument to be made about flexibility when it comes
to resizing partitions within the array, but even most MD array types
can be resized now. There's possibly an argument about saving space
because of fewer sectors used/wasted with superblock information, but I
am not going to buy that.

Why would anyone want to create a partitionable array and put
partitions in it, rather than creating separate arrays for each
filesystem? Intuitively, this makes way more sense as then the
partitions are independent of each other; one array can fail and the
rest still works -- part of the reason why you partition in the first
place.

Would anyone help me answer this FAQ?

(btw: [0] and [1] are obviously for public consumption; they are
available under the terms of the artistic licence 2.0)

0. http://svn.debian.org/wsvn/pkg-mdadm/mdadm/trunk/debian/FAQ?op=file&rev=0&sc=0
1. http://svn.debian.org/wsvn/pkg-mdadm/mdadm/trunk/debian/README.recipes?op=file&rev=0&sc=0

--
martin;              (greetings from the heart of the sun.)
  \ echo mailto: !#^."<*>"|tr "<*> mailto:"; [EMAIL PROTECTED]

spamtraps: [EMAIL PROTECTED]

"the liar at any rate recognises that recreation, not instruction, is
the aim of conversation, and is a far more civilised being than the
blockhead who loudly expresses his disbelief in a story which is told
simply for the amusement of the company."
        -- oscar wilde
Problem with Software RAID5
Hi!

I've got a problem with a software raid5. The PC runs a debian
sarge/sid/etch mix with a self-built 2.6.16 kernel and mdadm
2.5.3.git200608202239-7.

One of six 400GB SATA drives failed and I rebooted the PC. After the
reboot the RAID was resyncing, but the HD died again and the PC
rebooted again. I pulled out the bad HD, and now the RAID5 won't
resync. I built a 2.6.17 kernel and upgraded mdadm to 2.5.4-1, and the
RAID5 still won't resync:

cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid5] [raid4] [raid6] [multipath] [faulty]
md0 : inactive sda1[0] sde1[5] sdd1[4] sdc1[3] sdb1[1]
      1953543680 blocks

unused devices: <none>

mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Fri May 12 16:10:24 2006
     Raid Level : raid5
    Device Size : 390708736 (372.61 GiB 400.09 GB)
   Raid Devices : 6
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Oct 17 14:11:56 2006
          State : active, degraded, Not Started
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 256K

           UUID : 5ce125ae:b76d7567:a531953b:fbba92fc
         Events : 0.2818447

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       0        0        2      removed
       3       8       33        3      active sync   /dev/sdc1
       4       8       49        4      active sync   /dev/sdd1
       5       8       65        5      active sync   /dev/sde1

mdadm --stop /dev/md0
mdadm: stopped /dev/md0

sinope:~# mdadm --assemble /dev/md0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

Any hints? Tips?

MfG,
Lars Schimmer

--
TU Graz, Institut für ComputerGraphik & WissensVisualisierung
Tel: +43 316 873-5405       E-Mail: [EMAIL PROTECTED]
Fax: +43 316 873-5402       PGP-Key-ID: 0x4A9B1723