Re: PROBLEM: raid5 hangs
This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the RAID5 bio* patches are applied.

Justin.

On Wed, 14 Nov 2007, Peter Magnusson wrote:

Hey.

[1.] One line summary of the problem: raid5 hangs and uses 100% CPU

[2.] Full description of the problem/report:

I had used 2.6.18 for about 284 days, until my power supply died; no problems whatsoever during that time. After that forced reboot I made these changes: I put in 2 GB more memory, so I now have 3 GB instead of 1 GB, and because two disks in the raid5 had developed bad blocks and I no longer trusted them, I bought new disks (I managed to save the raid5). I have 6x300 GB in a raid5. Two of the disks are now 320 GB, so I created a small raid1 as well. The raid5 is encrypted with aes-cbc-plain; the raid1 is encrypted with aes-cbc-essiv:sha256.

I compiled linux-2.6.22.3 and started to use that. I used the same .config as the default FC5 kernel; I think I just selected P4 CPU and the preemptive kernel type. After 11 or 12 days the computer froze. I wasn't home when it happened and couldn't fix it for about 3 days. All I could do was reboot it, since it wasn't possible to log in remotely or on the console; it did respond to ping, however. After the reboot it rebuilt the raid5. Then it happened again after approximately the same time, 11 or 12 days. I noticed that the md1_raid5 process was using 100% CPU the whole time. After the reboot it rebuilt the raid5.

I compiled linux-2.6.23. And then... it happened again, after about the same time as before. md1_raid5 used 100% CPU. I also noticed that I wasn't able to save anything in my home directory; it froze during save. I could read from it, however. My home directory isn't on the raid5, but it is encrypted, and it's not on any disk that has to do with raid. This problem didn't happen when I used 2.6.18. Currently I use 2.6.18, as I kinda need the computer stable. After the reboot it rebuilt the raid5.
top looked like this:

02:37:32 up 11 days, 2:00, 29 users, load average: 21.06, 17.45, 9.38
Tasks: 284 total, 2 running, 282 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.1%us, 51.2%sy, 0.0%ni, 0.0%id, 46.6%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  3114928k total, 2981720k used, 133208k free, 8244k buffers
Swap: 2096472k total, 252k used, 2096220k free, 1690196k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 2147 root  15  -5     0    0    0 R  100  0.0  80:25.80 md1_raid5
11328 iocc  20   0  536m 374m  28m S    3 12.3 249:32.38 firefox-bin

After some time, just before I rebooted, the load was:

02:48:36 up 11 days, 2:11, 29 users, load average: 86.10, 70.80, 40.07

[3.] Keywords (i.e., modules, networking, kernel): raid5, possibly dm_mod

[4.] Kernel version (from /proc/version): Not using 2.6.23 now, but anyway:
Linux version 2.6.18 ([EMAIL PROTECTED]) (gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)) #1 SMP Sun Sep 24 12:58:16 CEST 2006

[5.] Output of Oops.. message (if applicable) with symbolic information resolved (see Documentation/oops-tracing.txt): No oopses; nothing gets logged.

[6.] A small shell script or example program which triggers the problem (if possible): -

[7.] Environment: Hmm..
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.8G  7.0G  761M  91% /         <- unencrypted fs
tmpfs                 1.5G     0  1.5G   0% /dev/shm
/dev/mapper/home       24G   23G  1.6G  94% /home     <- encrypted fs
/dev/mapper/temp      1.4T  822G  555G  60% /temp     <- encrypted fs, raid5
/dev/mapper/jb         18G   17G  1.2G  94% /mnt/jb   <- encrypted fs, raid1

[EMAIL PROTECTED] linux-2.6.23]# cryptsetup status home
/dev/mapper/home is active:
  cipher:  aes-cbc-plain
  keysize: 256 bits
  device:  /dev/sda3
  offset:  0 sectors
  size:    50861790 sectors
  mode:    read/write

[EMAIL PROTECTED] linux-2.6.23]# cryptsetup status temp
/dev/mapper/temp is active:
  cipher:  aes-cbc-plain
  keysize: 256 bits
  device:  /dev/md1
  offset:  0 sectors
  size:    2930496000 sectors
  mode:    read/write

[EMAIL PROTECTED] linux-2.6.23]# cryptsetup status jb
/dev/mapper/jb is active:
  cipher:  aes-cbc-essiv:sha256
  keysize: 256 bits
  device:  /dev/md0
  offset:  0 sectors
  size:    37238528 sectors
  mode:    read/write

[7.1.] Software (add the output of the ver_linux script here)
If some fields are empty or look unusual you may have an old version. Compare to the current minimal requirements in Documentation/Changes.

Linux flashdance.cx 2.6.18 #1 SMP Sun Sep 24 12:58:16 CEST 2006 i686 i686 i386 GNU/Linux
Gnu C                  4.1.1
Gnu make               3.80
binutils               2.16.91.0.6
util-linux             2.13-pre7
mount                  2.13-pre7
module-init-tools      3.2.2
e2fsprogs              1.38
reiserfsprogs          3.6.19
quota-tools            3.13
PPP                    2.4.3
Linux C Library        2.4
Dynamic linker (ldd)   2.4
Procps                 3.2.7
Net-tools              1.60
Kbd                    1.12
oprofile               0.9.1
Sh-utils               5.97
udev
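For a hang like this, where md1_raid5 spins at 100% CPU but nothing reaches the logs, a kernel task trace is usually the most useful next step. A minimal sketch, run as root on the console while the hang is in progress (assumes CONFIG_MAGIC_SYSRQ is enabled in the kernel; the grep pattern is just an illustration):

```shell
# Capture kernel stack traces of all tasks when the md thread wedges.
# Output lands in the kernel ring buffer (dmesg) and usually syslog.
echo 1 > /proc/sys/kernel/sysrq     # make sure SysRq is enabled
echo t > /proc/sysrq-trigger        # dump all task states and stacks
dmesg | grep -A 20 md1_raid5        # pull out the raid5 thread's trace
```

Attaching that trace to the report lets the md developers see exactly where the thread is spinning.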
Re: PROBLEM: raid5 hangs
On Wed, 14 Nov 2007, Justin Piszcz wrote:
> This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the
> RAID5 bio* patches are applied.

Ok, good to know. Do you know when it first appeared? Because it existed in linux-2.6.22.3 also...

- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PROBLEM: raid5 hangs
On Wed, 14 Nov 2007, Peter Magnusson wrote:
> On Wed, 14 Nov 2007, Justin Piszcz wrote:
>> This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the
>> RAID5 bio* patches are applied.
>
> Ok, good to know. Do you know when it first appeared? Because it
> existed in linux-2.6.22.3 also...

I am unsure; I and others started noticing it mainly in 2.6.23. Again, not sure; I will let others answer this one.

Justin.
Re: RAID5 Recovery
Neil Cavan wrote:
> Hello,

Hi Neil

What kernel version?
What mdadm version?

> This morning, I woke up to find the array had kicked two disks. This
> time, though, /proc/mdstat showed one of the failed disks (U_U_U, one
> of the "_"s) had been marked as a spare - weird, since there are no
> spare drives in this array. I rebooted, and the array came back in the
> same state: one failed, one spare. I hot-removed and hot-added the
> spare drive, which put the array back to where I thought it should be
> (still U_U_U, but with both "_"s marked as failed). Then I rebooted,
> and the array began rebuilding on its own. Usually I have to hot-add
> manually, so that struck me as a little odd, but I gave it no mind and
> went to work. Without checking the contents of the filesystem. Which
> turned out not to have been mounted on reboot.

OK

> Because apparently things went horribly wrong.

Yep :(

> Do I have any hope of recovering this data? Could rebuilding the
> reiserfs superblock help if the rebuild managed to corrupt the
> superblock but not the data?

See below

> Nov 13 02:01:03 localhost kernel: [17805772.424000] hdc: dma_intr:
> status=0x51 { DriveReady SeekComplete Error }
> Nov 13 02:01:06 localhost kernel: [17805775.156000] lost page write
> due to I/O error on md0

hdc1 fails

> Nov 13 02:01:06 localhost kernel: [17805775.196000] RAID5 conf printout:
> Nov 13 02:01:06 localhost kernel: [17805775.196000]  --- rd:5 wd:3 fd:2
> Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 0, o:1, dev:hda1
> Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 1, o:0, dev:hdc1
> Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 2, o:1, dev:hde1
> Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 4, o:1, dev:hdi1

hdg1 is already missing?
> Nov 13 02:01:06 localhost kernel: [17805775.212000] RAID5 conf printout:
> Nov 13 02:01:06 localhost kernel: [17805775.212000]  --- rd:5 wd:3 fd:2
> Nov 13 02:01:06 localhost kernel: [17805775.212000]  disk 0, o:1, dev:hda1
> Nov 13 02:01:06 localhost kernel: [17805775.212000]  disk 2, o:1, dev:hde1
> Nov 13 02:01:06 localhost kernel: [17805775.212000]  disk 4, o:1, dev:hdi1

so now the array is bad.

a reboot happens and:

> Nov 13 07:21:07 localhost kernel: [17179584.712000] md: md0 stopped.
> Nov 13 07:21:07 localhost kernel: [17179584.876000] md: bind
> Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind
> Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind
> Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind
> Nov 13 07:21:07 localhost kernel: [17179584.892000] md: bind
> Nov 13 07:21:07 localhost kernel: [17179584.892000] md: kicking
> non-fresh hdg1 from array!
> Nov 13 07:21:07 localhost kernel: [17179584.892000] md: unbind
> Nov 13 07:21:07 localhost kernel: [17179584.892000] md: export_rdev(hdg1)
> Nov 13 07:21:07 localhost kernel: [17179584.896000] raid5: allocated
> 5245kB for md0

... apparently hdc1 is OK? Hmmm.
> Nov 13 07:21:07 localhost kernel: [17179665.524000] ReiserFS: md0:
> found reiserfs format "3.6" with standard journal
> Nov 13 07:21:07 localhost kernel: [17179676.136000] ReiserFS: md0:
> using ordered data mode
> Nov 13 07:21:07 localhost kernel: [17179676.164000] ReiserFS: md0:
> journal params: device md0, size 8192, journal first block 18, max
> trans len 1024, max batch 900, max commit age 30, max trans age 30
> Nov 13 07:21:07 localhost kernel: [17179676.164000] ReiserFS: md0:
> checking transaction log (md0)
> Nov 13 07:21:07 localhost kernel: [17179676.828000] ReiserFS: md0:
> replayed 7 transactions in 1 seconds
> Nov 13 07:21:07 localhost kernel: [17179677.012000] ReiserFS: md0:
> Using r5 hash to sort names
> Nov 13 07:21:09 localhost kernel: [17179682.064000] lost page write
> due to I/O error on md0

Reiser tries to mount/replay itself relying on hdc1 (which is partly bad)

> Nov 13 07:25:39 localhost kernel: [17179584.828000] md: raid5
> personality registered as nr 4
> Nov 13 07:25:39 localhost kernel: [17179585.708000] md: kicking
> non-fresh hdg1 from array!

Another reboot...

> Nov 13 07:25:40 localhost kernel: [17179666.064000] ReiserFS: md0:
> found reiserfs format "3.6" with standard journal
> Nov 13 07:25:40 localhost kernel: [17179676.904000] ReiserFS: md0:
> using ordered data mode
> Nov 13 07:25:40 localhost kernel: [17179676.928000] ReiserFS: md0:
> journal params: device md0, size 8192, journal first block 18, max
> trans len 1024, max batch 900, max commit age 30, max trans age 30
> Nov 13 07:25:40 localhost kernel: [17179676.932000] ReiserFS: md0:
> checking transaction log (md0)
> Nov 13 07:25:40 localhost kernel: [17179677.08] ReiserFS: md0:
> Using r5 hash to sort names
> Nov 13 07:25:42 localhost kernel: [17179683.128000] lost page write
> due to I/O error on md0

Reiser tries again...
> Nov 13 07:26:57 localhost kernel: [17179757.524000] md: unbind
> Nov 13 07:26:57 localhost kernel: [17179757.524000] md: export_rdev(hdc1)
> Nov 13 07:27:03 localhost kernel: [17
Fwd: RAID5 Recovery
Thanks for taking a look, David.

Kernel: 2.6.15-27-k7, stock for Ubuntu 6.06 LTS
mdadm: mdadm - v1.12.0 - 14 June 2005

You're right, earlier in /var/log/messages there's a notice that hdg dropped; I missed it before. I use mdadm --monitor, but I recently changed the target email address - I guess the change didn't take properly.

As for replacing hdc, thanks for the diagnosis, but it won't help: the drive is actually fine, as is hdg. I've replaced hdc before, only to have the brand-new hdc show the same behaviour, and SMART says the drive is A-OK. There's something flaky about these PCI IDE controllers. I think it's time for a new system.

Reiserfs recovery-wise: any suggestions? A simple fsck doesn't find a filesystem superblock. Is --rebuild-sb the way to go here?

Thanks,
Neil

On Nov 14, 2007 5:58 AM, David Greaves <[EMAIL PROTECTED]> wrote:
> Neil Cavan wrote:
> > Hello,
> Hi Neil
>
> What kernel version?
> What mdadm version?
>
> > This morning, I woke up to find the array had kicked two disks. This
> > time, though, /proc/mdstat showed one of the failed disks (U_U_U, one
> > of the "_"s) had been marked as a spare - weird, since there are no
> > spare drives in this array. I rebooted, and the array came back in the
> > same state: one failed, one spare. I hot-removed and hot-added the
> > spare drive, which put the array back to where I thought it should be
> > (still U_U_U, but with both "_"s marked as failed). Then I rebooted,
> > and the array began rebuilding on its own. Usually I have to hot-add
> > manually, so that struck me as a little odd, but I gave it no mind and
> > went to work. Without checking the contents of the filesystem. Which
> > turned out not to have been mounted on reboot.
> OK
>
> > Because apparently things went horribly wrong.
> Yep :(
>
> > Do I have any hope of recovering this data? Could rebuilding the
> > reiserfs superblock help if the rebuild managed to corrupt the
> > superblock but not the data?
Re: Fwd: RAID5 Recovery
Neil Cavan wrote:
> Thanks for taking a look, David.

No problem.

> Kernel: 2.6.15-27-k7, stock for Ubuntu 6.06 LTS
>
> mdadm: mdadm - v1.12.0 - 14 June 2005

OK - fairly old then. Not really worth trying to figure out why hdc got re-added when things had gone wrong.

> You're right, earlier in /var/log/messages there's a notice that hdg
> dropped, I missed it before. I use mdadm --monitor, but I recently
> changed the target email address - I guess it didn't take properly.
>
> As for replacing hdc, thanks for the diagnosis but it won't help: the
> drive is actually fine, as is hdg. I've replaced hdc before, only to
> have the brand new hdc show the same behaviour, and SMART says the
> drive is A-OK. There's something flaky about these PCI IDE
> controllers. I think it's new system time.

Any excuse, eh? :)

> Reiserfs recovery-wise: any suggestions? A simple fsck doesn't find a
> file system superblock. Is --rebuild-sb the way to go here?

No idea, sorry. I only ever tried Reiser once, and it failed. It was very hard to recover, so I swapped back to XFS.

Good luck with the fscking.

David
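For the archives: if anyone lands in the same spot, the usual reiserfsck escalation looks roughly like the sketch below. This is not advice from the thread, just the documented tool sequence; both --rebuild-sb and --rebuild-tree can destroy data, so work on an image of the array (paths here are illustrative):

```shell
# Work on the unmounted device, ideally on a dd image of it.
dd if=/dev/md0 of=/backup/md0.img bs=1M   # image first, if space allows
reiserfsck --check /dev/md0               # read-only diagnosis
reiserfsck --rebuild-sb /dev/md0          # only if the superblock is gone
reiserfsck --check /dev/md0               # re-check; it may now ask for...
reiserfsck --rebuild-tree /dev/md0        # ...a tree rebuild: last resort
```

If --check already finds a valid superblock, skip straight to whatever it recommends rather than rebuilding anything.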
Re: Building a new raid6 with bitmap does not clear bits during resync
Neil Brown wrote:
> On Monday November 12, [EMAIL PROTECTED] wrote:
>> Neil Brown wrote:
>>> However there is value in regularly updating the bitmap, so add code
>>> to periodically pause while all pending sync requests complete, then
>>> update the bitmap. Doing this only every few seconds (the same as the
>>> bitmap update time) does not noticeably affect resync performance.
>>
>> I wonder if a minimum time and minimum number of stripes would be
>> better. If a resync is going slowly because it's going over a slow
>> link to iSCSI, nbd, or a box of cheap drives fed off a single USB
>> port, just writing the updated bitmap may represent as much data as
>> has been resynced in the time slice. Not a suggestion, but a request
>> for your thoughts on that.
>
> Thanks for your thoughts. Choosing how often to update the bitmap
> during a sync is certainly not trivial. In different situations,
> different requirements might rule. I chose to base it on time, and
> particularly on the time we already have for "how soon to write back
> clean bits to the bitmap", because it is fairly easy for users to
> understand the implications (if I set the time to 30 seconds, then I
> might have to repeat 30 seconds of resync) and it is already
> configurable (via the "--delay" option to --create --bitmap).

Sounds right; that part of it is pretty user friendly. Presumably if someone has a very slow system and wanted to use bitmaps, they would set --delay relatively large to reduce the cost and still get significant benefits.

> This would affect both normal clean-bit writeback and during-resync
> clean-bit writeback. Hope that clarifies my approach.

Easy to implement and understand is always a strong point, and a user can make an informed decision. Thanks for the discussion.
-- 
bill davidsen <[EMAIL PROTECTED]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
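The --delay knob discussed above is set when the bitmap is created. A sketch of what that looks like on the command line (device names hypothetical; a bitmap can also be added to an existing array with mdadm --grow):

```shell
# Create a raid5 with an internal write-intent bitmap whose clean-bit
# writeback interval - and hence the resync checkpoint interval under
# Neil's change - is 30 seconds instead of the default.
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      --bitmap=internal --delay=30 /dev/sd[abcd]1
```

A slow iSCSI- or USB-backed array would use a larger --delay to amortize the bitmap writes, at the cost of repeating that much resync after a crash.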
Re: Proposal: non-striping RAID4
James Lee wrote:
> From a quick search through this mailing list, it looks like I can
> answer my own question regarding RAID1 --> RAID5 conversion. Instead
> of creating a RAID1 array for the partitions on the two biggest
> drives, it should just create a 2-drive RAID5 (which is identical, but
> can be expanded as with any other RAID5 array).
>
> So it looks like this should work I guess.

I believe what you want to create might be a three-drive raid-5 with one failed drive. That way you can just add a drive when you want.

  mdadm -C -c32 -l5 -n3 -amd /dev/md7 /dev/loop[12] missing

Then you can add another drive:

  mdadm --add /dev/md7 /dev/loop3

The output is at the end of this message.

But in general I think it would be really great to be able to have a format which would do raid-5 or raid-6 over all the available parts of multiple drives, and since there's some similar logic for raid-10 over a selection of drives it is clearly possible. But in terms of the benefit to be gained, unless it falls out of the code and someone feels the desire to do it, I can't see much joy in ever having such a thing.

The feature I would really like to have is raid5e, distributed spare, so head motion is spread over all drives. I don't have time to look at that one either, but it really helps performance under load with small arrays.

-- 
bill davidsen <[EMAIL PROTECTED]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
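The degraded-array trick above is easy to rehearse with loop devices before touching real disks. A sketch, with file and device names made up for illustration (needs root):

```shell
# Backing files and loop devices, so no real disks are at risk.
dd if=/dev/zero of=/tmp/d1 bs=1M count=64
dd if=/dev/zero of=/tmp/d2 bs=1M count=64
dd if=/dev/zero of=/tmp/d3 bs=1M count=64
losetup /dev/loop1 /tmp/d1
losetup /dev/loop2 /tmp/d2
losetup /dev/loop3 /tmp/d3

# Two real members plus the "missing" placeholder: a degraded
# 3-drive raid5 whose usable size already equals two data members.
mdadm -C -c32 -l5 -n3 /dev/md7 /dev/loop1 /dev/loop2 missing

# Later, a third device slots into the hole and triggers a rebuild.
mdadm --add /dev/md7 /dev/loop3
cat /proc/mdstat    # watch the recovery progress
```

The array is not fault-tolerant until the rebuild onto the third member completes, which is exactly the caveat raised later in the thread.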
Re: [stable] [PATCH 000 of 2] md: Fixes for md in 2.6.23
On Tue, Nov 13, 2007 at 10:36:30PM -0700, Dan Williams wrote:
> On Nov 13, 2007 8:43 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> > > Careful, it looks like you cherry picked commit 4ae3f847 "md: raid5:
> > > fix clearing of biofill operations" which ended up misapplied in
> > > Linus' tree. You should either also pick up def6ae26 "md: fix
> > > misapplied patch in raid5.c" or I can resend the original "raid5: fix
> > > clearing of biofill operations."
> > >
> > > The other patch for -stable, "raid5: fix unending write sequence", is
> > > currently in -mm.
> >
> > Hm, I've attached the two patches that I have right now in the -stable
> > tree so far (still have over 100 patches to go, so I might not have
> > gotten to them yet if you have sent them). These were sent to me by
> > Andrew on their way to Linus. If I should drop either one, or add
> > another one, please let me know.
>
> Drop md-raid5-fix-clearing-of-biofill-operations.patch and replace it
> with the attached
> md-raid5-not-raid6-fix-clearing-of-biofill-operations.patch (the
> original sent to Neil).
>
> The critical difference is that the replacement patch touches
> handle_stripe5, not handle_stripe6. Diffing the patches shows the
> changes for hunk #3:
>
> -@@ -2903,6 +2907,13 @@ static void handle_stripe6(struct stripe
> +@@ -2630,6 +2634,13 @@ static void handle_stripe5(struct stripe_head *sh)

Ah, ok, thanks, will do that.

> raid5-fix-unending-write-sequence.patch is in -mm and I believe is
> waiting on an Acked-by from Neil?

I don't see it in Linus's tree yet, so I can't apply it to -stable...

thanks,

greg k-h
Re: PROBLEM: raid5 hangs
Justin Piszcz wrote:
> This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the
> RAID5 bio* patches are applied.

Note below that he's running 2.6.22.3, which doesn't have the bug unless -STABLE added it, so it should not really be in 2.6.22.anything. I assume you're talking about the endless-write or bio issue?

> Justin.
>
> On Wed, 14 Nov 2007, Peter Magnusson wrote:
>> Hey.
>>
>> [1.] One line summary of the problem: raid5 hangs and uses 100% CPU
>>
>> [2.] Full description of the problem/report: I had used 2.6.18 for
>> about 284 days, until my power supply died; no problems whatsoever
>> during that time. After that forced reboot I made these changes: I put
>> in 2 GB more memory, so I have 3 GB instead of 1 GB, and because two
>> disks in the raid5 got bad blocks and I didn't trust them anymore, I
>> bought new disks (I managed to save the raid5). I have 6x300 GB in a
>> raid5. Two of them are now 320 GB, so I created a small raid1 also.
>> The raid5 is encrypted with aes-cbc-plain; the raid1 is encrypted with
>> aes-cbc-essiv:sha256.
>>
>> I compiled linux-2.6.22.3 and started to use that. I used the same
>> .config as in default FC5; I think I just selected P4 CPU and the
>> preemptive kernel type. After 11 or 12 days the computer froze. I
>> wasn't home when it happened and couldn't fix it for about 3 days. All
>> I could do was reboot it, as it wasn't possible to log in remotely or
>> on the console; it did respond to ping, however. After the reboot it
>> rebuilt the raid5. Then it happened again after approximately the same
>> time, 11 or 12 days. I noticed that the md1_raid5 process used 100%
>> CPU the whole time. After the reboot it rebuilt the raid5.
>>
>> I compiled linux-2.6.23. And then... it happened again, after about
>> the same time as before. md1_raid5 used 100% CPU. I also noticed that
>> I wasn't able to save anything in my home directory; it froze during
>> save. I could read from it, however. My home directory isn't on the
>> raid5, but it is encrypted, and it's not on any disk that has to do
>> with raid. This problem didn't happen when I used 2.6.18. Currently I
>> use 2.6.18, as I kinda need the computer stable.
>> After the reboot it rebuilt the raid5.

-- 
bill davidsen <[EMAIL PROTECTED]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
Re: PROBLEM: raid5 hangs
On Wed, 14 Nov 2007, Bill Davidsen wrote:
> Justin Piszcz wrote:
>> This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the
>> RAID5 bio* patches are applied.
>
> Note below he's running 2.6.22.3 which doesn't have the bug unless
> -STABLE added it. So should not really be in 2.6.22.anything. I assume
> you're talking the endless write or bio issue?

The bio issue is the root cause of the bug, yes? -- I am uncertain, but I remember this happening in the past; I thought it was something I was doing (possibly < 2.6.23), so it may have been happening earlier than that, but I am not positive.

Justin.

> On Wed, 14 Nov 2007, Peter Magnusson wrote:
>> Hey.
>>
>> [1.] One line summary of the problem: raid5 hangs and uses 100% CPU
>>
>> [2.] Full description of the problem/report: I had used 2.6.18 for
>> about 284 days, until my power supply died; no problems whatsoever
>> during that time. After that forced reboot I made these changes: I put
>> in 2 GB more memory, so I have 3 GB instead of 1 GB, and because two
>> disks in the raid5 got bad blocks and I didn't trust them anymore, I
>> bought new disks (I managed to save the raid5). I have 6x300 GB in a
>> raid5. Two of them are now 320 GB, so I created a small raid1 also.
>> The raid5 is encrypted with aes-cbc-plain; the raid1 is encrypted with
>> aes-cbc-essiv:sha256.
>>
>> I compiled linux-2.6.22.3 and started to use that. I used the same
>> .config as in default FC5; I think I just selected P4 CPU and the
>> preemptive kernel type. After 11 or 12 days the computer froze. I
>> wasn't home when it happened and couldn't fix it for about 3 days. All
>> I could do was reboot it, as it wasn't possible to log in remotely or
>> on the console; it did respond to ping, however. After the reboot it
>> rebuilt the raid5. Then it happened again after approximately the same
>> time, 11 or 12 days. I noticed that the md1_raid5 process used 100%
>> CPU the whole time. After the reboot it rebuilt the raid5.
>>
>> I compiled linux-2.6.23. And then... it happened again... After about
>> the same time as before. md1_raid5 used 100% CPU.
>> I also noticed that I wasn't able to save anything in my home
>> directory; it froze during save. I could read from it, however. My
>> home directory isn't on the raid5, but it is encrypted, and it's not
>> on any disk that has to do with raid. This problem didn't happen when
>> I used 2.6.18. Currently I use 2.6.18, as I kinda need the computer
>> stable. After the reboot it rebuilt the raid5.
>
> -- 
> bill davidsen <[EMAIL PROTECTED]>
> CTO TMR Associates, Inc
> Doing interesting things with small computers since 1979
Re: PROBLEM: raid5 hangs
On Nov 14, 2007 5:05 PM, Justin Piszcz <[EMAIL PROTECTED]> wrote:
> On Wed, 14 Nov 2007, Bill Davidsen wrote:
> > Justin Piszcz wrote:
> >> This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the
> >> RAID5 bio* patches are applied.
> >
> > Note below he's running 2.6.22.3 which doesn't have the bug unless
> > -STABLE added it. So should not really be in 2.6.22.anything. I assume
> > you're talking the endless write or bio issue?
>
> The bio issue is the root cause of the bug yes?

Not if this is a 2.6.22 issue. Neither of the bugs fixed by "raid5: fix clearing of biofill operations" or "raid5: fix unending write sequence" existed prior to 2.6.23.
Re: Proposal: non-striping RAID4
But creating a 3-drive RAID5 with a missing device for the final two drives wouldn't give me what I'm looking for, as that array would no longer be fault-tolerant. So I think what we'd have on an array of n differently-sized drives is:

- One n-drive RAID5 array.
- One (n-1)-drive RAID5 array.
...
- One 2-drive RAID5 array.
- One non-RAIDed single partition.

All of these except the non-RAIDed partition would then be used as elements in a linear array (which would tolerate the failure of any single drive, as each of its constituent arrays does). This would leave a single non-RAIDed partition which can be used for anything else.

Thinking back over it, one potential issue might be how resync works. If all of the RAID5 arrays become in need of resync at the same time (which is perfectly likely - e.g. if the system is powered down abruptly, a drive is replaced, ...), will the md driver attempt to resync the arrays sequentially or in parallel? If the latter, this is likely to be extremely slow, as it'll be trying to resync multiple arrays on the same drives (and therefore doing huge amounts of seeking, etc.).

The other issue is that it looks like (correct me if I'm wrong here) mdadm doesn't support growing a linear array by increasing the size of its constituent parts (which is what would be required here to expand the entire array when adding a new drive). I don't know how hard this would be to implement (I don't know how data gets arranged in a linear array - does it start with all of the first drive, then the second, and so on, or does it write bits to each?).

Neil: any comments on whether this would be desirable / useful / feasible?

James

PS: and as you say, all of the above could also be done with RAID6 arrays instead of RAID5.

On 14/11/2007, Bill Davidsen <[EMAIL PROTECTED]> wrote:
> James Lee wrote:
> > From a quick search through this mailing list, it looks like I can
> > answer my own question regarding RAID1 --> RAID5 conversion.
> > Instead of creating a RAID1 array for the partitions on the two
> > biggest drives, it should just create a 2-drive RAID5 (which is
> > identical, but can be expanded as with any other RAID5 array).
> >
> > So it looks like this should work I guess.
>
> I believe what you want to create might be a three drive raid-5 with one
> failed drive. That way you can just add a drive when you want.
>
>   mdadm -C -c32 -l5 -n3 -amd /dev/md7 /dev/loop[12] missing
>
> Then you can add another drive:
>
>   mdadm --add /dev/md7 /dev/loop3
>
> The output are at the end of this message.
>
> But in general think it would be really great to be able to have a
> format which would do raid-5 or raid-6 over all the available parts of
> multiple drives, and since there's some similar logic for raid-10 over a
> selection of drives it is clearly possible. But in terms of the benefit
> to be gained, unless it fails out of the code and someone feels the
> desire to do it, I can't see much joy to ever having such a thing.
>
> The feature I would really like to have is raid5e, distributed spare so
> head motion is spread over all drives. Don't have time to look at that
> one, either, but it really helps performance under load with small arrays.
>
> -- 
> bill davidsen <[EMAIL PROTECTED]>
> CTO TMR Associates, Inc
> Doing interesting things with small computers since 1979
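One way to sanity-check the layered layout James describes is to compute its usable capacity: each slice level that spans k >= 2 drives becomes a k-drive RAID5 contributing (k-1) x slice_thickness, and the top slice of the largest drive stays non-RAIDed. A sketch in plain shell, with hypothetical drive sizes:

```shell
#!/bin/sh
# Usable capacity of the nested raid5-over-slices + linear layout.
# Sizes below are hypothetical, in GB, sorted ascending.
set -- 100 200 300 400
n=$#
prev=0
usable=0
i=1
for s in "$@"; do
    thick=$((s - prev))        # thickness of this slice level
    k=$((n - i + 1))           # number of drives spanning it
    if [ "$k" -ge 2 ]; then
        # a k-drive raid5 yields (k-1) data members per slice
        usable=$((usable + (k - 1) * thick))
    fi
    prev=$s
    i=$((i + 1))
done
echo "usable RAID5 capacity: ${usable} GB"
```

For the 100/200/300/400 GB example this prints 600 GB of redundant space, with the top 100 GB of the largest drive left over as plain storage.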
Re: [stable] [PATCH 000 of 2] md: Fixes for md in 2.6.23
On Tuesday November 13, [EMAIL PROTECTED] wrote:
> raid5-fix-unending-write-sequence.patch is in -mm and I believe is
> waiting on an Acked-by from Neil?

It seems to have just been sent on to Linus, so it probably will go in without:

  Acked-By: NeilBrown <[EMAIL PROTECTED]>

I'm beginning to think that I really should sit down and make sure I understand exactly how those STRIPE_OP_ flags are used. They generally make sense, but there seem to be a number of corner cases where they aren't quite handled properly. Maybe they are all found now, or maybe...

NeilBrown
Re: Proposal: non-striping RAID4
On Thursday November 15, [EMAIL PROTECTED] wrote:
> Neil: any comments on whether this would be desirable / useful / feasible?

1/ Having a raid4 variant which arranges the data like 'linear' is something I am planning to do eventually. If your filesystem knows about the geometry of the array, then it can distribute the data across the drives and make up for a lot of the benefits of striping. The big advantage of such an arrangement is that it is trivial to add a drive - just zero it and make it part of the array. No need to re-arrange what is currently there. However, I was not thinking of supporting different-sized devices in such a configuration.

2/ Having an array with redundancy where drives are of different sizes is awkward, primarily because if there was a spare that was not as large as the largest device, you may or may not be able to rebuild in that situation. Certainly I could code up those decisions, but I'm not sure the scenario is worth the complexity. If you have drives of different sizes, use raid0 to combine pairs of smaller ones to match larger ones, and do raid5 across devices that look like the same size.

3/ If you really want to use exactly what you have, you can partition the drives into bits and make a variety of raid5 arrays as you suggest. md will notice, and will resync them in series so that you don't kill performance.

NeilBrown
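Neil's point 2 can be sketched concretely: two smaller drives are concatenated into a raid0 that then serves as one full-sized member of the raid5. Device names below are hypothetical (e.g. two 150 GB partitions standing in for one 300 GB drive):

```shell
# Pair two smaller drives into a raid0 the same size as the big ones...
mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/sde1 /dev/sdf1

# ...then build the raid5 across same-sized members, real and composite.
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/md10
```

Note the trade-off: the raid0 pair fails as a unit if either of its drives fails, so it consumes the array's single-drive fault tolerance just like one physical member would.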