Re: Raid 4 idea!
Hello, ... What do you think, Neil?

I don't know what Neil thinks, but I have never liked the performance implications of RAID-4. Could you say a few words about why 4 rather than 5? My one test with RAID-4 showed the parity drive as a huge bottleneck, and seeing that practice followed theory, I gave up on it.

I think the performance depends on the specific job. In my case level 4 is better than level 5. My system is a download server, so the workload is a lot of reads and some writes. I use 4 RAID4 arrays of 12 units each, and one RAID0 array built from the 4 RAID4s. Why? Let me see:

1. It is easy to start without the parity disk.
1/a Without the parity disk it is faster than raid5/4, so I can load the server with data quickly and easily, and after the upload is done I can generate the 4x1 parity relatively quickly in one pass (a sketch of this is at the end of this post).
1/b If more than one disk fails, it is a little easier to recover than raid5.
2. My workload is mostly reads with a little writing. On raid5, to write even a single bit to the array, all drives need to be read and two disks need to be written afterwards; this takes too long and forces the read processes to wait. On raid4 all the drives still need to be read, but only one of the data drives needs to be written (plus the parity drive). This is a little bit faster for me...
3. And this is the most important: with about 1000 simultaneous downloaders my system has two bottlenecks: a) the drives' seek time, b) the I/O bandwidth. I can balance between these two bottlenecks with the readahead settings. On raid5 the blockdev readahead _reads the parity too_, wasting bandwidth, the drive cache and the cache in memory, but it can seek across N drives. On raid4 all of the readahead is useful data, but I can only use N-1 drives for seeking.
4. In case of very high download traffic that also needs an upload, I can disable the parity to speed up the write process, and after the load falls back to normal I can recreate the parity again. This is a trade-off between performance and redundancy. It is a little dangerous, but it is my choice, and this kind of flexibility is why Linux is so beautiful! :-)
5. With Neil's patch I can use the bitmap too. ;-)
6. The parity drive becomes the bottleneck because it offloads the other drives. On the other hand, if I plan to upgrade the system, I only need to buy a faster parity device! :-)
7/a In an extreme case I can move the parity out of the box using NBD. The NBD server can be faster and/or can store all four parity drives in a more cost-effective way.
7/b Optionally I can set up the NBD server again and silently (slowly) reconstruct the parity using legacy raid1, and I can use a USB mobile rack to move the live parity from a loop device to the new HDD in the rack; then I only need to stop the system to swap the bad disk for the freshly synced parity drive. (I do not use hot-swap at the moment.)

And in return I'll point out that this makes recovery very expensive: read everything while reconstructing, then read everything all over again when making a new parity drive after the repair.

About my idea? Yes, that is right. But! If one drive fails, the parity-disk conversion takes close to the same time as a reconstruction, except that it goes faster and faster as the degraded raid4 array gets closer to a clean raid0 (raid4 without parity). And with one (exactly 4x1) failed drive my system can run at top performance until I replace the old drive with a new one. For the final parity recreation on raid4 I can only point to the mdadm default raid5 creation mechanism, the phantom spare drive!
Neil said this is faster than normal raid5 creation, and he is right! With that mechanism only one disk is writing while all the others are only reading!

Cheers, Janos

--
bill davidsen [EMAIL PROTECTED]
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
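For point 1/a, a minimal mdadm sketch of the start-without-parity workflow, assuming hypothetical /dev/sd[a-e]1 members; in md's raid4 the parity lives on the last slot, so listing it as "missing" keeps the array parity-less until a real disk is added:

# create the raid4 degraded: four data disks, the parity slot left empty
mdadm --create /dev/md0 --level=4 --raid-devices=5 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 missing
# ... load the data at close to raid0 speed ...
# later, add the parity disk; md recovers onto it in one pass,
# reading the data disks and writing only the new member
mdadm /dev/md0 --add /dev/sde1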
(X)FS corruption on 2 SATA disk RAID 1
Hello, list,

I think this is generally a hardware error, but it looks like a software problem too. At this point there is no dirty data in memory!

Cheers, Janos

[EMAIL PROTECTED] /]# cmp -b /dev/sda1 /dev/sdb1
/dev/sda1 /dev/sdb1 differ: byte 68881481729, line 308395510 is 301 M-A 74

[EMAIL PROTECTED] /]# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [faulty]
md10 : active raid1 sdb1[1] sda1[0]
      136729088 blocks [2/2] [UU]
      bitmap: 0/131 pages [0KB], 512KB chunk
unused devices: <none>

[EMAIL PROTECTED] /]# mount
192.168.0.1://NFS/ROOT-BASE/ on / type nfs (rw,hard,rsize=8192,wsize=8192,timeo=5,retrans=0,actimeo=1)
none on /proc type proc (rw,noexec,nosuid,nodev)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)
none on /sys type sysfs (rw)
/dev/ram0 on /mnt/fast type ext2 (rw)
none on /dev/cpuset type cpuset (rw)
/dev/md10 on /mnt/1 type xfs (ro)
[EMAIL PROTECTED] /]#

Cut from the log:

Mar 29 08:14:45 dy-xeon-1 kernel: scsi1 : ata_piix
Mar 29 08:14:45 dy-xeon-1 kernel: Vendor: ATA  Model: WDC WD2000JD-19H  Rev: 08.0
Mar 29 08:14:45 dy-xeon-1 kernel: Type: Direct-Access  ANSI SCSI revision: 05
Mar 29 08:14:45 dy-xeon-1 kernel: Vendor: ATA  Model: WDC WD2000JD-19H  Rev: 08.0
Mar 29 08:14:45 dy-xeon-1 kernel: Type: Direct-Access  ANSI SCSI revision: 05
Mar 29 08:14:45 dy-xeon-1 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Mar 29 08:14:45 dy-xeon-1 kernel: SCSI device sda: drive cache: write back
Mar 29 08:14:45 dy-xeon-1 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Mar 29 08:14:45 dy-xeon-1 kernel: SCSI device sda: drive cache: write back
Mar 29 08:14:45 dy-xeon-1 kernel: sda: sda1 sda2
Mar 29 08:14:45 dy-xeon-1 kernel: sd 0:0:0:0: Attached scsi disk sda
Mar 29 08:14:45 dy-xeon-1 kernel: SCSI device sdb: 390721968 512-byte hdwr sectors (200050 MB)
Mar 29 08:14:45 dy-xeon-1 kernel: SCSI device sdb: drive cache: write back
Mar 29 08:14:45 dy-xeon-1 kernel: SCSI device sdb: 390721968 512-byte hdwr sectors (200050 MB)
Mar 29 08:14:45 dy-xeon-1 kernel: SCSI device sdb: drive cache: write back
Mar 29 08:14:45 dy-xeon-1 kernel: sdb: sdb1 sdb2
Mar 29 08:14:45 dy-xeon-1 kernel: sd 1:0:0:0: Attached scsi disk sdb
Mar 29 08:14:45 dy-xeon-1 kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
Mar 29 08:14:45 dy-xeon-1 kernel: sd 1:0:0:0: Attached scsi generic sg1 type 0

SMART logs:

sda:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b  200   200   051   Pre-fail Always      -       0
  3 Spin_Up_Time            0x0007  130   124   021   Pre-fail Always      -       6025
  4 Start_Stop_Count        0x0032  100   100   040   Old_age  Always      -       97
  5 Reallocated_Sector_Ct   0x0033  200   200   140   Pre-fail Always      -       0
  7 Seek_Error_Rate         0x000b  200   200   051   Pre-fail Always      -       0
  9 Power_On_Hours          0x0032  089   089   000   Old_age  Always      -       8047
 10 Spin_Retry_Count        0x0013  100   253   051   Pre-fail Always      -       0
 11 Calibration_Retry_Count 0x0013  100   253   051   Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032  100   100   000   Old_age  Always      -       97
194 Temperature_Celsius     0x0022  120   111   000   Old_age  Always      -       30
196 Reallocated_Event_Count 0x0032  200   200   000   Old_age  Always      -       0
197 Current_Pending_Sector  0x0012  200   200   000   Old_age  Always      -       0
198 Offline_Uncorrectable   0x0012  200   200   000   Old_age  Always      -       0
199 UDMA_CRC_Error_Count    0x000a  200   253   000   Old_age  Always      -       0
200 Multi_Zone_Error_Rate   0x0009  200   200   051   Pre-fail Offline     -       0

SMART Error Log Version: 1
No Errors Logged

sdb:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b  200   200   051   Pre-fail Always      -       0
  3 Spin_Up_Time            0x0007  127   120   021   Pre-fail Always      -       6175
  4 Start_Stop_Count        0x0032  100   100   040   Old_age  Always      -       94
  5 Reallocated_Sector_Ct   0x0033  200   200   140   Pre-fail Always      -       0
  7 Seek_Error_Rate         0x000b  200   200   051   Pre-fail Always      -       0
  9 Power_On_Hours          0x0032  089   089   000   Old_age  Always      -       8065
 10 Spin_Retry_Count        0x0013  100   253   051   Pre-fail Always      -       0
 11 Calibration_Retry_Count 0x0013  100   253   051   Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032  100   100   000   Old_age  Always      -       94
194 Temperature_Celsius     0x0022  117   109   000   Old_age  Always      -       33
Re: Help Please! mdadm hangs when using nbd or gnbd
- Original Message -
From: Brian Kelly [EMAIL PROTECTED]
To: linux-raid@vger.kernel.org
Sent: Thursday, February 23, 2006 1:25 AM
Subject: Help Please! mdadm hangs when using nbd or gnbd

Hail to the Great Linux RAID Gurus! I humbly seek any assistance you can offer.

I am building a couple of 20 TB logical volumes from six storage nodes, each offering two 8TB raw storage devices built with Broadcom RAIDCore BC4852 SATA cards. Each storage node (called leadstor1-6) needs to publish its two raw devices with iSCSI, nbd or gnbd over a gigabit network, which the head node (leadstor) combines into a RAID 5 volume using mdadm. My problem is that when using nbd or gnbd the original build of the array on the head node quickly halts, as if a deadlock has occurred. I have this problem with RAID 1 and RAID 5 configurations regardless of the size of the storage node published devices. Here's a demonstration with two 4 TB drives being mirrored using nbd:

*** Begin Demonstration ***

[EMAIL PROTECTED] nbd-2.8.3]# uname -a
Linux leadstor.unidata.ucar.edu 2.6.15-1.1831_FC4smp #1 SMP Tue Feb 7 13:51:52 EST 2006 x86_64 x86_64 x86_64 GNU/Linux

I start by preparing the system for nbd and md devices

[EMAIL PROTECTED] ~]# modprobe nbd
[EMAIL PROTECTED] ~]# cd /dev
[EMAIL PROTECTED] dev]# ./MAKEDEV nb
[EMAIL PROTECTED] dev]# ./MAKEDEV md

I then mount two 4TB volumes from leadstor5 and leadstor6

[EMAIL PROTECTED] dev]# cd /opt/nbd-2.8.3
[EMAIL PROTECTED] nbd-2.8.3]# ./nbd-client leadstor5 2002 /dev/nb5
Negotiation: ..size = 3899484160KB
bs=1024, sz=3899484160
[EMAIL PROTECTED] nbd-2.8.3]# ./nbd-client leadstor6 2002 /dev/nb6
Negotiation: ..size = 3899484160KB
bs=1024, sz=3899484160

I confirm the volumes are mounted properly

[EMAIL PROTECTED] nbd-2.8.3]# fdisk -l /dev/nb5
Disk /dev/nb5: 3993.0 GB, 3993071779840 bytes
255 heads, 63 sectors/track, 485463 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/nb5 doesn't contain a valid partition table
[EMAIL PROTECTED] nbd-2.8.3]# fdisk -l /dev/nb6
Disk /dev/nb6: 3993.0 GB, 3993071779840 bytes
255 heads, 63 sectors/track, 485463 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/nb6 doesn't contain a valid partition table

I prepare the drives to be used in mdadm

[EMAIL PROTECTED] nbd-2.8.3]# mdadm -V
mdadm - v1.12.0 - 14 June 2005
[EMAIL PROTECTED] nbd-2.8.3]# mdadm --zero-superblock /dev/nb5
[EMAIL PROTECTED] nbd-2.8.3]# mdadm --zero-superblock /dev/nb6

I create a device to mirror the two volumes

[EMAIL PROTECTED] nbd-2.8.3]# mdadm --create /dev/md2 -l 1 -n 2 /dev/nb5 /dev/nb6
mdadm: array /dev/md2 started.
And watch the progress in /proc/mdstat

[EMAIL PROTECTED] nbd-2.8.3]# date
Wed Feb 22 16:18:55 MST 2006
[EMAIL PROTECTED] nbd-2.8.3]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 nbd6[1] nbd5[0]
      3899484096 blocks [2/2] [UU]
      [>....................]  resync =  0.0% (1408/3899484096) finish=389948.2min speed=156K/sec
md1 : active raid1 sdb3[1] sda3[0]
      78188288 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
      128384 blocks [2/2] [UU]
unused devices: <none>

But no more has been done a minute later

[EMAIL PROTECTED] nbd-2.8.3]# date
Wed Feb 22 16:19:49 MST 2006
[EMAIL PROTECTED] nbd-2.8.3]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 nbd6[1] nbd5[0]
      3899484096 blocks [2/2] [UU]
      [>....................]  resync =  0.0% (1408/3899484096) finish=2599655.1min speed=23K/sec
md1 : active raid1 sdb3[1] sda3[0]
      78188288 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
      128384 blocks [2/2] [UU]
unused devices: <none>

And later still, no more of the resync has been done

[EMAIL PROTECTED] nbd-2.8.3]# date
Wed Feb 22 16:20:38 MST 2006
[EMAIL PROTECTED] nbd-2.8.3]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 nbd6[1] nbd5[0]
      3899484096 blocks [2/2] [UU]
      [>....................]  resync =  0.0% (1408/3899484096) finish=4679379.2min speed=13K/sec
md1 : active raid1 sdb3[1] sda3[0]
      78188288 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
      128384 blocks [2/2] [UU]
unused devices: <none>

At this point the resync is stuck and the system is idle. I have left it overnight, but it progresses no further. 100% of the time this test will stop at 1408 on the rebuild. With other configurations the number will change (for example, it was 1280 for a 6-column RAID 5), but it always halts at the same spot.

Nothing is logged in the system files

[EMAIL PROTECTED] nbd-2.8.3]# tail -15 /var/log/messages
Feb 22 15:48:35 leadstor kernel: parport: PnPBIOS parport detected.
Feb 22 15:48:35 leadstor kernel: parport0: PC-style at 0x378, irq 7 [PCSPP]
Feb 22 15:48:35 leadstor kernel: lp0: using parport0 (interrupt-driven).
Feb 22 15:48:35 leadstor kernel: lp0: console ready
Feb 22 15:48:37 leadstor
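Not an answer to the deadlock itself, but when a resync appears frozen it is worth ruling out ordinary md throttling first; a minimal check, assuming the standard proc tunables:

# md resync throttling, in KB/s per device (defaults: 1000 min / 200000 max)
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max
# raise the floor temporarily and see whether /proc/mdstat moves at all
echo 50000 > /proc/sys/dev/raid/speed_limit_min

If the count stays pinned (as it does here at 1408), throttling is not the problem, and a blocked I/O path such as the nbd write deadlock discussed elsewhere on this list is the more likely explanation.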
Re: raid 4, and bitmap.
Ahh, I almost forgot! mdadm sometimes reports "cannot allocate memory" and on the next try segfaults when I try -G --bitmap=internal on 2TB arrays! And after the segfault the whole raid stops...

Cheers, Janos

I think I found the bug, and it is me. :-) Today it happened again, and I saw that I had mistyped the word internal as "intarnal". mdadm accepted that, tried to create the bitmap as a file on NFS again, and crashed the raid. I think mdadm needs to validate better:
- the filesystem holding the bitmap file
- the filename itself (I mean, do not allow the current directory, and treat "internal" and "none" as separate options)

Cheers, Janos
Re: Raid 4 resize, raid0 limit question
--cut--
I plan to resize (grow) one raid4 array.
1. stop the array.
2. resize the partition on all disks to fit the maximum size.

The approach is currently not supported. It would need a change to mdadm to find the old superblock and relocate it to the new end of the partition. The only currently 'supported' way is to remove devices one at a time, resize them, and add them back in as new devices, waiting for the resync (the one-at-a-time procedure is sketched at the end of this mail).

Good news! :-) This takes about 1 week for me... :-( I should recreate the array instead.

NeilBrown

Neil! What do you think about adding two files in /proc or /sys as two margins for the raid sync? The default values would be 0 and the sector count (or KiB) of the array. The user could set them before the sync starts, or while the sync is running, and when the sync is done the default values would be restored automatically. The sync would only move between the two values. This is easy to write - I think -, not too dangerous, and sometimes (or often) very practical. It would often help me, including this time, for the raid4 resize from 2TB to 3.6TB.

After this, is restarting (assembling) the array possible? I mean, how can the kernel find the superblock that now sits in the middle of the new partitions? Do I need to recreate the array instead of using the -G option? Can I force the raid to resync only the new area?

Does raid0 in 2.6.16-rc1 support 4x 3.6TB source devices? :-) ... maybe?
I think it does, but I cannot promise anything.

Anyway, I will test it on the weekend, and I don't need to grow the FS on it as well. How can I safely test it (and NBD above 2TB), without data loss? Does anybody know a good tool to test the 13.4TB raid0 array with an 8TB live and valuable fs inside, without data loss? I need to test the raid0 and NBD before I resize the FS to fit the array. I can only think of dd with the skip=NN option, but at the moment I don't trust dd enough. :-)

Thanks, Janos

NeilBrown
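For reference, the one-device-at-a-time procedure Neil describes looks roughly like this; a minimal sketch, assuming a hypothetical member /dev/sdb1 of /dev/md0, repeated for each member in turn:

# drop one member, grow its partition, then add it back as a 'new' device
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
# ... repartition /dev/sdb so that sdb1 covers the full disk ...
mdadm /dev/md0 --add /dev/sdb1
# wait for the recovery to finish before touching the next member
cat /proc/mdstat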
Re: raid 4, and bitmap.
- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Friday, February 03, 2006 1:09 AM
Subject: Re: raid 4, and bitmap.

On Friday February 3, [EMAIL PROTECTED] wrote:
Hello, list, Neil,
I tried to add a bitmap to raid4, and mdadm did this fine. /proc/mdstat shows it, and it really works well. But on reboot the kernel drops the bitmap (and resyncs the entire array if it is unclean). :( Is this still incomplete now? (2.6.16-rc1)

Is it an 'internal' bitmap, or is the bitmap in a file?

It is internal. An external bitmap does not work for me, because at boot only the NFS is reachable, and it causes a crash. (Note: this is raid4, not raid5!)

If the bitmap is in a file, you need to make sure that the file is provided by mdadm when the array is assembled - using in-kernel autodetect won't work. If it is an internal bitmap it should work. Are there any kernel messages during boot that might be interesting?

There is something; I will find it in one minute! :-)

NeilBrown
Re: Raid 4 resize, raid0 limit question
- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Friday, February 03, 2006 1:12 AM
Subject: Re: Raid 4 resize, raid0 limit question

On Friday February 3, [EMAIL PROTECTED] wrote:
Hello, list,
I plan to resize (grow) one raid4 array.
1. stop the array.
2. resize the partition on all disks to fit the maximum size.

The approach is currently not supported. It would need a change to mdadm to find the old superblock and relocate it to the new end of the partition. The only currently 'supported' way is to remove devices one at a time, resize them, and add them back in as new devices, waiting for the resync.

Good news! :-) This takes about 1 week for me... :-( I should recreate the array instead.

NeilBrown

After this, is restarting (assembling) the array possible? I mean, how can the kernel find the superblock that now sits in the middle of the new partitions? Do I need to recreate the array instead of using the -G option? Can I force the raid to resync only the new area?

Does raid0 in 2.6.16-rc1 support 4x 3.6TB source devices? :-) ... maybe?
I think it does, but I cannot promise anything.

Anyway, I will test it on the weekend, and I don't need to grow the FS on it as well. How can I safely test it (and NBD above 2TB), without data loss?

Thanks, Janos
NeilBrown
Thanks, Janos
Fw: raid 4, and bitmap.
- Original Message -
From: JaniD++ [EMAIL PROTECTED]
To: Neil Brown [EMAIL PROTECTED]
Sent: Friday, February 03, 2006 1:20 AM
Subject: Re: raid 4, and bitmap.

- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Friday, February 03, 2006 1:09 AM
Subject: Re: raid 4, and bitmap.

On Friday February 3, [EMAIL PROTECTED] wrote:
Hello, list, Neil,
I tried to add a bitmap to raid4, and mdadm did this fine. /proc/mdstat shows it, and it really works well. But on reboot the kernel drops the bitmap (and resyncs the entire array if it is unclean). :( Is this still incomplete now? (2.6.16-rc1)

Is it an 'internal' bitmap, or is the bitmap in a file? If the bitmap is in a file, you need to make sure that the file is provided by mdadm when the array is assembled - using in-kernel autodetect won't work. If it is an internal bitmap it should work. Are there any kernel messages during boot that might be interesting?

Sorry, I did not log this, and I don't want to restart the sync and the system just for this. Anyway, I will add the bitmap back, and we will see at the next crash. I think the message was something like: "bitmap is only supported in raid1. bitmap is removed." But I am not so sure. :(

Cheers, Janos

NeilBrown
Re: where is the spare drive? :-)
- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Thursday, January 12, 2006 4:07 AM
Subject: Re: where is the spare drive? :-)

On Monday January 2, [EMAIL PROTECTED] wrote:
5. The question: Why is sdh2 shown as a spare? The MD array size is correct. And I can really see that all drives are reading and sdh2 is *ONLY* writing.

man mdadm
Towards the end of the CREATE MODE section:

  When creating a RAID5 array, mdadm will automatically create a degraded array with an extra spare drive. This is because building the spare into a degraded array is in general faster than resyncing the parity on a non-degraded, but not clean, array. This feature can be over-ridden with the --force option.

I hope this clarifies the situation.
NeilBrown

Ahh, this had escaped my attention. The mdadm man page (and functionality) is quite large. I think this matters most for letting people overwrite their own data, so I think it is necessary to add a note to the man page warning people about this exception. Anyway, it is a good idea! :-) Thanks for pointing it out to me.

Cheers, Janos
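For completeness, the override Neil quotes is just the --force flag at creation time; a minimal sketch with hypothetical devices:

# build the raid5 non-degraded and resync the parity directly,
# instead of the default degraded-plus-spare recovery
mdadm --create /dev/md0 --level=5 --raid-devices=8 --force /dev/sd[a-h]2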
built in readahead? - chunk size question
Hello, list,

I have found one interesting issue. I use 4 disk nodes with NBD, and the concentrator distributes the load equally thanks to a RAID0 with 32KB chunk size on top. I am currently working on a system upgrade and found one interesting issue, and possibly one bottleneck in the system.

The concentrator shows this with iostat -d -k -x 10 (I have marked the interesting parts with [ ]):

Device: rrqm/s  wrqm/s  r/s    w/s   rsec/s    wsec/s  rkB/s      wkB/s  avgrq-sz avgqu-sz await    svctm %util
nbd0    54.15   0.00    45.85  0.00  6169.83   0.00    3084.92    0.00   134.55   1.43     31.11    7.04  32.27   --node-1
nbd1    58.24   0.00    44.06  0.00  6205.79   0.00    [3102.90]  0.00   140.86   516.74   11490.79 22.70 100.00  --node-2
nbd2    55.84   0.00    44.76  0.00  6159.44   0.00    3079.72    0.00   137.62   1.51     33.73    6.88  30.77
nbd3    55.34   0.00    45.05  0.00  6169.03   0.00    3084.52    0.00   136.92   1.07     23.79    5.72  25.77
md31    0.00    0.00    401.70 0.10  24607.39  1.00    12303.70   0.50   61.25    0.00     0.00     0.00  0.00

The old node-1 shows this:

Device: rrqm/s  wrqm/s  r/s    w/s   rsec/s    wsec/s  rkB/s      wkB/s  avgrq-sz avgqu-sz await svctm %util
hda     140.26  0.80    9.19   3.50  1195.60   34.37   597.80     17.18  96.94    0.20     15.43 11.81 14.99
hdc     133.37  0.00    8.89   3.30  1138.06   26.37   569.03     13.19  95.54    0.17     13.85 11.15 13.59
hde     142.76  1.40    13.99  3.90  1253.95   42.36   626.97     21.18  72.49    0.29     16.31 10.00 17.88
hdi     136.56  0.20    13.19  3.10  1197.20   26.37   598.60     13.19  75.14    0.33     20.12 12.82 20.88
hdk     134.07  0.30    13.89  3.40  1183.62   29.57   591.81     14.79  70.20    0.28     16.30 10.87 18.78
hdm     137.46  0.20    13.39  3.80  1205.99   31.97   603.00     15.98  72.05    0.38     21.98 12.67 21.78
hdo     125.07  0.10    11.69  3.20  1093.31   26.37   546.65     13.19  75.22    0.32     21.54 14.23 21.18
hdq     131.37  1.20    12.49  3.70  1150.85   39.16   575.42     19.58  73.53    0.30     18.77 12.04 19.48
hds     130.97  1.40    13.59  4.10  1155.64   43.96   577.82     21.98  67.84    0.57     32.37 14.80 26.17
sda     148.55  1.30    10.09  3.70  1269.13   39.96   634.57     19.98  94.96    0.30     21.81 14.86 20.48
sdb     131.07  0.10    9.69   3.30  1125.27   27.17   562.64     13.59  88.74    0.18     13.92 11.31 14.69
md0     0.00    0.00    1611.49 5.29 12891.91  42.36   [6445.95]  21.18  8.00     0.00     0.00  0.00  0.00

The new node #2 shows this:

Device: rrqm/s  wrqm/s  r/s     w/s   rsec/s     wsec/s  rkB/s       wkB/s  avgrq-sz avgqu-sz await svctm %util
hda     1377.02 0.00    15.88   0.20  11143.26   1.60    5571.63     0.80   692.92   0.39     24.47 18.76 30.17
hdb     1406.79 0.00    8.59    0.20  11323.08   1.60    5661.54     0.80   1288.18  0.28     32.16 31.48 27.67
hde     1430.77 0.00    8.19    0.20  11511.69   1.60    5755.84     0.80   1372.00  0.27     32.74 29.17 24.48
hdf     1384.42 0.00    6.99    0.20  11130.47   1.60    5565.23     0.80   1547.67  0.40     56.94 54.86 39.46
sda     1489.11 0.00    15.08   0.20  12033.57   1.60    6016.78     0.80   787.40   0.36     23.33 14.38 21.98
sdb     1392.11 0.00    14.39   0.20  11251.95   1.60    5625.97     0.80   771.56   0.39     26.78 16.16 23.58
sdc     1468.33 3.00    14.29   0.40  11860.94   27.17   5930.47     13.59  809.52   0.37     25.24 14.97 21.98
sdd     1498.30 1.50    14.99   0.30  12106.29   14.39   6053.15     7.19   792.99   0.40     26.21 15.82 24.18
sde     1446.55 0.00    13.79   0.20  11683.52   1.60    5841.76     0.80   835.49   0.37     26.36 16.14 22.58
sdf     1510.59 0.00    13.19   0.20  12191.01   1.60    6095.50     0.80   910.81   0.39     28.96 17.39 23.28
sdg     1421.18 0.00    14.69   0.20  11486.91   1.60    5743.46     0.80   771.81   0.35     23.83 15.23 22.68
sdh     4.50    4.50    0.30    0.50  38.36      39.96   19.18       19.98  98.00    0.00     1.25  1.25  0.10
md1     0.00    0.00    15960.54 4.80 127684.32  38.36   [63842.16]  19.18  8.00     0.00     0.00  0.00  0.00

Node-1 (and nodes 3 and 4) each have one raid5 with a 32K chunk size. The new node-2 currently has raid4 with a 1024K chunk size. The NBD serves only 1KB blocks (ethernet network). Currently, for a clean test, the readahead on all nodes is set to 0 on all devices, including md[0-1]!
The question is this: how can 3.1 MB/s of requests on the concentrator generate 6.4 MB/s of reads on node-1 and 63.8 MB/s on node-2, with all readahead set to 0? Do raid4/5 have hardcoded readahead? Or, if the nbd-server fetches one KB, does the raid (or another part of the OS) read the entire chunk?

Thanks, Janos
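One way to narrow this down is to confirm what the kernel actually thinks the readahead is at every layer; a minimal check with blockdev, using the device names from this thread (values are reported in 512-byte sectors):

# on a disk node: the array and a couple of its members
blockdev --getra /dev/md0
blockdev --getra /dev/hda
blockdev --getra /dev/sda
# on the concentrator: an imported nbd device and the raid0 on top
blockdev --getra /dev/nb0
blockdev --getra /dev/md31

If any layer still reports a non-zero value, that layer will keep issuing larger reads than the 1KB nbd requests.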
Re: raid5 read performance
- Original Message -
From: Raz Ben-Jehuda(caro) [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: Linux RAID Mailing List linux-raid@vger.kernel.org
Sent: Tuesday, January 10, 2006 12:25 AM
Subject: Re: raid5 read performance

1. it is not good to use so many disks in one raid. this means that in degraded mode 10 disks would be needed to reconstruct one slice of data.
2. i did not understand what the raid's purpose is.

Yes, I know that. In my system this was the best choice. I have 4 disk nodes containing 4x12 Maxtor 200GB drives (exactly 10xIDE+2xSATA each). The disk nodes serve nbd. The concentrator joins the nodes with sw-raid0. The system is basically free web storage.

3. 10 MB/s is very slow. what sort of disks do u have ?

4x(2xSATA+10xIDE) Maxtor 200GB. The system sometimes has 500-800-1000 downloaders at the same time. Under this load the per-node traffic is only 10MB/s (~100Mbit/s). First I thought it was a sync/async IO problem. At this point I think the bottleneck on the nodes is the PCI-32 bus with 8 HDDs. :(

4. what is the raid stripe size ?

Currently all raid layers have 32KB chunks.

Cheers, Janos

On 1/4/06, JaniD++ [EMAIL PROTECTED] wrote:
- Original Message -
From: Raz Ben-Jehuda(caro) [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: Linux RAID Mailing List linux-raid@vger.kernel.org
Sent: Wednesday, January 04, 2006 2:49 PM
Subject: Re: raid5 read performance

1. do you want the code ?

Yes, if it is difficult to set up. I use 4 big raid5 arrays (4 disk nodes), and the performance is not too good. My standalone disk can do ~50MB/s, but 11 disks in one raid array do only ~150Mbit/s (with linear reads using dd). At this point I think this is my system's pci-bus bottleneck. But under normal use, with random seeks, I am happy if one disk node can do 10MB/s! :-( That's why I am guessing at this...

2. I managed to gain linear perfromance with raid5. it seems that both raid 5 and raid 0 are caching read a head buffers. raid 5 cached small amount of read a head while raid0 did not.

Aham. But... I don't understand... You wrote that RAID5 is slower than RAID0. Is the read-ahead buffering/caching bad for performance?

Cheers, Janos

On 1/4/06, JaniD++ [EMAIL PROTECTED] wrote:
- Original Message -
From: Raz Ben-Jehuda(caro) [EMAIL PROTECTED]
To: Mark Hahn [EMAIL PROTECTED]
Cc: Linux RAID Mailing List linux-raid@vger.kernel.org
Sent: Wednesday, January 04, 2006 9:14 AM
Subject: Re: raid5 read performance

I guess i was not clear enough. i am using raid5 over 3 maxtor disks. the chunk size is 1MB. i mesured the io coming from one disk alone when I READ from it with 1MB buffers , and i know that it is ~32MB/s. I created raid0 over two disks and my throughput grown to 64 MB/s. Doing the same thing with raid5 ended in 32 MB/s. I am using async io since i do not want to wait for several disks when i send an IO. By sending a buffer which is striped aligned i am supposed to have one to one relation between a disk and an io. iostat show that all of the three disks work but not fully.

Hello,
How do you set sync/async io? Please, let me know! :-)

Thanks, Janos

--
Raz

--
Raz
Re: raid5 read performance
- Original Message -
From: Raz Ben-Jehuda(caro) [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Tuesday, January 10, 2006 9:05 PM
Subject: Re: raid5 read performance

NBD for network block device ?

Yes. :-)

why do u use it ?

I need only one big block device. In the beginning I tried almost every tool to transport the block devices to the concentrator, and the best choice (speed and stability) looked like RedHat's GNBD. But GNBD has the same problem as NBD, the old deadlock problem on heavy writes; the only difference is that GNBD hits it more rarely than NBD. A couple of months ago Herbert Xu fixed the NBD deadlock problem (with my help :-), and now the fixed NBD is the best choice! Do you have a better idea? :-) Please let me know!

what type of elevator do you use ?

Elevator? What do you mean exactly? My system's current performance is thanks to good readahead settings on the block devices (in all layers, including nbd).

Cheers, Janos

On 1/10/06, JaniD++ [EMAIL PROTECTED] wrote:
- Original Message -
From: Raz Ben-Jehuda(caro) [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: Linux RAID Mailing List linux-raid@vger.kernel.org
Sent: Tuesday, January 10, 2006 12:25 AM
Subject: Re: raid5 read performance

1. it is not good to use so many disks in one raid. this means that in degraded mode 10 disks would be needed to reconstruct one slice of data.
2. i did not understand what the raid's purpose is.

Yes, I know that. In my system this was the best choice. I have 4 disk nodes containing 4x12 Maxtor 200GB drives (exactly 10xIDE+2xSATA each). The disk nodes serve nbd. The concentrator joins the nodes with sw-raid0. The system is basically free web storage.

3. 10 MB/s is very slow. what sort of disks do u have ?

4x(2xSATA+10xIDE) Maxtor 200GB. The system sometimes has 500-800-1000 downloaders at the same time. Under this load the per-node traffic is only 10MB/s (~100Mbit/s). First I thought it was a sync/async IO problem. At this point I think the bottleneck on the nodes is the PCI-32 bus with 8 HDDs. :(

4. what is the raid stripe size ?

Currently all raid layers have 32KB chunks.

Cheers, Janos

On 1/4/06, JaniD++ [EMAIL PROTECTED] wrote:
- Original Message -
From: Raz Ben-Jehuda(caro) [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: Linux RAID Mailing List linux-raid@vger.kernel.org
Sent: Wednesday, January 04, 2006 2:49 PM
Subject: Re: raid5 read performance

1. do you want the code ?

Yes, if it is difficult to set up. I use 4 big raid5 arrays (4 disk nodes), and the performance is not too good. My standalone disk can do ~50MB/s, but 11 disks in one raid array do only ~150Mbit/s (with linear reads using dd). At this point I think this is my system's pci-bus bottleneck. But under normal use, with random seeks, I am happy if one disk node can do 10MB/s! :-( That's why I am guessing at this...

2. I managed to gain linear perfromance with raid5. it seems that both raid 5 and raid 0 are caching read a head buffers. raid 5 cached small amount of read a head while raid0 did not.

Aham. But... I don't understand... You wrote that RAID5 is slower than RAID0. Is the read-ahead buffering/caching bad for performance?

Cheers, Janos

On 1/4/06, JaniD++ [EMAIL PROTECTED] wrote:
- Original Message -
From: Raz Ben-Jehuda(caro) [EMAIL PROTECTED]
To: Mark Hahn [EMAIL PROTECTED]
Cc: Linux RAID Mailing List linux-raid@vger.kernel.org
Sent: Wednesday, January 04, 2006 9:14 AM
Subject: Re: raid5 read performance

I guess i was not clear enough. i am using raid5 over 3 maxtor disks. the chunk size is 1MB. i mesured the io coming from one disk alone when I READ from it with 1MB buffers , and i know that it is ~32MB/s. I created raid0 over two disks and my throughput grown to 64 MB/s. Doing the same thing with raid5 ended in 32 MB/s. I am using async io since i do not want to wait for several disks when i send an IO. By sending a buffer which is striped aligned i am supposed to have one to one relation between a disk and an io. iostat show that all of the three disks work but not fully.

Hello,
How do you set sync/async io? Please, let me know! :-)

Thanks, Janos

--
Raz

--
Raz
Re: where is the spare drive? :-)
- Original Message -
From: Marc [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]; linux-raid@vger.kernel.org
Sent: Thursday, January 05, 2006 7:16 AM
Subject: Re: where is the spare drive? :-)

On Mon, 2 Jan 2006 00:26:58 +0100, JaniD++ wrote
Hello, list,
I found something interesting when I tried to create a brand new array on brand new drives
snip
5. The question: Why is sdh2 shown as a spare? The MD array size is correct. And I can really see that all drives are reading and sdh2 is *ONLY* writing.

I'm not 100% sure, but from a post by Neil a while ago on the list, the spare device is a temporary construct created during the resync operation. Once the resync is complete it should disappear. You could try searching the list archives for the post - choice of keywords is up to you ;)

Thanks, but I have found the cause of the "bug" already. ;-) If I create a new raid5, it should only do a parity resync, not a spare rebuild! This happens only if I use mdadm; with raidtools it works fine. My problem now is the bitmap. :( Only mdadm supports that...

Cheers, Janos

Regards, Marc
raid-reconf question
Hello, list,

I am trying to test the raidreconf utility on the spare drives in my disk nodes (I want to convert a raid0 from a 32K chunk size to 1M). Why is this happening?

[EMAIL PROTECTED] raid-converter]# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [faulty]
md20 : active raid0 nbd7[3] nbd6[2] nbd5[1] nbd4[0]
      39101696 blocks 32k chunks
unused devices: <none>

[EMAIL PROTECTED] raid-converter]# mdadm -D /dev/md20
/dev/md20:
        Version : 00.90.03
  Creation Time : Thu Dec 29 02:02:45 2005
     Raid Level : raid0
     Array Size : 39101696 (37.29 GiB 40.04 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 20
    Persistence : Superblock is persistent
    Update Time : Thu Dec 29 02:02:45 2005
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
     Chunk Size : 32K
           UUID : 6865dd35:58a41e0c:b0c1a78a:2aa6d02d
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       43      4        0       active sync   /dev/nb4
       1       43      5        1       active sync   /dev/nb5
       2       43      6        2       active sync   /dev/nb6
       3       43      7        3       active sync   /dev/nb7

[EMAIL PROTECTED] raid-converter]# raidstop --configfile=raidtab.old /dev/md20
[EMAIL PROTECTED] raid-converter]# ./raidreconf -o raidtab.old -n raidtab.new -m /dev/md20
Working with device /dev/md20
Parsing raidtab.old
Parsing raidtab.new
Your on-disk array MUST be clean first.
reconfiguration failed
[EMAIL PROTECTED] raid-converter]#

Does anybody have an idea?

Thanks, Janos

File: raidtab.old
raiddev /dev/md20
    raid-level 0
    nr-raid-disks 4
    chunk-size 32
    persistent-superblock 1
    device /dev/nb4
    raid-disk 0
    device /dev/nb5
    raid-disk 1
    device /dev/nb6
    raid-disk 2
    device /dev/nb7
    raid-disk 3

File: raidtab.new
raiddev /dev/md20
    raid-level 0
    nr-raid-disks 4
    chunk-size 1024
    persistent-superblock 1
    device /dev/nb4
    raid-disk 0
    device /dev/nb5
    raid-disk 1
    device /dev/nb6
    raid-disk 2
    device /dev/nb7
    raid-disk 3
Re: RAID5 resync question BUGREPORT!
- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Thursday, December 22, 2005 5:46 AM
Subject: Re: RAID5 resync question BUGREPORT!

On Monday December 19, [EMAIL PROTECTED] wrote:
- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Monday, December 19, 2005 1:57 AM
Subject: Re: RAID5 resync question BUGREPORT!

How big is your array?

     Raid Level : raid5
     Array Size : 1953583360 (1863.08 GiB 2000.47 GB)
    Device Size : 195358336 (186.31 GiB 200.05 GB)

The default bitmap-chunk-size when the bitmap is in a file is 4K; this makes a very large bitmap on a large array.

Hmmm. The bitmap chunks are in the device space rather than the array space. So 4K chunks in 186GiB is 48 million chunks, so 48 million bits. 8*4096 bits per page, so 1490 pages, which is a lot, and maybe a waste, but you should be able to allocate 4.5Meg... But there is a table which holds pointers to these pages. 4 bytes per pointer (8 on a 64bit machine), so 6K or 12K for the table. Allocating anything bigger than 4K can be a problem, so that is presumably the limit you hit. The max the table size should be is 4K, which is 1024 pages (on a 32bit machine), which is 33 million bits. So we shouldn't allow more than 33 million (33554432 actually) chunks. On your array, that would be 5.8K, so 8K chunks should be ok, unless you have a 64bit machine, then 16K chunks. Still that is wasting a lot of space.

My system is currently running on i386, 32-bit. I can see that the 2TB array tends to hit some limits. :-) My first idea was the physical size of the variables (eg: int: 32768, double: 65535, etc...). Did you check that? :-)

Yes, and if I see it correctly, it overflows.

Try a larger bitmap-chunk size, e.g.
  mdadm -G --bitmap-chunk=256 --bitmap=/raid.bm /dev/md0

I think it is still incomplete!

[EMAIL PROTECTED] /]# mdadm -G --bitmap-chunk=256 --bitmap=/raid.bm /dev/md0
mdadm: Warning - bitmaps created on this kernel are not portable between different architectured. Consider upgrading the Linux kernel.
Segmentation fault

Oh dear. There should have been an 'oops' message in the kernel logs. Can you post it?

Yes, you are right! If I think correctly, the problem is the live bitmap file on NFS. :-) (I am a really good tester! :-D)

Dec 19 10:58:37 st-0001 kernel: md0: bitmap file is out of date (0 82198273) -- forcing full recovery
Dec 19 10:58:37 st-0001 kernel: md0: bitmap file is out of date, doing full recovery
Dec 19 10:58:37 st-0001 kernel: Unable to handle kernel NULL pointer dereference at virtual address 0078
Dec 19 10:58:38 st-0001 kernel: printing eip:
Dec 19 10:58:38 st-0001 kernel: c0213524
Dec 19 10:58:38 st-0001 kernel: *pde =
Dec 19 10:58:38 st-0001 kernel: Oops: [#1]
Dec 19 10:58:38 st-0001 kernel: SMP
Dec 19 10:58:38 st-0001 kernel: Modules linked in: netconsole
Dec 19 10:58:38 st-0001 kernel: CPU:    0
Dec 19 10:58:38 st-0001 kernel: EIP:    0060:[c0213524]    Not tainted VLI
Dec 19 10:58:38 st-0001 kernel: EFLAGS: 00010292 (2.6.14.2-NBDFIX)
Dec 19 10:58:38 st-0001 kernel: EIP is at nfs_flush_incompatible+0xf/0x8d
Dec 19 10:58:38 st-0001
Dec 19 10:58:38 st-0001 kernel: eax: 00000000 ebx: 0f00 ecx: 00000000 edx: 0282
Dec 19 10:58:38 st-0001 kernel: esi: 0001 edi: c1fcaf40 ebp: f7dc7500 esp: e2281d7c
Dec 19 10:58:38 st-0001 kernel: ds: 007b es: 007b ss: 0068
Dec 19 10:58:38 st-0001 kernel: Process mdadm (pid: 30771, threadinfo=e228 task=f6f28540)
Dec 19 10:58:38 st-0001 kernel: Stack: 0282 c014fd3f c1fcaf40 0060 0f00 0001 c1fcaf40
Dec 19 10:58:38 st-0001 kernel:        f7dc7500 c04607e1 c1fcaf40 1000 c1fcaf40 0f00
Dec 19 10:58:38 st-0001 kernel:        c1fcaf40 ffaa6000 c04619a7 f7dc7500 c1fcaf40 0001
Dec 19 10:58:38 st-0001 kernel: Call Trace:
Dec 19 10:58:38 st-0001 kernel: [c014fd3f] page_address+0x8e/0x94
Dec 19 10:58:38 st-0001 kernel: [c04607e1] write_page+0x5b/0x15d
Dec 19 10:58:38 st-0001 kernel: [c04619a7] bitmap_init_from_disk+0x3eb/0x4df
Dec 19 10:58:38 st-0001 kernel: [c0462b79] bitmap_create+0x1dc/0x2d3
Dec 19 10:58:38 st-0001 kernel: [c045d579] set_bitmap_file+0x68/0x19f
Dec 19 10:58:38 st-0001 kernel: [c045e0f6] md_ioctl+0x456/0x678
Dec 19 10:58:38 st-0001 kernel: [c04f7640] rpcauth_lookup_credcache+0xe3/0x1cb
Dec 19 10:58:38 st-0001 kernel: [c04f7781] rpcauth_lookupcred+0x59/0x95
Dec 19 10:58:38 st-0001 kernel: [c020c240] nfs_file_set_open_context+0x29/0x4b
Dec 19 10:58:38 st-0001 kernel: [c03656e8] blkdev_driver_ioctl+0x6b/0x80
Dec 19 10:58:38 st-0001 kernel: [c0365824] blkdev_ioctl+0x127/0x19e
Dec 19 10:58:38 st-0001 kernel: [c016a2fb] block_ioctl+0x2b/0x2f
Dec 19 10:58:38 st-0001 kernel: [c01745ed] do_ioctl+0x2d/0x81
Dec 19 10
Re: RAID0 performance question
- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: Al Boldi [EMAIL PROTECTED]; linux-raid@vger.kernel.org
Sent: Wednesday, December 21, 2005 2:40 AM
Subject: Re: RAID0 performance question

On Sunday December 18, [EMAIL PROTECTED] wrote:
Why doesn't the raid (md) device have a scheduler in sysfs? And if it has a scheduler, where can I tune it?

raid0 doesn't do any scheduling. All it does is take requests from the filesystem, decide which device they should go to (possibly splitting them if needed) and forward them on to the device. That is all.

Can raid0 handle multiple requests at one time?

Yes. But raid0 doesn't exactly 'handle' requests. It 'directs' requests for other devices to 'handle'.

For me, the performance bottleneck is clearly in the RAID0 layer, used exactly as a concentrator to join the 4x2TB into 1x8TB. But it is only software, and I can't believe it is unfixable or untunable.

There is really nothing to tune apart from chunksize. You can tune the way the filesystem/vm accesses the device by setting readahead (readahead on component devices of a raid0 has exactly 0 effect).

First I want to say sorry about the "Neil is not interested" remark in a previous mail... :-( I have already tried all the available options, including readahead on all layers (results in earlier mails) and chunksize. But with these settings I cannot work around it. And the result is incomprehensible to me: the raid0 performance is not equal to one component, not equal to the sum of all components, and not even equal to the slowest component!

You can tune the underlying devices by choosing a scheduler (for a disk drive) or a packet size (for over-the-network devices) or whatever.

The NBD has a scheduler, and this is already tuned for really top performance, and for the components it is really great! :-) (I had planned to set the NBD to 4KB packets, but this is hard because my NICs do not support jumbo packets...)

But there is nothing to tune in raid0. Also, rather than doing measurements on the block devices (/dev/mdX), do measurements on a filesystem created on that device. I have often found that the filesystem goes faster than the block device.

I use XFS, and the two performances are almost equal, depending on the kind of load. But in most cases it is almost equal.

Thanks, Janos

NeilBrown
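As a concrete example of the one knob that does exist below the md layer on the disk nodes, the per-disk I/O scheduler can be inspected and switched at runtime; a minimal sketch, assuming a 2.6 kernel with the usual sysfs layout:

# see which elevators are available and which one is active (in brackets)
cat /sys/block/sda/queue/scheduler
# switch that disk to the deadline elevator
echo deadline > /sys/block/sda/queue/scheduler

md devices themselves have no scheduler entry, which matches Neil's point that raid0 only redirects requests rather than queueing them.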
Re: RAID5 resync question BUGREPORT!
- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Monday, December 19, 2005 1:57 AM
Subject: Re: RAID5 resync question BUGREPORT!

On Thursday November 17, [EMAIL PROTECTED] wrote:
Hello,
Now I am trying the patch.

[EMAIL PROTECTED] root]# mdadm -G --bitmap=/raid.bm /dev/md0
mdadm: Warning - bitmaps created on this kernel are not portable between different architectured. Consider upgrading the Linux kernel.
mdadm: Cannot set bitmap file for /dev/md0: Cannot allocate memory

How big is your array?

     Raid Level : raid5
     Array Size : 1953583360 (1863.08 GiB 2000.47 GB)
    Device Size : 195358336 (186.31 GiB 200.05 GB)

The default bitmap-chunk-size when the bitmap is in a file is 4K; this makes a very large bitmap on a large array.

Yes, and if I see it correctly, it overflows.

Try a larger bitmap-chunk size, e.g.
  mdadm -G --bitmap-chunk=256 --bitmap=/raid.bm /dev/md0

I think it is still incomplete!

[EMAIL PROTECTED] /]# mdadm -G --bitmap-chunk=256 --bitmap=/raid.bm /dev/md0
mdadm: Warning - bitmaps created on this kernel are not portable between different architectured. Consider upgrading the Linux kernel.
Segmentation fault
[EMAIL PROTECTED] /]#

And the raid layer stopped. (The nbd-server stops serving, and cat /proc/mdstat hangs too. I tried sync and echo b > /proc/sysrq-trigger. After a reboot everything is back to normal.) This generates one 96000-byte /raid.bm. (Anyway, I think the --bitmap-chunk option should be generated automatically.)

[EMAIL PROTECTED] root]# mdadm -X /dev/md0

This usage is only appropriate for arrays with internal bitmaps (I should get mdadm to check that..).

Is there a way to check external bitmaps? And now what? :-)

Either create an 'internal' bitmap, or choose a --bitmap-chunk size that is larger.

First you said the space for the internal bitmap is only 64K. My first bitmap file is ~4MB, and with the --bitmap-chunk=256 option it is still 96000 bytes. I don't think it will fit... :-) I am afraid of overwriting existing data.

Cheers, Janos

Thanks for the report.
NeilBrown
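The arithmetic in this thread boils down to picking a --bitmap-chunk large enough that the bitmap's page table stays under one page; the general form looks like the following sketch (with the caveat, shown by the rest of the thread, that kernels of this era could still crash while adding the bitmap):

# add an internal bitmap with a 4096 KiB bitmap chunk
mdadm --grow --bitmap=internal --bitmap-chunk=4096 /dev/md0
# for internal bitmaps, inspect the result on a component device
# (hypothetical member device)
mdadm --examine-bitmap /dev/sda1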
Re: RAID0 performance question
- Original Message -
From: Al Boldi [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Friday, December 02, 2005 8:53 PM
Subject: Re: RAID0 performance question

JaniD++ wrote:
But cat /dev/md31 > /dev/null (RAID0, the sum of the 4 nodes) only makes ~450-490 Mbit/s, and I don't know why. Does somebody have an idea? :-)

Try increasing the read-ahead setting on /dev/md31 using 'blockdev'. Network block devices are likely to have latency issues and would benefit from large read-ahead. Also try a larger chunk-size, ~4mb.

But I don't know exactly what to try: increase or decrease the chunksize? In the top-layer raid (md31, raid0), in the middle-layer raids (md1-4, raid1), or both?

What I found is that raid over nbd is highly max-chunksize dependent, due to nbd running over TCP. But increasing chunksize does not necessarily mean better system utilization. Much depends on your application request size. Tuning performance to maximize cat/dd /dev/md# throughput may only be suitable for a synthetic indication of overall performance in system comparisons.

Yes, you are right! I already know that. ;-) But the bottleneck effect is visible with dd/cat too. (And I am a little bit lazy. :-)

Now I have tried the system with my spare drives, with the bigger chunk size (=4096K on the RAID0 and all the RAID1s), and the slowness is still here. :( The problem is _exactly_ the same as before. I think it is unnecessary to try a smaller chunk size, because 32k is already small for a 2, 5 or 8MB readahead. The problem is somewhere else... :-/

I have got one (or more) question for the raid list! Why doesn't the raid (md) device have a scheduler in sysfs? And if it has a scheduler, where can I tune it? Can raid0 handle multiple requests at one time? For me, the performance bottleneck is clearly in the RAID0 layer, used exactly as a concentrator to join the 4x2TB into 1x8TB. But it is only software, and I can't believe it is unfixable or untunable. ;-)

Cheers, Janos

If your aim is to increase system utilization, then look for a good benchmark specific to your application requirements which would mimic a realistic load.

--
Al
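For the spare-drive chunk-size experiments, the mdadm equivalent of the raidtab shown elsewhere in this thread would look roughly like this sketch (the nb4-nb7 test devices are the ones from the raidreconf mail; --chunk is in KiB, so 4096 is a 4MB chunk):

mdadm --create /dev/md20 --level=0 --chunk=4096 --raid-devices=4 \
      /dev/nb4 /dev/nb5 /dev/nb6 /dev/nb7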
Re: RAID5 resync question BUGREPORT!
Hello, Neil,

[EMAIL PROTECTED] mdadm-2.2]# mdadm --grow /dev/md0 --bitmap=internal
mdadm: Warning - bitmaps created on this kernel are not portable between different architectured. Consider upgrading the Linux kernel.

Dec 8 23:59:45 st-0001 kernel: md0: bitmap file is out of date (0 81015178) -- forcing full recovery
Dec 8 23:59:45 st-0001 kernel: md0: bitmap file is out of date, doing full recovery
Dec 8 23:59:46 st-0001 kernel: md0: bitmap initialized from disk: read 12/12 pages, set 381560 bits, status: 0
Dec 8 23:59:46 st-0001 kernel: created bitmap (187 pages) for device md0

And the system crashed: no ping reply, no netconsole error logging, no panic and no reboot.

Thanks, Janos

- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Tuesday, December 06, 2005 2:05 AM
Subject: Re: RAID5 resync question

On Tuesday December 6, [EMAIL PROTECTED] wrote:
- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Tuesday, December 06, 2005 1:32 AM
Subject: Re: RAID5 resync question

On Tuesday December 6, [EMAIL PROTECTED] wrote:
Hello, list,
Is there a way to force the raid to skip this type of resync?

Why would you want to? The array is 'unclean', presumably due to a system crash. The parity isn't certain to be correct, so your data isn't safe against a device failure. You *want* this resync.

Thanks for the warning. Yes, you are right, the system crashed. I know there is some chance of leaving incorrect parity information on the array, but it may be corrected by the next write.

Or it may not be corrected by the next write. The parity-update algorithm assumes that the parity is correct.

On my system there is very little dirty data, thanks to the vm configuration and *very* frequent flushes. The risk is low, but the time the resync takes is the bigger problem. :-( If I can, I want to break this resync. And the same on a fresh NEW raid5 array. (One possible way: in this situation rebuild the array with a --force-skip-resync option or something similar...)

If you have mdadm 2.2 then you can recreate the array with '--assume-clean', and all your data should still be intact. But if you get corruption one day, don't complain about it - it's your choice. If you are using 2.6.14 or later you can try turning on the write-intent bitmap (mdadm --grow /dev/md0 --bitmap=internal). That may impact write performance a bit (reports on how much would be appreciated) but will make this resync-after-crash much faster.

Hmm. What does this do exactly?

Divides the array into approximately 200,000 sections (all a power of 2 in size) and keeps track (in a bitmap) of which sections might have inconsistent parity. If you crash, it only syncs sections recorded in the bitmap.

Does it change the existing array's structure?

In a forwards/backwards compatible way (it makes use of some otherwise un-used space).

Does it need a resync? :-D

You really should let your array sync this time. Once it is synced, add the bitmap. Then next time you have a crash, the cost will be much smaller.

Is it safe with existing data?

Yes.

What do you think about a full external log?

Too much overhead without specialised hardware.

About using some checkpoints in an external file or device to resync an array? And about better handling of a half-synced array?

I don't know what these mean.

NeilBrown
Re: RAID5 resync question BUGREPORT!
Hi,

After I got this on one of my disk nodes, I immediately sent this letter and went to the hosting company to see if there was any message on the screen. But unfortunately I found nothing: a simple freeze. No message, no ping, no num lock!

The full boot message of the node after the next reboot is here:
http://download.netcenter.hu/bughunt/20051209/boot.log

Next step, I tried to restart the whole system. (The concentrator hung too, caused by losing the st-0001 node.) Part of the next reboot message of the concentrator is here:
http://download.netcenter.hu/bughunt/20051209/dy-boot.log

Next step, I stopped everything to avoid more data loss, and tried to remove the possible bitmap from md0 of node-1 (st-0001). The messages are here:
http://download.netcenter.hu/bughunt/20051209/mdadm.log

At this point I cannot remove the broken bitmap, only deactivate its use. But on the next reboot the node will try to use it again. :( I have tried to change the array to use an external bitmap, but mdadm failed to create that too. The external bitmap file is here (6 MB!):
http://download.netcenter.hu/bughunt/20051209/md0.bitmap
The error message is the same as for the internal bitmap creation.

I don't know exactly what caused the fs damage, but here is my list of suspects (sorted):
1. mdadm (wrong bitmap size)
2. the kernel (wrong resync on startup)
3. the half-written data, caused by the first crash.

One question: on a working array, is the bitmap creation safe and race-free? (I mean a race between the bitmap create and a bitmap update.) My data loss was, in the end, really minimal. :-)

Cheers, Janos

- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Friday, December 09, 2005 12:43 AM
Subject: Re: RAID5 resync question BUGREPORT!

On Friday December 9, [EMAIL PROTECTED] wrote:
Hello, Neil,

[EMAIL PROTECTED] mdadm-2.2]# mdadm --grow /dev/md0 --bitmap=internal
mdadm: Warning - bitmaps created on this kernel are not portable between different architectured. Consider upgrading the Linux kernel.

Dec 8 23:59:45 st-0001 kernel: md0: bitmap file is out of date (0 81015178) -- forcing full recovery
Dec 8 23:59:45 st-0001 kernel: md0: bitmap file is out of date, doing full recovery
Dec 8 23:59:46 st-0001 kernel: md0: bitmap initialized from disk: read 12/12 pages, set 381560 bits, status: 0
Dec 8 23:59:46 st-0001 kernel: created bitmap (187 pages) for device md0

And the system crashed: no ping reply, no netconsole error logging, no panic and no reboot.

Hmmm, that's unfortunate :-(
Exactly what kernel were you running?

NeilBrown
Re: RAID5 resync question
I know there is some chance of leaving incorrect parity information on the array, but it may be corrected by the next write.

Or it may not be corrected by the next write. The parity-update algorithm assumes that the parity is correct.

Hmm. If it works with a parity-update algorithm instead of a parity-rewrite algorithm, you are right. But it works block-based, and if the entire block is written, does the parity become correct again, or not? :-) What is the block size? Is it equal to the chunk-size? Thanks for the warning again!

(One possible way: in this situation rebuild the array with a --force-skip-resync option or something similar...)

If you have mdadm 2.2 then you can recreate the array with '--assume-clean', and all your data should still be intact. But if you get corruption one day, don't complain about it - it's your choice.

Ahh, that's what I want. :-) (But after reading this letter, it looks unnecessary in this case...)

What does this do exactly?

Divides the array into approximately 200,000 sections (all a power of 2 in size) and keeps track (in a bitmap) of which sections might have inconsistent parity. If you crash, it only syncs sections recorded in the bitmap.

Does it change the existing array's structure?

In a forwards/backwards compatible way (it makes use of some otherwise un-used space).

What unused space? In the raid superblock? At the end of the drives or at the end of the array? Does it leave the raid structure unchanged except for the superblocks?

Does it need a resync? :-D

You really should let your array sync this time. Once it is synced, add the bitmap. Then next time you have a crash, the cost will be much smaller.

This looks like a really good idea! With this bitmap, the force-skip-resync is really unnecessary.

About using some checkpoints in an external file or device to resync an array? And about better handling of a half-synced array?

I don't know what these mean.

(A little background: I have written a little stat program, using the /sys/block/#/stat files, to find performance bottlenecks. In the stat files I can see whether the device is reading or writing, and the time needed for each.) Once, while my array was rebuilding one disk (in parallel with the normal workload), I saw that the new drive in the array *only* writes. What I mean by better handling of a half-synced array is this: if a read request comes to the ?%-synced array, and the read falls in the already-synced half, it only needs to read from the *new* device, instead of reading all the others to calculate the data from parity. On a working system this could speed up the rebuild process a little and offload the system somewhat. Or am I on the wrong track? :-)

Cheers, Janos

NeilBrown
Re: RAID5 resync question
One time, while my array was really rebuilding one disk (in parallel with the normal workload), I saw that the new drive in the array *only* writes. What I mean by better handling of a half-synced array is this: if a read request comes to a partially synced array, and the read falls in the already-synced part, it only needs to read from the *new* device, instead of reading all the others to calculate the data from parity. On a working system this could speed up the rebuild process a little and offload the system somewhat. Or am I on the wrong track? :-)

Yes, it would probably be possible to get it to read from the recovering drive once that section had been recovered. I'll put it on my todo list.

If I can add some idea to the world's greatest RAID software, it is my pleasure! :-)

But, Neil! There is still something I cannot understand. (As a preliminary note, I have never read the raid5 code; I cannot program in C or C++, I can only read it a little.) I cannot cleanly understand what you said about the parity updating. If the array is clean, the parity blocks only need to be written (or not?). Why does the RAID code use read-modify-write? I think it is unnecessary to read those blocks: recalculating the parity block in memory is faster than read-modify-write.

And why is the parity space a contiguous area? (If it is...) I think it only needs to be block-based, built from a lot of independent blocks. This could speed up the resync, make it easy to always use checkpoints, and more... And if the parity data is damaged (by a system crash or something similar) and this is impossible to detect, the next write to that block will make the parity correct again.

Cheers,
Janos
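To illustrate why the parity-update (read-modify-write) path has to read before it writes, here is a toy sketch in shell, with small integers standing in for whole chunks and ^ standing in for the byte-wise XOR that the md driver performs. This only illustrates the general RAID-5 arithmetic, not the actual kernel code:

    # A clean 4+1 stripe: parity is the XOR of all data chunks
    d0=5; d1=9; d2=3; d3=7
    p=$(( d0 ^ d1 ^ d2 ^ d3 ))

    # A small write touching only d1: instead of re-reading d0, d2 and d3,
    # md reads the old d1 and the old parity, then computes
    #   p_new = p_old ^ d1_old ^ d1_new
    d1_new=12
    p_new=$(( p ^ d1 ^ d1_new ))

    # The shortcut only gives the right answer if p was correct to begin
    # with -- which is exactly why skipping the initial resync is risky.
    test "$p_new" -eq "$(( d0 ^ d1_new ^ d2 ^ d3 ))" && echo "parity consistent"

When a write covers a whole stripe, the parity can instead be computed purely from the new data in memory, which is the case being described above.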
Re: RAID0 performance question
Hello,

But cat /dev/md31 > /dev/null (RAID0, the sum of 4 nodes) only makes ~450-490 Mbit/s, and I don't know why. Does somebody have an idea? :-)

Try increasing the read-ahead setting on /dev/md31 using 'blockdev'. Network block devices are likely to have latency issues and would benefit from large read-ahead. Also try a larger chunk size, ~4 MB.

Ahh, that is what I can't do. :-( I don't know how to back up 8 TB! ;-)

Maybe you could use your mirror!?

I have one idea! :-) I can use the spare drives in the disk nodes! :-) But I don't know exactly what to try: increase or decrease the chunk size? In the top-layer RAID (md31, RAID0), in the middle-layer RAIDs (md1-4, RAID1), or both?

Can somebody help me find the source of the performance problem?

Thanks,
Janos
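The read-ahead suggestion translates into a couple of blockdev calls. A sketch; the value of 8192 sectors (4 MB) is just an illustrative starting point to experiment with:

    # Show the current read-ahead of the top-level RAID0 device, in 512-byte sectors
    blockdev --getra /dev/md31

    # Raise it, e.g. to 8192 sectors (4 MB), then repeat the read test
    blockdev --setra 8192 /dev/md31
    cat /dev/md31 > /dev/null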
Re: RAID0 performance question
Hello, Raz,

I think this is not a CPU usage problem. :-) The system is divided into 4 cpusets, and each cpuset serves only one disk node (CPU0-nb0, CPU1-nb1, ...). This top output was taken while running cat /dev/md31 (RAID0).

Thanks,
Janos

 17:16:01 up 14:19, 4 users, load average: 7.74, 5.03, 4.20
 305 processes: 301 sleeping, 4 running, 0 zombie, 0 stopped
 CPU0 states: 33.1% user  47.0% system  0.0% nice  0.0% iowait  18.0% idle
 CPU1 states: 21.0% user  52.0% system  0.0% nice  6.0% iowait  19.0% idle
 CPU2 states:  2.0% user  74.0% system  0.0% nice  3.0% iowait  18.0% idle
 CPU3 states: 10.0% user  57.0% system  0.0% nice  5.0% iowait  26.0% idle
 Mem: 4149412k av, 3961084k used, 188328k free, 0k shrd, 557032k buff
      911068k active, 2881680k inactive
 Swap: 0k av, 0k used, 0k free, 2779388k cached

   PID USER  PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM  TIME CPU COMMAND
  2410 root    0 -19  1584 10836       S    48.3  0.0 21:57   3 nbd-client
 16191 root   25   0  4832   820   664 R    48.3  0.0  3:04   0 grep
  2408 root    0 -19  1588 11236       S    47.3  0.0 24:05   2 nbd-client
  2406 root    0 -19  1584 10836       S    40.8  0.0 22:56   1 nbd-client
 18126 root   18   0  5780  1604   508 D    38.0  0.0  0:12   1 dd
  2404 root    0 -19  1588 11236       S    36.2  0.0 22:56   0 nbd-client
   294 root   15   0     0     0     0 SW    7.4  0.0  3:22   1 kswapd0
  2284 root   16   0 13500  5376  3040 S     7.4  0.1  8:53   2 httpd
 18307 root   16   0  6320  2232  1432 S     4.6  0.0  0:00   2 sendmail
 16789 root   16   0  5472  1552   952 R     3.7  0.0  0:03   3 top
  2431 root   10  -5     0     0     0 SW    2.7  0.0  7:32   2 md2_raid1
 29076 root   17   0  4776   772   680 S     2.7  0.0  1:09   3 xfs_fsr
  6955 root   15   0  1588 10836       S     2.7  0.0  0:56   2 nbd-client

----- Original Message -----
From: Raz Ben-Jehuda(caro) [EMAIL PROTECTED]
To: JaniD++ [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Saturday, November 26, 2005 4:56 PM
Subject: Re: RAID0 performance question

Look at the CPU consumption.

On 11/26/05, JaniD++ [EMAIL PROTECTED] wrote:

Hello list,

I have been searching for the bottleneck of my system and found something I cannot cleanly understand. I use NBD with 4 disk nodes (the raidtab is at the bottom of this mail).

cat /dev/nb# > /dev/null makes ~350 Mbit/s on each node. Reading /dev/nb0 + nb1 + nb2 + nb3 at the same time, in parallel, makes ~780-800 Mbit/s - I think this is my network bottleneck. But cat /dev/md31 > /dev/null (RAID0, the sum of the 4 nodes) only makes ~450-490 Mbit/s, and I don't know why. Does somebody have an idea? :-)

(nb31, nb30, nb29 and nb28 are only the possible mirrors.)

Thanks,
Janos

raiddev /dev/md1
    raid-level 1
    nr-raid-disks 2
    chunk-size 32
    persistent-superblock 1
    device /dev/nb0
    raid-disk 0
    device /dev/nb31
    raid-disk 1
    failed-disk /dev/nb31

raiddev /dev/md2
    raid-level 1
    nr-raid-disks 2
    chunk-size 32
    persistent-superblock 1
    device /dev/nb1
    raid-disk 0
    device /dev/nb30
    raid-disk 1
    failed-disk /dev/nb30

raiddev /dev/md3
    raid-level 1
    nr-raid-disks 2
    chunk-size 32
    persistent-superblock 1
    device /dev/nb2
    raid-disk 0
    device /dev/nb29
    raid-disk 1
    failed-disk /dev/nb29

raiddev /dev/md4
    raid-level 1
    nr-raid-disks 2
    chunk-size 32
    persistent-superblock 1
    device /dev/nb3
    raid-disk 0
    device /dev/nb28
    raid-disk 1
    failed-disk /dev/nb28

raiddev /dev/md31
    raid-level 0
    nr-raid-disks 4
    chunk-size 32
    persistent-superblock 1
    device /dev/md1
    raid-disk 0
    device /dev/md2
    raid-disk 1
    device /dev/md3
    raid-disk 2
    device /dev/md4
    raid-disk 3

--
Raz
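One way to narrow down where the ~450-490 Mbit/s figure comes from is to repeat the two measurements described above in a reproducible form. This is only a sketch: the device names and sizes follow the raidtab above, and dd is used instead of cat so that byte counts and rates are reported:

    # Read each NBD device in parallel - the case that reached ~780-800 Mbit/s
    for dev in /dev/nb0 /dev/nb1 /dev/nb2 /dev/nb3; do
        dd if=$dev of=/dev/null bs=1M count=4096 &
    done
    wait

    # Read the stacked RAID0 device - the ~450-490 Mbit/s case
    dd if=/dev/md31 of=/dev/null bs=1M count=4096

    # If the RAID0 run stays much slower than the parallel run, the loss is
    # in the md / read-ahead layer rather than the network, which points
    # back at the blockdev --setra tuning suggested earlier.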