Re: staggered stripes
Russell Coker posted on Thu, 15 May 2014 19:00:10 +1000 as excerpted:

> http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.html
>
> Page 13 of the above paper says:
>
> # Figure 12 [...] We see in the figure that many disks develop
> # corruption for a specific set of block numbers. [T]herefore,
> # RAID system designers may be well-advised to use staggered
> # stripes such that the blocks that form a stripe (providing
> # the required redundancy) are placed at different block numbers
> # on different disks.
>
> Does the BTRFS RAID functionality do such staggered stripes?  If not
> could it be added?

AFAIK nothing like that yet, but it's reasonably likely to be implemented
later.  N-way-mirroring is roadmapped for next up after raid56 completion,
however.

You do mention the partition alternative, but not as I'd do it for such a
case.  Instead of doing a different sized buffer partition (or using the
mkfs.btrfs option to start at some offset into the device) on each device,
I'd simply do multiple partitions and reorder them on each device.

Tho N-way-mirroring would sure help here too, since if a given area around
the same address is assumed to be weak on each device, I'd sure like
greater than the current 2-way-mirroring, even if I had a different
filesystem/partition at that spot on each one.  With only two-way-mirroring,
if one copy is assumed to be weak, guess what, you're down to only one
reasonably reliable copy, and that's not a good spot to be in if that one
copy happens to be hit by a cosmic ray or otherwise fails checksum, without
another reliable copy to fix it, since the other copy is in the weak area
already.

Another alternative would be using something like mdraid's raid10 far
layout, with btrfs on top of that...

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.
  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
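Duncan's raid10-far suggestion can be illustrated with a toy mapping. This is a hypothetical, simplified model of mdraid's "far 2" (f2) layout, not mdraid's actual address math; the function name and chunk-based offsets are invented for the sketch:

```python
# Simplified model of mdraid raid10 "far 2" (f2) on n disks: the first
# copy of chunk c is striped normally across the first half of each
# disk; the second copy lives in the far half, with the stripe rotated
# by one device.  Offsets are in chunks.  Illustrative sketch only.

def raid10_far2(chunk, n_disks, half_size):
    stripe, member = divmod(chunk, n_disks)
    near = (member, stripe)                              # (disk, offset)
    far = ((member + 1) % n_disks, half_size + stripe)   # rotated, far half
    return near, far

# The two copies of any chunk land on different disks AND at very
# different on-disk offsets, so a weak zone at the same address range
# on every disk of one model can't take out both copies.
for c in range(32):
    (d1, o1), (d2, o2) = raid10_far2(c, n_disks=4, half_size=1000)
    assert d1 != d2 and o1 != o2
```

This is why the far layout addresses the paper's correlated-block-number failures even without filesystem-level staggering.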
Re: staggered stripes
On Thu, May 15, 2014 at 07:00:10PM +1000, Russell Coker wrote:

> http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.html
>
> Page 13 of the above paper says:
>
> # Figure 12 presents for each block number, the number of disk drives
> # of disk model 'E-1' that developed a checksum mismatch at that block
> # number.  We see in the figure that many disks develop corruption for
> # a specific set of block numbers.  We also verified that (i) other
> # disk models did not develop multiple checksum mismatches for the same
> # set of block numbers (ii) the disks that developed mismatches at the
> # same block numbers belong to different storage systems, and (iii) our
> # software stack has no specific data structure that is placed at the
> # block numbers of interest.
> #
> # These observations indicate that hardware or firmware bugs that
> # affect specific sets of block numbers might exist.  Therefore, RAID
> # system designers may be well-advised to use staggered stripes such
> # that the blocks that form a stripe (providing the required
> # redundancy) are placed at different block numbers on different disks.
>
> Does the BTRFS RAID functionality do such staggered stripes?  If not
> could it be added?

Yes, it could, by simply shifting around the chunk locations at
allocation time.  I'm working in this area at the moment, and I think it
should be feasible within the scope of what I'm doing.  I'll add it to my
list of things to look at.

Hugo.

> I guess there's nothing stopping a sysadmin from allocating an unused
> partition at the start of each disk and using a different size for each
> disk.  But I think it would be best to do this inside the filesystem.
>
> Also this is another reason for having DUP+RAID-1.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
       --- If you're not part of the solution, you're part ---
                         of the precipitate.
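The allocation-time shift Hugo describes can be sketched with a toy model. This is purely illustrative arithmetic, not the actual btrfs chunk allocator; the function name and per-device stagger parameter are invented for the sketch:

```python
# Toy model of "staggered stripes": when allocating stripe s across n
# devices, give each device its own fixed stagger so that the members
# of one stripe sit at different physical chunk numbers on different
# disks.  Illustrative only -- not btrfs's real allocator.

def stripe_members(stripe, n_devices, stagger=1):
    # device d places its member of stripe s at physical chunk
    # s + d*stagger (wraparound at the device end omitted for brevity)
    return [(d, stripe + d * stagger) for d in range(n_devices)]

# Without staggering, every member of stripe s lives at the same
# physical chunk number on every disk -- exactly the correlated-failure
# pattern the FAST'08 paper warns about.
same = stripe_members(7, 4, stagger=0)
assert len({off for _, off in same}) == 1

# With staggering, the members of one stripe land at distinct physical
# chunk numbers, so a model-specific weak block range can hit at most
# one member of any given stripe.
staggered = stripe_members(7, 4, stagger=1)
assert len({off for _, off in staggered}) == 4
```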
Re: staggered stripes
On Thu, 15 May 2014 09:31:42 Duncan wrote:

> > Does the BTRFS RAID functionality do such staggered stripes?  If not
> > could it be added?
>
> AFAIK nothing like that yet, but it's reasonably likely to be
> implemented later.  N-way-mirroring is roadmapped for next up after
> raid56 completion, however.

It's RAID-5/6 where we really need such staggering.  It's a reasonably
common configuration choice to use two different brands of disk for a
RAID-1 array.  As the correlation between parts of the disks with errors
only applied to disks of the same make and model (and this is expected
due to firmware/manufacturing issues), the people who care about such
things on RAID-1 have probably already dealt with the issue.

> You do mention the partition alternative, but not as I'd do it for such
> a case.  Instead of doing a different sized buffer partition (or using
> the mkfs.btrfs option to start at some offset into the device) on each
> device, I'd simply do multiple partitions and reorder them on each
> device.

If there are multiple partitions on a device then that will probably make
performance suck.  Also, does BTRFS even allow special treatment of them,
or will it put two copies from a RAID-10 on the same disk?

> Tho N-way-mirroring would sure help here too, since if a given area
> around the same address is assumed to be weak on each device, I'd sure
> like greater than the current 2-way-mirroring, even if I had a
> different filesystem/partition at that spot on each one, since with
> only two-way-mirroring if one copy is assumed to be weak, guess what,
> you're down to only one reasonably reliable copy now, and that's not a
> good spot to be in if that one copy happens to be hit by a cosmic ray
> or otherwise fails checksum, without another reliable copy to fix it
> since that other copy is in the weak area already.
>
> Another alternative would be using something like mdraid's raid10 far
> layout, with btrfs on top of that...

In the copies= option thread Brendan Hide stated that this sort of thing
is planned.
-- 
My Main Blog        http://etbe.coker.com.au/
My Documents Blog   http://doc.coker.com.au/
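Russell's worry about RAID-10 copies landing on one disk can be made concrete with a toy check. This is a hypothetical placement model, not btrfs's actual policy; the device names, `spindle_of` map, and both picker functions are invented for the sketch:

```python
# Toy check of the concern: if each physical disk is split into several
# partitions and each partition is handed to the filesystem as an
# independent "device", a placement policy that only avoids the same
# *device* can still put both RAID-1/RAID-10 copies on one spindle.
# Hypothetical model, not btrfs's actual allocator.

# Map filesystem-visible devices to the spindle they live on.
spindle_of = {"sda1": "sda", "sda2": "sda", "sdb1": "sdb", "sdb2": "sdb"}

def naive_pick(devices):
    # picks the first two distinct device names, which may share a spindle
    return devices[0], devices[1]

copy1, copy2 = naive_pick(["sda1", "sda2", "sdb1", "sdb2"])
assert copy1 != copy2                          # distinct "devices"...
assert spindle_of[copy1] == spindle_of[copy2]  # ...but one spindle!

def spindle_aware_pick(devices):
    # avoid placing both copies on the same physical disk
    first = devices[0]
    second = next(d for d in devices if spindle_of[d] != spindle_of[first])
    return first, second

c1, c2 = spindle_aware_pick(["sda1", "sda2", "sdb1", "sdb2"])
assert spindle_of[c1] != spindle_of[c2]
```

The fix the thread hints at (a smarter allocator) amounts to making placement spindle-aware rather than merely device-aware.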
Re: staggered stripes
On 2014/05/15 04:38 PM, Russell Coker wrote:

> On Thu, 15 May 2014 09:31:42 Duncan wrote:
> > > Does the BTRFS RAID functionality do such staggered stripes?  If
> > > not could it be added?
> >
> > AFAIK nothing like that yet, but it's reasonably likely to be
> > implemented later.  N-way-mirroring is roadmapped for next up after
> > raid56 completion, however.
>
> It's RAID-5/6 where we really need such staggering.  It's a reasonably
> common configuration choice to use two different brands of disk for a
> RAID-1 array.  As the correlation between parts of the disks with
> errors only applied to disks of the same make and model (and this is
> expected due to firmware/manufacturing issues), the people who care
> about such things on RAID-1 have probably already dealt with the issue.
>
> > You do mention the partition alternative, but not as I'd do it for
> > such a case.  Instead of doing a different sized buffer partition (or
> > using the mkfs.btrfs option to start at some offset into the device)
> > on each device, I'd simply do multiple partitions and reorder them on
> > each device.
>
> If there are multiple partitions on a device then that will probably
> make performance suck.  Also does BTRFS even allow special treatment of
> them or will it put two copies from a RAID-10 on the same disk?
I suspect the approach is similar to the following:

  sd[abcd][1234] each configured as LVM PVs
  sda[1234] as an LVM VG
  sdb[2345] as an LVM VG
  sdc[3456] as an LVM VG
  sdd[4567] as an LVM VG
  btrfs across all four VGs

^ Um - the above is ignoring DOS-style partition limitations

> > Tho N-way-mirroring would sure help here too, since if a given area
> > around the same address is assumed to be weak on each device, I'd
> > sure like greater than the current 2-way-mirroring, even if I had a
> > different filesystem/partition at that spot on each one, since with
> > only two-way-mirroring if one copy is assumed to be weak, guess what,
> > you're down to only one reasonably reliable copy now, and that's not
> > a good spot to be in if that one copy happens to be hit by a cosmic
> > ray or otherwise fails checksum, without another reliable copy to fix
> > it since that other copy is in the weak area already.
> >
> > Another alternative would be using something like mdraid's raid10 far
> > layout, with btrfs on top of that...
>
> In the copies= option thread Brendan Hide stated that this sort of
> thing is planned.

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
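Brendan's rotated partition-to-VG assignment can be sketched as a small mapping. The numbering below is illustrative only (and, as he notes, ignores DOS-style partition-count limits); the function name is invented for the sketch:

```python
# Sketch of the rotation described above: each disk contributes a
# *different* window of its partitions to "its" volume group, so equal
# logical offsets in each VG map to different physical regions of each
# disk.  Illustrative numbering only.

disks = ["sda", "sdb", "sdc", "sdd"]

def rotated_vgs(disks, parts_per_vg=4):
    vgs = {}
    for i, disk in enumerate(disks):
        # disk i uses partitions i+1 .. i+parts_per_vg
        vgs[disk] = [f"{disk}{p}" for p in range(i + 1, i + 1 + parts_per_vg)]
    return vgs

vgs = rotated_vgs(disks)
assert vgs["sda"] == ["sda1", "sda2", "sda3", "sda4"]
assert vgs["sdb"] == ["sdb2", "sdb3", "sdb4", "sdb5"]
assert vgs["sdd"] == ["sdd4", "sdd5", "sdd6", "sdd7"]

# Each VG starts at a different partition index on its disk, so chunk N
# of one VG sits at a different physical LBA range than chunk N of
# another -- the staggering effect the paper recommends.
first_indices = {members[0][-1] for members in vgs.values()}
assert first_indices == {"1", "2", "3", "4"}
```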
Re: staggered stripes
On Fri, May 16, 2014 at 12:38:04AM +1000, Russell Coker wrote:

> On Thu, 15 May 2014 09:31:42 Duncan wrote:
> > > Does the BTRFS RAID functionality do such staggered stripes?  If
> > > not could it be added?
> >
> > AFAIK nothing like that yet, but it's reasonably likely to be
> > implemented later.  N-way-mirroring is roadmapped for next up after
> > raid56 completion, however.
>
> It's RAID-5/6 where we really need such staggering.  It's a reasonably
> common configuration choice to use two different brands of disk for a
> RAID-1 array.  As the correlation between parts of the disks with
> errors only applied to disks of the same make and model (and this is
> expected due to firmware/manufacturing issues), the people who care
> about such things on RAID-1 have probably already dealt with the issue.
>
> > You do mention the partition alternative, but not as I'd do it for
> > such a case.  Instead of doing a different sized buffer partition (or
> > using the mkfs.btrfs option to start at some offset into the device)
> > on each device, I'd simply do multiple partitions and reorder them on
> > each device.
>
> If there are multiple partitions on a device then that will probably
> make performance suck.  Also does BTRFS even allow special treatment of
> them or will it put two copies from a RAID-10 on the same disk?

It will do.  However, we should be able to fix that with the new
allocator, if I ever get it finished...

Hugo.

> > Tho N-way-mirroring would sure help here too, since if a given area
> > around the same address is assumed to be weak on each device, I'd
> > sure like greater than the current 2-way-mirroring, even if I had a
> > different filesystem/partition at that spot on each one, since with
> > only two-way-mirroring if one copy is assumed to be weak, guess what,
> > you're down to only one reasonably reliable copy now, and that's not
> > a good spot to be in if that one copy happens to be hit by a cosmic
> > ray or otherwise fails checksum, without another reliable copy to fix
> > it since that other copy is in the weak area already.
> > Another alternative would be using something like mdraid's raid10 far
> > layout, with btrfs on top of that...
>
> In the copies= option thread Brendan Hide stated that this sort of
> thing is planned.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
              --- Stick them with the pointy end. ---
Re: staggered stripes
On Fri, 16 May 2014 00:38:04 +1000 Russell Coker russ...@coker.com.au
wrote:

> > You do mention the partition alternative, but not as I'd do it for
> > such a case.  Instead of doing a different sized buffer partition (or
> > using the mkfs.btrfs option to start at some offset into the device)
> > on each device, I'd simply do multiple partitions and reorder them on
> > each device.
>
> If there are multiple partitions on a device then that will probably
> make performance suck.  Also does BTRFS even allow special treatment of
> them or will it put two copies from a RAID-10 on the same disk?

I try to be brief, omitting the common-sense stuff as readable between
the lines, and people don't...

What I meant is a layout like the one I have now, only with staggered
partitions.  Rather than describe the idea, here's my actual sda layout.
sdb is identical, but would have the same partitions reordered if set up
as discussed here.  These are actually SSDs, so the firmware will be
scrambling and wear-leveling the erase-blocks in any case, but I've long
used the same basic layout on spinning rust too, tweaking it only a bit
over several generations:

# gdisk -l /dev/sda
[...]
Found valid GPT with protective MBR; using GPT.
Disk /dev/sda: 500118192 sectors, 238.5 GiB
[...]
Partitions will be aligned on 2048-sector boundaries
Total free space is 246364781 sectors (117.5 GiB)

Number  Start (sector)  End (sector)   Size        Code  Name
   1              2048          8191   3.0 MiB     EF02  bi0238gcn1+35l0
   2              8192        262143   124.0 MiB   EF00  ef0238gcn1+35l0
   3            262144        786431   256.0 MiB   8300  bt0238gcn1+35l0
   4            786432       2097151   640.0 MiB   8300  lg0238gcn1+35l0
   5           2097152      18874367   8.0 GiB     8300  rt0238gcn1+35l0
   6          18874368      60817407   20.0 GiB    8300  hm0238gcn1+35l0
   7          60817408     111149055   24.0 GiB    8300  pk0238gcn1+35l0
   8         111149056     127926271   8.0 GiB     8300  nr0238gcn1+35l0
   9         127926272     144703487   8.0 GiB     8300  rt0238gcn1+35l1
  10         144703488     186646527   20.0 GiB    8300  hm0238gcn1+35l1
  11         186646528     236978175   24.0 GiB    8300  pk0238gcn1+35l1
  12         236978176     253755391   8.0 GiB     8300  nr0238gcn1+35l1

You will note that the partitioning is GPT, for reliability and
simplicity, even tho my system's standard BIOS.  You'll also note I use
GPT partition naming to keep track of what's what, with the first two
characters denoting partition function (rt=root, hm=home, pk=package,
etc), and the last denoting working copy or backup N.[1]

Partition #1 is BIOS reserved -- that's where grub2 puts its core.  It
starts at the 1 MiB boundary and is 3 MiB, so everything after it is on a
4 MiB boundary minimum.

#2 is EFI reserved, so I don't have to repartition if I upgrade to UEFI
and want to try it.  It starts at 4 MiB and is 124 MiB in size, so it
ends at 128 MiB, and everything after it is at minimum on 128 MiB
boundaries.  Thus the first 128 MiB is special-purpose reserved.

Below that, starting with #3, are my normal partitions, all btrfs, raid1
both data/metadata except for /boot.

#3 is /boot.  Starting at 128 MiB, it is 256 MiB in size, so it ends at
384 MiB.  Unlike my other btrfs, /boot is single-device dup-mode
mixed-bg, with its primary backup on the partner hardware device (sda3
and sdb3, working /boot and primary /boot backup).
This is because it's FAR easier to simply point the grub on each device
at its own /boot partition, using the BIOS boot-device selector to decide
which one to boot, than it is to dynamically tell grub to use a different
/boot at boot-time (tho unlike with grub1, with grub2 it's actually
possible due to grub rescue mode).  Btrfs dup-mode mixed-bg effectively
means I have only half capacity, 128 MiB, but that's enough for /boot.

#4 is /var/log.  Starting at 384 MiB, it is 640 MiB in size (per device),
so it ends at the 1 GiB boundary, and all partitions beyond it are whole
GiB in size, so begin and end on whole-GiB boundaries.  As it's under a
GiB per device, it's btrfs mixed-bg mode, not separate data/metadata, and
btrfs raid1.  Unlike my other btrfs, log has no independent backup copy,
as I don't find a backup of /var/log particularly useful.  But like the
others, with the exception of /boot and its backup, it's btrfs raid1, so
losing a device doesn't mean losing the logs.

I'd probably leave the partitions thru #4 as-is, since they're sub-GiB
and end on a GiB boundary.  If /var/log happens to be on a weak part of
the device, oh, well, I'll take the loss.  /boot is independent, with the
backup written far less than the working copy anyway, so if that's a weak
spot, the working copy should go out first, with plenty of warning before
the [...]

The next 8 partitions are split into two sets of four.  All are btrfs
raid1 mode for both data and metadata.

#5 is root (/).  It's 8 GiB and contains very nearly everything that the
package manager [...]