On Tue, Nov 19, 2013 at 11:16:58PM +0000, Duncan wrote:
> Hugo Mills posted on Tue, 19 Nov 2013 09:06:02 +0000 as excerpted:
> 
> > This will happen with RAID-10. The allocator will write stripes as wide
> > as it can: in this case, the first stripes will run across all 8
> > devices, until the SSDs are full, and then will write across the
> > remaining 4 devices.
> 
> Hugo, it doesn't change the outcome for this case, but either your 
> assertion above is incorrect, or the wiki discussion is incorrect (of 
> course, or possibly I'm the one misunderstanding something, in which case 
> hopefully replies to this will correct my understanding).
> 
> Because I distinctly recall reading on the wiki that for raid, regardless 
> of the raid level, btrfs always allocates in pairs (well, I guess it'd be 
> pairs of pairs for raid10 mode, and I believe that statement pre-dated 
> raid5/6 support so that isn't included).  I was actually shocked by that 
> because while I knew that was the case for raid1, I had thought that 
> other raid levels would stripe as widely as possible, which is what you 
> assert above as well.

   The wiki is incorrect on that point. I used to think that, a few
years ago, and it got into at least one piece of documentation as a
result. Once I worked out the actual behaviour, I did try to correct
it (I definitely remember fixing the sysadmin guide this way). For
striped levels (RAID-0, 10, 5, 6), the FS will make each stripe as
wide as possible -- for RAID-10, this means an even number of devices;
for the others, this is all the devices with free space, down to a
RAID-level-dependent minimum.

RAID-0:  min 2 devices
RAID-10: min 4 devices
RAID-5:  min 2 devices (I think)
RAID-6:  min 3 devices (I think)
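
   As a rough sketch of that rule (Python, purely illustrative: the
names chunk_width and MIN_DEVICES are made up here, and this is a
simplified model rather than the kernel's allocator code):

```python
# Minimum devices per profile, copied from the list above. The RAID-5
# and RAID-6 values are the "I think" figures from this mail.
MIN_DEVICES = {"raid0": 2, "raid10": 4, "raid5": 2, "raid6": 3}


def chunk_width(level, free_space):
    """Return how many devices the next chunk would span.

    free_space is a list of per-device free space (any unit).
    Returns 0 if too few devices have space left for this level.
    """
    usable = sum(1 for f in free_space if f > 0)
    if level == "raid10":
        usable -= usable % 2  # RAID-10 needs an even number of devices
    return usable if usable >= MIN_DEVICES[level] else 0


# Three drives with space: RAID-0 stripes across all three.
print(chunk_width("raid0", [1000, 1000, 1000]))
# Seven drives with space: RAID-10 rounds down to six.
print(chunk_width("raid10", [1000] * 7))
```

The point is only that the width is derived from the number of
devices with free space, not fixed at two.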

> Now I just have to find where I read that on the wiki...
> 
> OK, here's one spot, FAQ, md-raid/device-mapper-raid/btrfs-raid 
> differences, btrfs:
> 
> https://btrfs.wiki.kernel.org/index.php/FAQ#btrfs
> 
> >>>>
> 
> btrfs combines all the drives into a storage pool first, and then 
> duplicates the chunks as file data is created. RAID-1 is defined 
> currently as "2 copies of all the data on different disks". This differs 
> from MD-RAID and dmraid, in that those make exactly n copies for n disks. 
> In a btrfs RAID-1 on 3 1TB drives we get 1.5TB of usable data. Because 
> each block is only copied to 2 drives, writing a given block only 
> requires exactly 2 drives spin up, reading requires only 1 drive to 
> spinup.

   This is correct.

> RAID-0 is similarly defined, with the stripe split among exactly 2 disks. 
> 3 1TB drives yield 3TB usable space, but to read a given stripe only 
> requires 2 disks.

   This is definitely wrong. RAID-0 will use all 3 drives for each
stripe.

> RAID-10 is built on top of these definitions. Every stripe is split 
> across to exactly 2 RAID1 sets and those RAID1 sets are written to 
> exactly 2 disks (hence 4 disk minimum). A btrfs raid-10 volume with 6 1TB 
> drives will yield 3TB usable space with 2 copies of all data, but only 4

   This is also wrong. You will get 3 TB of usable space out of
6 × 1 TB drives, but the individual stripes will be 3 drives wide.
You would have the same behaviour (2 copies of 3 stripes wide) on a
7-device array, since RAID-10 uses an even number of devices for each
allocation.
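
   The arithmetic for those two cases, as a small Python sketch (the
function name raid10_layout is hypothetical, and this assumes every
device still has free space):

```python
def raid10_layout(n_devices):
    """Return (devices used per chunk, stripes per copy) for RAID-10."""
    used = n_devices - n_devices % 2  # round down to an even count
    if used < 4:
        raise ValueError("RAID-10 needs at least 4 devices")
    return used, used // 2  # two copies, so half the devices per copy


for n in (6, 7):
    used, stripes = raid10_layout(n)
    print(f"{n} devices: {used} used per chunk, 2 copies of {stripes} stripes")

# Usable space on 6 x 1 TB: two copies of everything, so half the raw total.
print(6 * 1.0 / 2, "TB usable")
```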

> <<<<
> 
> [Yes, that ending sentence is incomplete in the wiki.]
> 
> So we have:
> 
> 1) raid1 is exactly two copies of data, paired devices.
> 
> 2) raid0 is a stripe exactly two devices wide (reinforced by to read a 
> stripe takes only two devices), so again paired devices.
> 
> 3) raid10 is a combination of the above raid0 and raid1 definitions, 
> exactly two raid1 pairs, paired in raid0.
> 
> So btrfs raid10 is pairs of pairs, each raid0 stripe a pair of raid1 
> mirrors.  If there's 8 devices, four smaller, four larger, the first  
> allocated chunks should be one per device, until the smaller devices fill 
> up it'll chunk across the remaining four, but it'll be pairs of pairs of 
> pairs, two pair(0)-of-pair(1) stripes wide instead of a single quad(0)-of-
> pair(1) stripe wide.

   If the RAID code used pairs for its stripes, that'd be the case,
but it doesn't...
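
   To illustrate the 8-device case from earlier in the thread, here
is a toy simulation in Python. The device sizes (100 and 300 units)
and the one-unit-per-device chunk are invented for the example, and
the model is much simpler than the real allocator:

```python
def simulate_raid10(sizes, chunk=1):
    """Allocate RAID-10 chunks until fewer than 4 devices have space.

    Returns a run-length list of [width, number_of_chunks] pairs.
    """
    free = list(sizes)
    runs = []
    while True:
        # Devices that can still hold a piece of a chunk, most free first.
        candidates = sorted(
            (i for i, f in enumerate(free) if f >= chunk),
            key=lambda i: free[i],
            reverse=True,
        )
        width = len(candidates) - len(candidates) % 2  # even count only
        if width < 4:
            break
        for i in candidates[:width]:
            free[i] -= chunk
        if runs and runs[-1][0] == width:
            runs[-1][1] += 1
        else:
            runs.append([width, 1])
    return runs


# Four small devices and four large ones.
print(simulate_raid10([100] * 4 + [300] * 4))
# prints [[8, 100], [4, 200]]
```

In this model the chunks are 8 devices wide until the small devices
fill up, and 4 wide after that.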

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
               --- emacs: Emacs Makes A Computer Slow. ---               
