On Fri, Nov 02, 2012 at 11:23:14PM +0000, Gabriel wrote:
> On Fri, 02 Nov 2012 22:06:04 +0000, Hugo Mills wrote:
> 
> > On Fri, Nov 02, 2012 at 07:05:37PM +0000, Gabriel wrote:
> >> On Fri, 02 Nov 2012 13:02:32 +0100, Goffredo Baroncelli wrote:
> >> > On 2012-11-02 12:18, Martin Steigerwald wrote:
> >> >> Metadata, DUP is displayed as 3,50GB on the device level and as 1,75GB
> >> >> in total. I understand the logic behind this, but this could be a bit
> >> >> confusing.
> >> >> 
> >> >> But it makes sense: showing real allocation at the device level
> >> >> makes sense, because that's what is really allocated on disk.
> >> >> Total makes some sense, because that's what is being used from
> >> >> the tree by BTRFS.
> >> > 
> >> > Yes, me too. At first I was confused when you noticed this
> >> > discrepancy, so I have to admit that it is not so obvious to
> >> > understand. However, we didn't find any way to make it clearer...
> >> > 
> >> >> It still looks confusing at first…
> >> > We could use "Chunk(s) capacity" instead of total/size? I would like
> >> > an opinion from a native English speaker's point of view.
> >> 
> >> This is easy to fix, here's a mockup:
> >> 
> >> Metadata,DUP: Size: 1.75GB ×2, Used: 627.84MB ×2
> >>    /dev/dm-0        3.50GB
> > 
> >    I've not considered the full semantics of all this yet -- I'll try
> > to do that tomorrow. However, I note that the "×2" here could become
> > non-integer with the RAID-5/6 code (which is due Real Soon Now). In
> > the first RAID-5/6 code drop, it won't even be simple to calculate
> > where there are different-sized devices in the filesystem. Putting an
> > exact figure on that number is potentially going to be awkward. I
> > think we're going to need kernel help for working out what that number
> > should be, in the general case.
> 
> DUP can be nested below a device because it represents same-device
> redundancy (purpose: survive smudges but not device failure).
> 
> On the other hand raid levels should occupy the same space on all
> linked devices (a necessary consequence of the guarantee that RAID5
> can survive the loss of any device and RAID6 any two devices).

   No, the multiplier here is variable. Consider:

1 MiB stored in RAID-5 across 3 devices takes up 1.5 MiB -- multiplier ×1.5
   (1 MiB over 2 devices is 512 KiB, plus an additional 512 KiB for parity)
1 MiB stored in RAID-5 across 6 devices takes up 1.2 MiB -- multiplier ×1.2
   (1 MiB over 5 devices is 204.8 KiB, plus an additional 204.8 KiB for parity)
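
   As a minimal sketch of the arithmetic above (the function name is
made up for illustration, and it assumes a single chunk whose stripe
has a fixed width with `parity` devices' worth of parity):

```python
def parity_multiplier(n_devices, parity=1):
    """Raw space consumed per unit of data for one stripe.

    n_devices: stripe width (data + parity devices) for this chunk.
    parity:    1 for RAID-5, 2 for RAID-6.
    """
    if n_devices <= parity:
        raise ValueError("stripe needs at least one data device")
    return n_devices / (n_devices - parity)


print(parity_multiplier(3))  # 1.5
print(parity_multiplier(6))  # 1.2
```

The figure only holds for chunks that share the same stripe width.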

   With the (initial) proposed implementation of RAID-5, the
stripe-width (i.e. the number of devices used for any given chunk
allocation) will be *as many as can be allocated*. Chris confirmed
this today on IRC. So if I have a disk array of 2T, 2T, 2T, 1T, 1T,
1T, then the first 1T of allocation will stripe across 6 devices,
giving me 5 data+1 parity, or a multiplier of ×1.2. As soon as the
smaller devices are full, the stripe width will drop to 3 devices, and
we'll be using 2 data+1 parity allocation, or a multiplier of ×1.5 for
any subsequent chunks. So, as more data over the first 5T is stored,
the overall multiplier steadily increases, until we fill the FS, and we
get a multiplier of about ×1.29 overall (9T raw for 7T of data). This
gets more complicated if you have
devices of many different sizes. (Imagine 6 disks with sizes 500G, 1T,
1.5T, 2T, 3T, 3T).
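
   A rough userspace model of that allocation policy, for playing with
the numbers. This is only a sketch under stated assumptions, not the
kernel's allocator: it assumes every chunk stripes across all devices
that still have free space, and that the narrowest stripe allowed is
parity + 1 devices. The names raid_phases, overall_multiplier and
min_width are invented here.

```python
def raid_phases(sizes, parity=1, min_width=None):
    """Greedy model: each chunk stripes across every device with free space.

    Returns one (width, raw, data) tuple per run of chunks sharing a
    stripe width. Units are whatever `sizes` is given in.
    """
    if min_width is None:
        min_width = parity + 1  # assumption: narrowest stripe still allowed
    free = [s for s in sizes if s > 0]
    phases = []
    while len(free) >= min_width:
        width = len(free)
        step = min(free)  # allocate until the smallest device is full
        phases.append((width, step * width, step * (width - parity)))
        free = [f - step for f in free if f > step]
    return phases


def overall_multiplier(phases):
    """Total raw space divided by total data over all phases."""
    raw = sum(p[1] for p in phases)
    data = sum(p[2] for p in phases)
    return raw / data


# Sizes in GB, kept as integers to avoid float rounding.
for width, raw, data in raid_phases([2000, 2000, 2000, 1000, 1000, 1000]):
    print(width, raw, data, raw / data)
# 6 6000 5000 1.2
# 3 3000 2000 1.5
```

Feeding it the six mixed sizes (500, 1000, 1500, 2000, 3000, 3000)
gives five phases with stripe widths 6, 5, 4, 3 and 2, which is why a
single "×n" figure is hard to put on the whole filesystem.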

   We probably can work out the current RAID overhead and feed it back
sensibly, but it's (a) not constant as the allocation of the chunks
increases, and (b) not trivial to compute.

> The two probably won't need to be represented at the same time
> except during a reshape, because I imagine DUP gets converted to
> RAID (1 or 5) as soon as the second device is added.
> 
> A 1→2 reshape would look a bit like this (doing only the data column
> and skipping totals):
> 
> InitialDevice
>   Reserved           1.21TB
>   Used               1.21TB
> RAID1(InitialDevice, SecondDevice)
>   Reserved   1.31TB + 100GB
>   Used             2× 100GB
> 
> RAID5, RAID6: same with fractions, n+1⁄n and n+2⁄n.

   Except that n isn't guaranteed to be constant. That was pretty much
my only point. Don't assume that it will be (or at the very least, be
aware that you are assuming it is, and be prepared for inconsistencies).

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
        --- Well, sir, the floor is yours.  But remember, the ---        
                              roof is ours!                              

