On 12/11/2014 07:56 PM, Zygo Blaxell wrote:
On Wed, Dec 10, 2014 at 02:18:55PM -0800, Robert White wrote:
(3) why can I make a raid5 out of two devices? (I understand that we
are currently just making mirrors, but the standard requires three
devices in the geometry etc. So I would expect a two device RAID5 to
be considered degraded with all that entails. It just looks like it's
asking for trouble to allow this once the support is finalized, as
suddenly a working RAID5 that's really a mirror would become
something that can only be mounted with the degraded flag.)
RAID5 with even parity and two devices should be exactly the same as
RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
is irrelevant because there is no difference in disk contents so the
disks are interchangeable), except with different behavior when more
devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
should start writing new chunks with N stripes instead of two).
That's not correct. A RAID5 with three elements presents two _different_
sectors in each stripe. When one element is lost, it would still present
two different sectors, but the safety is gone.
I understand that the XOR collapses into a mirror if only two elements
are involved, but that's a mathematical fact that is irrelevant to the
definition of a RAID5 layout. When you take a wheel off of a tricycle it
doesn't just become a bike. And you can't make a bicycle into a trike by
just welding on a wheel somewhere. The infrastructure of the two is
completely different.
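For what it's worth, that XOR-collapse fact is easy to check directly. A little Python sketch of my own (block values made up, not anything from the btrfs code):

```python
def raid5_parity(*data_blocks):
    """XOR all data blocks of a stripe together to produce the parity block."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

d1 = b"\x11\x22\x33\x44"
# Two-device "RAID5": one data block plus one parity block per stripe.
# XOR over a single element is the identity, so parity is a mirror of the data.
assert raid5_parity(d1) == d1

# Three devices: the parity really is a third, different sector.
d2 = b"\xaa\xbb\xcc\xdd"
p = raid5_parity(d1, d2)
assert p == bytes(a ^ b for a, b in zip(d1, d2))
```

The math collapses to a mirror at two devices, but as argued above that says nothing about the layout being a RAID5.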
So RAID5 with three media M is
M MM MMM
D1 D2 P(a)
D3 P(b) D4
P(c) D5 D6
If MMM is lost D1, D2, D3, and D5 are intact
D4 and D6 can be recreated via D3^P(b) and P(c)^D5
M MM X
D1 D2 .
D3 P(b) .
P(c) D5 .
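That reconstruction is nothing but XOR over the survivors. A quick sketch (contents made up):

```python
def xor_blocks(a, b):
    """Recover a lost RAID5 block: the missing block is the XOR of the survivors."""
    return bytes(x ^ y for x, y in zip(a, b))

d3 = b"\x01\x02\x03"
d4 = b"\x10\x20\x30"
p_b = xor_blocks(d3, d4)       # parity written for the stripe (D3, P(b), D4)

# Disk MMM dies, taking D4 with it; rebuild D4 from the survivors:
assert xor_blocks(d3, p_b) == d4
# Symmetrically, losing MM would cost P(b) only; losing M costs D3:
assert xor_blocks(p_b, d4) == d3
```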
So under _no_ circumstances would a two-disk RAID5 be the same as a
RAID1 since a two disk RAID5 functionally implies disk three because the
_minimum_ arity of a RAID5 is 3. A two-disk RAID5 has _zero_ data
protection because the minimum third element is a computational phantom.
In short it is irrational to have a "two disk" RAID5 that is "not
degraded" in the same way you cannot have a two-wheeled tricycle without
scraping some part of something along the asphalt.
A RAID1 with two elements presents one sector along the "stripe".
I realize that what has been implemented is what you call a two-drive
RAID5, done by really implementing a RAID1, but it's nonsense.
I mean I understand what you are saying you've done, but it makes no
sense according to the definitions of RAID5. There is no circumstance
where RAID5 falls back to mirroring. Trying to implement RAID5 as an
extension of a mirroring paradigm would involve a fundamental conflict
in definitions. Especially when you reached a failure mode.
This is so fundamental to the design that the "fast" way to assemble a
RAID5 of N-arity (minimum N being 3) is to just connect the first N-1
elements, declare the raid valid-but-degraded using (N-1) of the media,
and then "replacing" the Nth phantom/missing/failed element with the
real disk and triggering a rebuild. This only works if you don't need
the initial contents of the array to have a specific value like zero.
(This involves fewest reads and the array is instantly available while
it builds.)
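For comparison, mdadm exposes exactly this assemble-degraded-then-rebuild trick with its "missing" keyword (device names here are of course hypothetical):

```shell
# Create a 3-device RAID5 that is degraded from birth: two real members
# plus the literal keyword "missing" standing in for the phantom third.
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 missing

# The array is usable immediately (degraded).  Later, attach the real
# third disk and mdadm rebuilds it from the other two.
mdadm /dev/md0 --add /dev/sdc1
```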
As soon as you start writing to the array, the stripes you write
"repair" the extents if the repair process hadn't gotten to them yet.
It's basically impossible to turn a mirror into a RAID5 if you _ever_
expect the code base to be able to recover an array that's lost an
element.
(4) Same question for raid6 but with three drives instead of the
mandated four.
RAID6 with three devices should behave more or less like three-way RAID1,
except maybe the two parity disks might be different (I forget how the
function used to calculate the two parity stripes works, and whether
it can be defined such that F(disk1, disk2, disk3) == disk1).
Uh, no. A RAID6 with three drives, or even two drives, is also degraded
because the minimum is four.
A B C D
D1 D2 Pa Qa
D3 Pb Qb D4
Pc Qc D5 D6
Qd D7 D8 Pd
You can lose one or two media but the minimum stripe is again [X1,X2]
for any read (ABCD)(ABC.)(AB..)(A..D) etc.
Minimum arity for RAID6 is 4, maximum lost-but-functional configuration
is arity-minus-two.
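And to Zygo's side question about the parity function: Q is not another copy of P. In the standard RAID6 construction, P is plain XOR while Q is a Reed-Solomon syndrome over GF(2^8) that weights each data block by a power of the generator. A sketch of my own (an illustration of the standard math, not the btrfs code):

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) with the polynomial x^8+x^4+x^3+x^2+1 (0x11d)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
    return r

def pq_parity(data_blocks):
    """P = XOR of blocks; Q = sum of g^i * block_i over GF(2^8), with g = 2."""
    n = len(data_blocks[0])
    p, q = bytearray(n), bytearray(n)
    for i, block in enumerate(data_blocks):
        coeff = 1
        for _ in range(i):
            coeff = gf_mul(coeff, 2)   # g^i
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

d1, d2 = b"\x01\x02", b"\x03\x04"
p, q = pq_parity([d1, d2])
assert p == bytes(a ^ b for a, b in zip(d1, d2))
assert q != p   # Q weights d2 by g, so it differs from P with >1 data block
```

Because P and Q are independent equations, any two lost columns can be solved for; that is what buys the arity-minus-two survival above.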
(5) If I can make a RAID5 or RAID6 device with one missing element,
why can't I make a RAID1 out of one drive, e.g. with one missing
element?
They're only missing if you believe the minimum number of RAID5 disks
is not two and the minimum number of RAID6 disks is not three.
I do believe that, because that's what the terms are universally taken
to mean.
If what BTRFS is promising/planning as raid5 will run non-degraded on
two disks it's... something... but it's not RAID5.
If what BTRFS is promising/planning as raid6 will run non-degraded on
three disks it's... something... but it's not RAID6.
(6) If I make a RAID1 out of three devices are there three copies of
every extent or are there always two copies that are semi-randomly
spread across three devices? (ibid for more than three).
There are always two copies. RAID1 on 3x1TB disks gives you 1.5TB
of mirrored storage.
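For the curious, that 1.5TB figure falls out of a simple closed form, since chunks are allocated in pairs on whichever two devices have the most free space. A rough sketch of my own (an approximation, not the allocator's actual code):

```python
def raid1_usable(sizes):
    """Approximate usable capacity of btrfs RAID1: two copies of every
    chunk, greedily placed on the two devices with the most free space.
    Closed form: min(total/2, total - largest)."""
    total = sum(sizes)
    return min(total / 2, total - max(sizes))

assert raid1_usable([1000, 1000, 1000]) == 1500   # 3x1TB -> 1.5TB mirrored
assert raid1_usable([2000, 1000]) == 1000         # capped by the smaller disk
```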
---
It seems to me (very dangerous words in computer science, I know)
that we need a "failed" device designator so that a device can be in
the geometry (e.g. have a device ID) but not actually exist.
Reads/writes to the failed device would always be treated as error
returns.
The failed device would be subject to replacement with "btrfs dev
replace", and could be the source of said replacement to drop a
problematic device out of an array.
EXAMPLE:
Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.
Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Processing explicitly missing device
adding device (failed) id 2 (phantom device)
mount /dev/loop0 /mountpoint
btrfs replace start 2 /dev/loop1 /mountpoint
(and so on)
Being able to "replace" a faulty device with a phantom "failed"
device would nicely disambiguate the whole device add/remove versus
replace mistake.
It is a little odd that an array of 3 disks with one missing looks
like this:
It's correct for a three disk array with one "failed" (e.g. where
vgtester-d04 is present but bad); it's wrong for a _four_ disk array
where one disk (vgtester-d03) has been unplugged or is otherwise missing
(as opposed to "deleted").
The entire idea of "three disk array with one missing" doesn't match
your example below, which is in fact a three disk array with all
elements present. Your example below started out as a four disk array
and then you deleted one, making it a three disk array. The point at
issue would be a four-disk array with one missing. So there'd be four lines.
E.g. a four disk array with one missing _ought_ to look like:
> Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
> Total devices 3 FS bytes used 256.00KiB
> devid 1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
> devid 2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
> devid 3 size 4.00GiB used 0.00B (missing) path ???
> devid 4 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04
The problem here is that the concept of "missing" is, um, missing from
BTRFS statuses.
For instance, the same idea I presume you are going for would be
expressed in mdadm with the little status array "UU.U" for "up, up,
missing, and up".
BTRFS _should_ (big words from a noob, I know) have and display the
arity of the array with the correct number of expected disks, filled out
with the information of the available disks.
Were this correct, there would be a corresponding line for devid 3, and
vgtester-d04 would appear as devid 4, as I wrote it above.
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
devid 2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
devid 3 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04
That's not odd at all... sort of. (Simplified to three lines and names
changed because of word-wrap here...)
An array of three disks with one missing should look like:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/sda
devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
devid 3 size 4.00GiB used 0.00B (missing)
because, you know, it's like... missing...
An array of three disks with one _failed_ should look like:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/sda
devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
devid 3 size 4.00GiB used 0.00B (failed) path /dev/sdc
An array of three disks with one disk freshly replacing a previously
missing or failed one should look like:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/sda
devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
devid 3 size 4.00GiB used x.xxxB (rebuilding) path /dev/sdc
With the used value growing with each subsequent repeat of the inquiry
until they all had the same numbers.
And I don't know what the status would look like during a "replace" but
there'd temporarily be a fourth disk in the list, one being a donor and
one being the new replacement.
That's _exactly_ what a RAID5 with a degraded, failed, or missing member
should look like. For any extent (A,B,C) any one column (A), (B), or (C)
can be missing -- shown as (.) such that for a chunk size X there is
always a return stripe [X1,X2] (e.g. the stripe size is _always_ the
arity minus one, and the minimum arity is three) returned by any legal
read (A,B,C) == (.,B,C) == (A,.,C) == (A,B,.); it is this property that
provides the redundancy.
So nominally, the above would result in all reads [X1,X2] being a result
of (A,B,.) or by device ID (1,2,.). And each read of (1,2,.) would
provide the opportunity to repair ID 3's chunk.
The subsequent activity, especially a balance/repair operation would be
repopulating /dev/mapper/vgtester-d04 to reestablish the parity.
Similarly all writes to a valid extent require two reads, and two writes
minimum. If you have the parity and the target block in memory (that's
the two reads), you xor-out the original contents of target block,
xor-in the new contents of target block, then you have to write _both_
the target block and the parity block (preferably in one transaction).
In a degraded RAID5, if you are writing to a "missing" block, you have
to read all the surviving blocks in the stripe, calculate the missing
block, xor the calculated block out of the parity block, then xor the
new block into the parity and write the parity block back out. If the
replacement drive is installed and active you can also then just write
the new block there as well, and that block stripe is no longer degraded.
This is the core paradigm for RAID5.
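The arithmetic of that read-modify-write is just new_parity = old_parity ^ old_data ^ new_data. A tiny sketch (block values made up):

```python
def rmw_parity(old_parity, old_data, new_data):
    """RAID5 small-write update: xor the old data out of the parity,
    then xor the new data in.  Two reads, two writes, per the text above."""
    return bytes(p ^ od ^ nd
                 for p, od, nd in zip(old_parity, old_data, new_data))

d1, d2 = b"\x0f\xf0", b"\x33\xcc"
parity = bytes(a ^ b for a, b in zip(d1, d2))   # parity as first written

new_d2 = b"\x55\xaa"
new_parity = rmw_parity(parity, d2, new_d2)
# Invariant: parity is always the XOR of the current data blocks.
assert new_parity == bytes(a ^ b for a, b in zip(d1, new_d2))
```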
And you need the "empty" device ID: in the missing case it would cause
noop read and write events/errors but allow the spanning logic to remain
intact, and that logic is necessary for rational recovery when the array
ends up being
1,2,3-is-bad
1,2,3-is-bad
1,2,3
1,2,3
1,2,3-is-bad
As in this case stripes zero, one, and four are improper, still missing,
or whatever, while stripes two and three have been balanced/scrubbed back
into good order.
It's particularly important and valuable to have that device ID allocated
"failed" in a replace scenario where the logic is now ready to keep the
good stuff (the extents for tracks 2 and 3 for example) and only
recalculate the bad.
In the above, "vgtest-d02" was a deleted LV and does not exist, but
you'd never know that from the output of 'btrfs fi show'...
That would be because "deleted" and "failed" are two inherently
different conditions and BTRFS doesn't have the ID smarts for a "failed"
device to be present in the map.
It would make the degraded status less mysterious.
The 'degraded' status currently protects against some significant data
corruption risks. :-O
A filesystem with an explicitly failed element would also make the
future roll-out of full RAID5/6 less confusing.
I also still don't get why the RAID1 with arity greater than two was at
all hard to construct. It would have been my first step on the way to
RAID5/6.
A
D1
D2
D3
A B
D1 D1
D2 D2
D3 D3
A B C
D1 D1 D1
D2 D2 D2
D3 D3 D3
Is the logical progression right before
A B C
D1 D2 Pa
D3 Pb D4
Pc D5 D6
Until you have the code base and data structures to "search past B" in a
mirror of arbitrary arity, you just don't have the means to organize the
horizontal stripe-as-entity needed to record the arbitrarily wide
stripes you need to make a higher-order RAID.
And before _any_ of that you need to be able to explicitly account for a
missing drive such that you have a RAID1 of
A x
D1 .
D2 .
D3 .
For all possible read and write events. Without that your rebuild of any
RAID is "iffy". If you are not ready for
A x C
D1 . D1
D2 . D2
D3 . D3
then
A x C
D1 . Pa
D3 . D4
Pc . D6
Is going to ruin your world.
I don't know how to turn this into proper BTRFS speak since I am still
new to the code base...