On 12/11/2014 07:56 PM, Zygo Blaxell wrote:
On Wed, Dec 10, 2014 at 02:18:55PM -0800, Robert White wrote:
(3) why can I make a raid5 out of two devices? (I understand that we
are currently just making mirrors, but the standard requires three
devices in the geometry etc. So I would expect a two device RAID5 to
be considered degraded with all that entails. It just looks like it's
asking for trouble to allow this once the support is finalized, as
suddenly a working RAID5 that's really a mirror would become
something that can only be mounted with the degraded flag.)
RAID5 with even parity and two devices should be exactly the same as
RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
is irrelevant because there is no difference in disk contents so the
disks are interchangeable), except with different behavior when more
devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
should start writing new chunks with N stripes instead of two).
That's not correct. A RAID5 with three elements presents two _different_
sectors in each stripe. When one element is lost, it would still present
two different sectors, but the safety is gone.
I understand that the XOR collapses into a mirror if only two elements
are involved, but that's a mathematical fact that is irrelevant to the
definition of a RAID5 layout. When you take a wheel off of a tricycle it
doesn't just become a bike. And you can't make a bicycle into a trike by
just welding on a wheel somewhere. The infrastructure of the two is
completely different.
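For what it's worth, that XOR-collapse fact is easy to check directly. A little Python sketch of my own (block values made up, not anything from the btrfs code):

```python
def raid5_parity(*data_blocks):
    """XOR all data blocks of a stripe together to produce the parity block."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

d1 = b"\x11\x22\x33\x44"
# Two-device "RAID5": one data block plus one parity block per stripe.
# XOR over a single element is the identity, so parity is a mirror of the data.
assert raid5_parity(d1) == d1

# Three devices: the parity really is a third, different sector.
d2 = b"\xaa\xbb\xcc\xdd"
p = raid5_parity(d1, d2)
assert p == bytes(a ^ b for a, b in zip(d1, d2))
```

The math collapses to a mirror at two devices, but as argued above that says nothing about the layout being a RAID5.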
So RAID5 with three media M is
M MM MMM
D1 D2 P(a)
D3 P(b) D4
P(c) D5 D6
If MMM is lost D1, D2, D3, and D5 are intact
D4 and D6 can be recreated via D3^P(b) and P(c)^D5
M MM X
D1 D2 .
D3 P(b) .
P(c) D5 .
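That reconstruction is nothing but XOR over the survivors. A quick sketch (contents made up):

```python
def xor_blocks(a, b):
    """Recover a lost RAID5 block: the missing block is the XOR of the survivors."""
    return bytes(x ^ y for x, y in zip(a, b))

d3 = b"\x01\x02\x03"
d4 = b"\x10\x20\x30"
p_b = xor_blocks(d3, d4)       # parity written for the stripe (D3, P(b), D4)

# Disk MMM dies, taking D4 with it; rebuild D4 from the survivors:
assert xor_blocks(d3, p_b) == d4
# Symmetrically, losing MM would cost P(b) only; losing M costs D3:
assert xor_blocks(p_b, d4) == d3
```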
So under _no_ circumstances would a two-disk RAID5 be the same as a
RAID1 since a two disk RAID5 functionally implies disk three because the
_minimum_ arity of a RAID5 is 3. A two-disk RAID5 has _zero_ data
protection because the minimum third element is a computational phantom.
In short it is irrational to have a "two disk" RAID5 that is "not
degraded" in the same way you cannot have a two-wheeled tricycle without
scraping some part of something along the asphalt.
A RAID1 with two elements presents one sector along the "stripe".
I realize that what has been implemented is what you call a two-drive
RAID5, done by really implementing a RAID1, but it's nonsense.
I mean I understand what you are saying you've done, but it makes no
sense according to the definitions of RAID5. There is no circumstance
where RAID5 falls back to mirroring. Trying to implement RAID5 as an
extension of a mirroring paradigm would involve a fundamental conflict
in definitions. Especially when you reached a failure mode.
This is so fundamental to the design that the "fast" way to assemble a
RAID5 of N-arity (minimum N being 3) is to just connect the first N-1
elements, declare the raid valid-but-degraded using (N-1) of the media,
and then "replacing" the Nth phantom/missing/failed element with the
real disk and triggering a rebuild. This only works if you don't need
the initial contents of the array to have a specific value like zero.
(This involves fewest reads and the array is instantly available while
it builds.)
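For comparison, mdadm exposes exactly this assemble-degraded-then-rebuild trick with its "missing" keyword (device names here are of course hypothetical):

```shell
# Create a 3-device RAID5 that is degraded from birth: two real members
# plus the literal keyword "missing" standing in for the phantom third.
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 missing

# The array is usable immediately (degraded).  Later, attach the real
# third disk and mdadm rebuilds it from the other two.
mdadm /dev/md0 --add /dev/sdc1
```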
As soon as you start writing to the array, the stripes you write
"repair" the extents if the repair process hadn't gotten to them yet.
It's basically impossible to turn a mirror into a RAID5 if you _ever_
expect the code base to be able to recover an array that's lost an
element.
(4) Same question for raid6 but with three drives instead of the
mandated four.
RAID6 with three devices should behave more or less like three-way RAID1,
except maybe the two parity disks might be different (I forget how the
function used to calculate the two parity stripes works, and whether
it can be defined such that F(disk1, disk2, disk3) == disk1).
Uh, no. A RAID6 with three drives, or even two drives, is also degraded
because the minimum is four.
A B C D
D1 D2 Pa Qa
D3 Pb Qb D4
Pc Qc D5 D6
Qd D7 D8 Pd
You can lose one or two media but the minimum stripe is again [X1,X2]
for any read (ABCD)(ABC.)(AB..)(A..D) etc.
Minimum arity for RAID6 is 4, maximum lost-but-functional configuration
is arity-minus-two.
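And to Zygo's side question about the parity function: Q is not another copy of P. In the standard RAID6 construction, P is plain XOR while Q is a Reed-Solomon syndrome over GF(2^8) that weights each data block by a power of the generator. A sketch of my own (an illustration of the standard math, not the btrfs code):

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) with the polynomial x^8+x^4+x^3+x^2+1 (0x11d)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
    return r

def pq_parity(data_blocks):
    """P = XOR of blocks; Q = sum of g^i * block_i over GF(2^8), with g = 2."""
    n = len(data_blocks[0])
    p, q = bytearray(n), bytearray(n)
    for i, block in enumerate(data_blocks):
        coeff = 1
        for _ in range(i):
            coeff = gf_mul(coeff, 2)   # g^i
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

d1, d2 = b"\x01\x02", b"\x03\x04"
p, q = pq_parity([d1, d2])
assert p == bytes(a ^ b for a, b in zip(d1, d2))
assert q != p   # Q weights d2 by g, so it differs from P with >1 data block
```

Because P and Q are independent equations, any two lost columns can be solved for; that is what buys the arity-minus-two survival above.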
(5) If I can make a RAID5 or RAID6 device with one missing element,
why can't I make a RAID1 out of one drive, e.g. with one missing
element?
They're only missing if you believe the minimum number of RAID5 disks
is not two and the minimum number of RAID6 disks is not three.
I do believe that, because that's what the terms are universally taken
to mean.
If what BTRFS is promising/planning as raid5 will run non-degraded on
two disks it's... something... but it's not RAID5.
If what BTRFS is promising/planning as raid6 will run non-degraded on
three disks it's... something... but it's not RAID6.
(6) If I make a RAID1 out of three devices are there three copies of
every extent or are there always two copies that are semi-randomly
spread across three devices? (ibid for more than three).
There are always two copies. RAID1 on 3x1TB disks gives you 1.5TB
of mirrored storage.
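For the curious, that 1.5TB figure falls out of a simple closed form, since chunks are allocated in pairs on whichever two devices have the most free space. A rough sketch of my own (an approximation, not the allocator's actual code):

```python
def raid1_usable(sizes):
    """Approximate usable capacity of btrfs RAID1: two copies of every
    chunk, greedily placed on the two devices with the most free space.
    Closed form: min(total/2, total - largest)."""
    total = sum(sizes)
    return min(total / 2, total - max(sizes))

assert raid1_usable([1000, 1000, 1000]) == 1500   # 3x1TB -> 1.5TB mirrored
assert raid1_usable([2000, 1000]) == 1000         # capped by the smaller disk
```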
---
It seems to me (very dangerous words in computer science, I know)
that we need a "failed" device designator so that a device can be in
the geometry (e.g. have a device ID) but not actually exist.
Reads/writes to the failed device would always be treated as error
returns.
The failed device would be subject to replacement with "btrfs dev
replace", and could be the source of said replacement to drop a
problematic device out of an array.
EXAMPLE:
Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.
Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Processing explicitly missing device
adding device (failed) id 2 (phantom device)
mount /dev/loop0 /mountpoint
btrfs replace start 2 /dev/loop1 /mountpoint
(and so on)
Being able to "replace" a faulty device with a phantom "failed"
device would nicely disambiguate the whole device add/remove versus
replace mistake.
It is a little odd that an array of 3 disks with one missing looks
like this:
It's correct for a three disk array with one "failed" (e.g. where
vgtester-d04 is present but bad); it's wrong for a _four_ disk array
where one disk (vgtester-d03) has been unplugged or is otherwise missing
(as opposed to "deleted").
The entire idea of "three disk array with one missing" doesn't match
your example below, which is in fact a three disk array with all
elements present. Your example below started out as a four disk array
and then you deleted one, making it a three disk array. The point at
issue would be a four-disk array with one missing. So there'd be four lines.
E.g. a four disk array with one missing _ought_ to look like:
> Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
> Total devices 3 FS bytes used 256.00KiB
> devid 1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
> devid 2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
> devid 3 size 4.00GiB used 0.00B (missing) path ???
> devid 4 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04
The problem here is that the concept of "missing" is, um, missing from
BTRFS statuses.
For instance, the same idea I presume you are going for would be
expressed in mdadm with the little status array "UU.U" for "up, up,
missing, and up".
BTRFS _should_ (big words from a noob, I know) have and display the
arity of the array with the correct number of expected disks, filled out
with the information of the available disks.
Were this correct, there would be a corresponding line for devid 3, and
vgtester-d04 would appear as devid 4, as I wrote it above.
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
devid 2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
devid 3 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04
That's not odd at all... sort of. (Simplified to three lines and names
changed because of word-wrap here...)
An array of three disks with one missing should look like:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/sda
devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
devid 3 size 4.00GiB used 0.00B (missing)
because, you know, it's like... missing...
An array of three disks with one _failed_ should look like:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/sda
devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
devid 3 size 4.00GiB used 0.00B (failed) path /dev/sdc
An array of three disks with one disk freshly replacing a previously
missing or failed one should look like:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/sda
devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
devid 3 size 4.00GiB used x.xxxB (rebuilding) path /dev/sdc
With the used value growing with each subsequent repeat of the inquiry
until they all had the same numbers.
And I don't know what the status would look like during a "replace" but
there'd temporarily be a fourth disk in the list, one being a donor and
one being the new replacement.
That's _exactly_ what a RAID5 with a degraded, failed, or missing member
should look like. For any extent (A,B,C) any one column (A), (B), or (C)
can be missing -- shown as (.) such that for a chunk size X there is
always a return stripe [X1,X2] (e.g. the stripe size is _always_ the
arity minus one, and the minimum arity is three) returned by any legal
read (A,B,C) == (.,B,C) == (A,.,C) == (A,B,.); it is this property that
provides the redundancy.
So nominally, the above would result in all reads [X1,X2] being a result
of (A,B,.) or by device ID (1,2,.). And each read of (1,2,.) would
provide the opportunity to repair ID 3's chunk.
The subsequent activity, especially a balance/repair operation would be
repopulating /dev/mapper/vgtester-d04 to reestablish the parity.
Similarly all writes to a valid extent require two reads, and two writes
minimum. If you have the parity and the target block in memory (that's
the two reads), you xor-out the original contents of target block,
xor-in the new contents of target block, then you have to write _both_
the target block and the parity block (preferably in one transaction).
In a degraded RAID5, if you are writing to a "missing" block, you have
to read all the surviving blocks in the stripe, calculate the missing
block, xor the calculated block out of the parity block, then xor the
new block into the parity and write the parity block back out. If the
replacement drive is installed and active you can also then just write
the new block there as well, and that block stripe is no longer degraded.
This is the core paradigm for RAID5.
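The arithmetic of that read-modify-write is just new_parity = old_parity ^ old_data ^ new_data. A tiny sketch (block values made up):

```python
def rmw_parity(old_parity, old_data, new_data):
    """RAID5 small-write update: xor the old data out of the parity,
    then xor the new data in.  Two reads, two writes, per the text above."""
    return bytes(p ^ od ^ nd
                 for p, od, nd in zip(old_parity, old_data, new_data))

d1, d2 = b"\x0f\xf0", b"\x33\xcc"
parity = bytes(a ^ b for a, b in zip(d1, d2))   # parity as first written

new_d2 = b"\x55\xaa"
new_parity = rmw_parity(parity, d2, new_d2)
# Invariant: parity is always the XOR of the current data blocks.
assert new_parity == bytes(a ^ b for a, b in zip(d1, new_d2))
```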
And you need the "empty" device ID: in the missing case it would cause
noop read and write events/errors but allow the spanning logic to remain
intact, and that logic is necessary for rational recovery when the array
ends up being
1,2,3-is-bad
1,2,3-is-bad
1,2,3
1,2,3
1,2,3-is-bad
As in this case stripes zero, one, and four are improper, still missing,
or whatever, while stripes two and three have been balanced/scrubbed back
into good order.
It's particularly important and valuable to have that device ID allocated
"failed" in a replace scenario where the logic is now ready to keep the
good stuff (the extents for tracks 2 and 3 for example) and only
recalculate the bad.
In the above, "vgtest-d02" was a deleted LV and does not exist, but
you'd never know that from the output of 'btrfs fi show'...
That would be because "deleted" and "failed" are two inherently
different conditions and BTRFS doesn't have the ID smarts for a "failed"
device to be present in the map.
It would make the degraded status less mysterious.
The 'degraded' status currently protects against some significant data
corruption risks. :-O
A filesystem with an explicitly failed element would also make the
future roll-out of full RAID5/6 less confusing.
I also still don't get why the RAID1 with arity greater than two was at
all hard to construct. It would have been my first step on the way to
RAID5/6.
A
D1
D2
D3
A B
D1 D1
D2 D2
D3 D3
A B C
D1 D1 D1
D2 D2 D2
D3 D3 D3
Is the logical progression right before
A B C
D1 D2 Pa
D3 Pb D4
Pc D5 D6
Until you have the code base and data structures to "search past B" in a
mirror of arbitrary arity, you just don't have the means to organize the
horizontal stripe-as-entity needed to record the arbitrarily wide
stripes you need to make a higher-order RAID.
And before _any_ of that you need to be able to explicitly account for a
missing drive such that you have a RAID1 of
A x
D1 .
D2 .
D3 .
For all possible read and write events. Without that your rebuild of any
RAID is "iffy". If you are not ready for
A x C
D1 . D1
D2 . D2
D3 . D3
then
A x C
D1 . Pa
D3 . D4
Pc . D6
Is going to ruin your world.
I don't know how to turn this into proper BTRFS speak since I am still
new to the code base...