On 2017-03-05 14:13, Peter Grandi wrote:
What makes me think that "unmirrored" 'raid1' profile chunks
are "not a thing" is that it is impossible to explicitly remove
a member device from a 'raid1' profile volume: first one has to
'convert' to 'single', and then the 'remove' copies back to the
remaining devices the 'single' chunks that are on the explicitly
'remove'd device. Which to me seems absurd.

It is, and there should be a way to do this as a single
operation. [ ... ] The reason this is currently the case, though,
is a simple one: 'btrfs device delete' is just a special instance
of balance [ ... ] does no profile conversion, but having that as
an option would actually be _very_ useful from a data safety
perspective.
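
The two-step sequence being described would be roughly the
following, assuming the volume is mounted at /mnt and /dev/sdc is
the member being taken out (both names are just placeholders):

  # convert existing chunks away from 'raid1' so the device can go
  btrfs balance start -dconvert=single -mconvert=single /mnt
  # then remove it; its chunks are migrated to the remaining members
  btrfs device delete /dev/sdc /mnt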

That seems to me an even more "confused" opinion, because
removing a device to make it "missing" and removing it
permanently should be very different operations.

Consider the common case of a 3-member volume with a 'raid1'
target profile: if the sysadm thinks that a drive should be
replaced, the goal is to take it out *without* converting every
chunk to 'single', because with 2-out-of-3 devices roughly a
third of the chunks will still be fully mirrored.

Also, removing the device to be replaced should really not be
the same thing as balancing its chunks, if there is space, to be
'raid1' across the remaining drives, because that is a completely
different operation.

There is a command specifically for replacing devices, 'btrfs replace'. It operates very differently from the add+delete or delete+add sequences. Instead of balancing, it is more similar to LVM's pvmove command: it redirects all new writes that would have gone to the old device to the new one, then copies all the data from the old device to the new one (while properly recreating damaged chunks). It uses far less bandwidth than add+delete, runs faster, and is in general much safer because it moves less data around. If you're just replacing devices, you should be using this, not the add and delete commands, which are more for reshaping arrays than repairing them.
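
A typical invocation, with placeholder device names and mount
point, would be something like:

  # replace /dev/sdc with /dev/sdd in place, rebuilding any damaged
  # or missing copies from the other members as it goes
  btrfs replace start /dev/sdc /dev/sdd /mnt
  # the copy runs in the background; check on it with:
  btrfs replace status /mnt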

Additionally, if you _have_ to use add and delete to replace a device, then if at all possible you should add the new device first and delete the old one afterwards, not the other way around, as that avoids most of the issues apart from the high load the balance operation puts on the filesystem.
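
That is, with the same placeholder names as above, something like:

  # add the new device first, so there is always somewhere to put
  # a second copy of each chunk
  btrfs device add /dev/sdd /mnt
  # then drop the old one; its chunks are re-created on the
  # remaining devices
  btrfs device delete /dev/sdc /mnt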

Going further in my speculation, I suspect that at the core of
the Btrfs multidevice design there is a persistent "confusion"
(to use a euphemism) between volumes having a profile and merely
chunks having a profile.

There generally is.  The profile is entirely a property of the
chunks (each chunk literally has a bit of metadata that says
what profile it is), not the volume.  There's some metadata in
the volume somewhere that says what profile to use for new
chunks of each type (I think),

That's the "target" profile for the volume.

but that doesn't dictate what chunk profiles there are on the
volume. [ ... ]
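
The per-chunk nature of profiles is easy to see from user space,
e.g. (mount point is a placeholder):

  # one line per (chunk type, profile) combination actually
  # allocated; during or after a conversion it is normal to see
  # e.g. both "Data, RAID1" and "Data, single" at the same time
  btrfs filesystem df /mnt
  # the same allocation information broken down per device
  btrfs filesystem usage /mnt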

But if that's the case, then the current Btrfs logic for
determining whether a volume is degraded or not is quite
"confused" indeed.

Entirely agreed. Currently, it checks the target profile, when it should be checking per-chunk.

Because suppose there is again the simple case of a 3-device
volume, where all existing chunks have the 'raid1' profile, the
volume's target profile is also 'raid1', and one device has gone
offline: the volume cannot be said to be "degraded" unless a full
examination of all chunks is made, because it can well happen
that in fact *none* of the chunks was mirrored to that device,
however unlikely. And vice versa: even with all 3 devices present
some chunks may be temporarily "unmirrored" (hopefully only for
brief periods).

The average case, with devices of similar size, is that about a
third of the chunks will be fully mirrored across the two
remaining devices and the other two thirds will be "unmirrored",
because each 'raid1' chunk has copies on two of the three
devices.

Now consider re-adding the third device: at that point the
volume has got back all 3 devices, so it is not "degraded", but
roughly two thirds of the chunks in the volume will still be
"unmirrored", even if eventually they will be mirrored again on
the newly added device.
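
As an aside, getting from that state back to "every chunk fully
mirrored" currently takes an explicit maintenance pass, roughly
along these lines (mount point again a placeholder):

  # re-write the missing/stale copies of existing 'raid1' chunks
  btrfs scrub start /mnt
  # convert back to 'raid1' any chunks that were created as
  # 'single' while the volume was degraded; the 'soft' filter
  # skips chunks that already match the target profile
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt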

Note: the possibilities get even more interesting with a 4-device
volume with 'raid1' profile chunks, and in similar cases
involving profiles other than 'raid1'.

Therefore the current Btrfs logic for deciding whether a volume
is "degraded" seems simply "confused" to me, because whether
there are missing devices and whether some chunks are
"unmirrored" are not quite the same thing.

The same applies to the current logic whereby, in a 2-device
volume with a device missing, new chunks are created with the
"single" profile instead of as "unmirrored" 'raid1' chunks:
another example of "confusion" between number of devices and
chunk profile.

Note: the best that can be said is that a volume has both a
"target chunk profile" (one each for data, metadata and system
chunks) and a target number of member devices, that a volume with
a number of devices below the target *might* be degraded, and
that whether a volume is in fact degraded is not either/or, but
given by the percentage of chunks or stripes that are degraded.
This is especially clear in the 'raid1' case, where the chunk
stripe length is always 2 but the target number of devices can be
greater than 2. In Btrfs, unlike in conventional RAID such as
Linux MD, management of devices and management of stripes are
rather different operations needing rather different, if related,
logic.

My impression is that because of this "confusion" between the
number of devices in a volume and the status of chunk profiles
there are some "surprising" behaviors in Btrfs, and that it will
take quite a bit of work to fix, most importantly for the Btrfs
developer team to agree among themselves on the semantics
attached to both. After 10 years of development that seems the
right thing to do :-).
