On 2019-02-10 13:34, Chris Murphy wrote:
On Sat, Feb 9, 2019 at 5:13 AM waxhead <waxh...@dirtcellar.net> wrote:

Understood, but that is not quite what I meant - let me rephrase...
If BTRFS still can't mount, why would it blindly accept a previously
missing disk back into the pool?!

It doesn't do it blindly. It only ever mounts when the user specifies
the degraded mount option, which is not a default mount option.
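
For completeness, on a two-device raid1 with one member missing that
intervention looks something like this (device names here are only
placeholders):

    # a plain mount refuses to proceed while a device is missing;
    # the user has to explicitly ask for a degraded mount:
    mount -o degraded /dev/sda /mnt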

E.g. if you have disks A+B and suddenly at one boot B is not there.
Now you have only A, and one would think that A should register that B
has been missing. On the next boot you have A+B again, in which case B
is likely to have diverged from A since A has been mounted without B
present. So even if both devices are present, why would btrfs blindly
accept that both A and B are good to go, when it should be perfectly
possible to record in A that B was gone? And if you have B without A
it should be the same story, right?

OK no, you haven't gone far enough to set up the split brain scenario
where there is a partially legitimate complaint. Prior to split brain,
it's entirely reasonable for Btrfs to mount *when you use the degraded
mount option* - it does not blindly mount. And if you've ever done
exactly what you wrote in the above paragraph, you'd see Btrfs
*complains vociferously* about all the errors it's passively finding
and fixing. If you want a more active method of getting device B
caught up with A automatically - that's completely reasonable, and
something people have been saying for some time, but it takes a design
proposal, and code.
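
(For what it's worth, the manual way to get B caught up today is, as
far as I know, to run a scrub once both devices are mounted together
again, e.g.:

    # rewrite stale/bad copies on the lagging device from the good one
    btrfs scrub start -Bd /mnt

with -B keeping it in the foreground and -d printing per-device stats.)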

As for the split brain scenario, it is only the user's manual intervention
with multiple 'degraded' mount options (which again, is not the
default) that caused the volume to arrive in such a state. Would it be
wise to have some additional error checking? Sure. Someone would need
to step up with a design and to do code work, same as any other
feature. Maybe a rudimentary check would be comparing the timestamps
for leaves or nodes ostensibly with the same transid, but in any case
that doesn't just happen for free.
And even then it couldn't be made truly reliable, because data from old transactions may be arbitrarily overwritten at any point after the next transaction (and is just plain gone if you're using the `discard` mount option).
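To illustrate what such a check would be working from: each member
device's superblock already records the generation it last committed,
so a stale device can at least be spotted by hand, along the lines of
(paths are just examples):

    # compare the last committed generation recorded on each member
    btrfs inspect-internal dump-super /dev/sda | grep '^generation'
    btrfs inspect-internal dump-super /dev/sdb | grep '^generation'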


So what you are saying is that the generation number does not
represent a true frozen state of the filesystem at that point?
It does _only_ for those devices which were present at the time of the
commit that incremented it.

So, in other words, devices that are not present can easily be marked /
defined as such at a later time?

That isn't how it currently works. When stale device B is subsequently
mounted (normally) along with device A, it's only passively fixed up.
Part of the point of non-automatic degraded mounts that require user
intervention is the lack of anything beyond simple error handling and
fixups.
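
Those passive fixups do at least show up in the per-device error
counters, so they are visible after the fact:

    # corruption_errs and generation_errs accumulate as stale data is found
    btrfs device stats /mnt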

OK, I'm still not sure I understand how/why systemd knows what devices are
part of btrfs (or md or lvm for that matter). I'll try to research this
a bit - thanks for the info!

It doesn't, not directly. It's from the previously mentioned udev
rule. For md, the assembly, delays, and fallback to running degraded
are handled in dracut. But the reason this is in udev is to
prevent a mount failure just because one or more devices are delayed;
basically it inserts a pause until all the devices appear, and then
systemd issues the mount command.
Last I knew, it was systemd itself doing the pause, because we provide no real device for udev to wait on appearing.
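
For reference, the rule in question is systemd's 64-btrfs.rules; from
memory it boils down to roughly this (check your distro's copy under
/usr/lib/udev/rules.d/ for the exact text):

    SUBSYSTEM!="block", GOTO="btrfs_end"
    ACTION=="remove", GOTO="btrfs_end"
    ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

    # ask the kernel whether the filesystem this device belongs to is complete
    IMPORT{builtin}="btrfs ready $devnode"

    # if not complete, keep the device marked not-ready so systemd keeps waiting
    ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

    LABEL="btrfs_end"

So the mount job just waits on the device unit (until it times out)
instead of failing outright when a member is slow to appear.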
