Austin S. Hemmelgarn wrote:
On 2019-02-08 13:10, waxhead wrote:
Austin S. Hemmelgarn wrote:
On 2019-02-07 13:53, waxhead wrote:


Austin S. Hemmelgarn wrote:

So why does BTRFS hurry to mount itself even if devices are missing? And if BTRFS can still mount, why would it blindly accept a non-existing disk as part of the pool?!
It doesn't unless you tell it to, and that behavior is exactly what I'm arguing against making the default here.
Understood, but that is not quite what I meant - let me rephrase...
If BTRFS can still mount without a device, why would it blindly accept the previously missing disk back into the pool?! E.g. you have disks A+B, and suddenly at one boot B is not there. Now you have only A, and one would think that A should register that B has been missing. On the next boot you have A+B again, in which case A and B have likely diverged, since A has been mounted without B present. So even if both devices are present, why would btrfs blindly accept that both A+B are good to go, when it should be perfectly possible to record in A that B was gone? And if you have B without A it should be the same story, right?


Realistically, we can only safely recover from divergence correctly if we can prove that all devices are true prior states of the current highest generation, which is not currently possible to do reliably because of how BTRFS operates.

So what you are saying is that the generation number does not represent a true frozen state of the filesystem at that point?
It does _only_ for those devices which were present at the time of the commit that incremented it.

So in other words devices that are not present can easily be marked / defined as such at a later time?

As an example (don't do this with any BTRFS volume you care about, it will break it), take a BTRFS volume with two devices configured for raid1.  Mount the volume with only one of the devices present, issue a single write to it, then unmount it.  Now do the same with only the other device.  Both devices will now show the same generation number (one higher than when you started), but the generation number on each device refers to a different volume state.
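
Roughly, that sequence looks like this; the device paths and mount point are just examples, and you need some way of hiding the other device between mounts (e.g. detaching it from a VM), since btrfs will otherwise pick it up automatically:

    # Throwaway two-device raid1 volume -- this WILL leave it split-brained.
    mkfs.btrfs -d raid1 -m raid1 /dev/vdb /dev/vdc

    # With only /dev/vdb attached:
    mount -o degraded /dev/vdb /mnt
    touch /mnt/written-while-vdc-was-gone
    umount /mnt

    # Now with only /dev/vdc attached:
    mount -o degraded /dev/vdc /mnt
    touch /mnt/written-while-vdb-was-gone
    umount /mnt

    # Both superblocks now report the same generation number, but each one
    # describes a different volume state:
    btrfs inspect-internal dump-super /dev/vdb | grep '^generation'
    btrfs inspect-internal dump-super /dev/vdc | grep '^generation'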

Also, LVM and MD have the exact same issue, it's just not as significant because they re-add and re-sync missing devices automatically when they reappear, which makes such split-brain scenarios much less likely.
Which means marking the entire device as invalid and then re-adding it from scratch, more or less...
Actually, it doesn't.

For LVM and MD, they track what regions of the remaining device have changed, and sync only those regions when the missing device comes back.

For MD, yes, if you have the write-intent bitmap enabled...
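
For reference, the MD version of that looks roughly like this (array and device names are examples):

    # Add a write-intent bitmap so only dirtied regions need to resync:
    mdadm --grow /dev/md0 --bitmap=internal
    # After the dropped device comes back, re-add it; only the regions marked
    # dirty in the bitmap get rewritten:
    mdadm /dev/md0 --re-add /dev/sdb1
    cat /proc/mdstat   # shows the (partial) resync progress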

For BTRFS, the same thing happens implicitly because of the COW structure, and you can manually reproduce similar behavior to LVM or MD by scrubbing the volume and then using balance with the 'soft' filter to ensure all the chunks are the correct type.
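
In practice that looks roughly like the following (mount point is an example):

    # Rewrite any stale or missing copies from the good device(s):
    btrfs scrub start -Bd /mnt
    # Convert any chunks that were created with the wrong profile while the
    # volume was degraded; 'soft' skips chunks that are already raid1:
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt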

Understood.

Why does systemd concern itself with what devices a btrfs volume consists of? Please educate me, I am curious.
For the same reason that it concerns itself with what devices make up an LVM volume or an MD array.  In essence, it comes down to a couple of specific things:

* It is almost always preferable to delay boot-up while waiting for a missing device to reappear than it is to start using a volume that depends on it while it's missing.  The overall impact on the system from taking a few seconds longer to boot is generally less than the impact of having to resync the device when it reappears while the system is still booting up.

* Systemd allows mounts to not block the system booting while still allowing certain services to depend on those mounts being active.  This is extremely useful for remote management reasons, and is actually supported by most service managers these days.  Systemd extends this all the way down the storage stack though, which is even more useful, because it lets disk failures properly cascade up the stack, so the volumes they were part of show up as degraded (or get unmounted, if you choose to configure it that way).
Ok, I'm still not sure I understand how/why systemd knows what devices are part of a btrfs volume (or MD or LVM, for that matter). I'll try to research this a bit - thanks for the info!
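
For what it's worth, a rough way to see that wiring on a running system (the mount point /srv/data and the unit name derived from it are just examples):

    # systemd generates srv-data.mount from the /etc/fstab entry for /srv/data;
    # 'nofail' there keeps a missing device from blocking boot, while services
    # can still declare RequiresMountsFor=/srv/data so they wait for the mount.
    systemctl show srv-data.mount -p Requires -p After -p BindsTo
    # Which units depend on the mount (and would be stopped if it goes away):
    systemctl list-dependencies --reverse srv-data.mount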


IOW, there's a special case with systemd that can make even a degraded mount of a BTRFS volume with missing devices fail to work.
Well I use systemd on Debian and have not had that issue. In what situation does this fail?
At one point, if you tried to manually mount a volume that systemd did not see all the constituent devices present for, it would get unmounted almost instantly by systemd itself.  This may not be the case anymore, or it may just have been how the distros I've used with systemd happened to behave, but either way it's a pain in the arse when you want to fix a BTRFS volume.
I can see that, but from my "toying around" with btrfs I have not run into any issues while mounting degraded.
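
One quick way to check whether udev/systemd currently consider a multi-device volume complete (device path is an example):

    # Exits 0 only once the kernel has seen every member device of the volume
    # that /dev/sdb1 belongs to; this is essentially the same check the btrfs
    # udev rule performs before systemd treats the volume as ready.
    btrfs device ready /dev/sdb1 && echo "all devices present" || echo "still missing devices"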



* Given that new kernels still don't properly generate half-raid1 chunks when a device is missing in a two-device raid1 setup, there's a very real possibility that users will have trouble recovering filesystems with old recovery media (IOW, any recovery environment running a kernel before 4.14 will not mount the volume correctly).
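
A quick way to spot those wrongly-typed chunks on an affected volume (mount point is an example):

    # On a two-device raid1 volume that was written to while degraded, look for
    # 'single' data/metadata chunks mixed in with the RAID1 ones:
    btrfs filesystem usage /mnt | grep -E 'single|RAID1'
    # Any such chunks can be converted back with the 'soft'-filtered balance
    # mentioned earlier.
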
Sometimes you have to break a few eggs to make an omelette, right? If people want to recover their data they should have backups, and if they are really interested in recovering their data (and don't have backups), then they will probably find this on the web by searching anyway...
Backups aren't the type of recovery I'm talking about.  I'm talking about people booting to things like SystemRescueCD to fix system configuration or do offline maintenance without having to nuke the system and restore from backups.  Such recovery environments often don't get updated for a _long_ time, and such usage is not atypical as a first step in trying to fix a broken system in situations where downtime really is a serious issue.
I would say that if downtime is such a serious issue, you have a failover and a working, tested backup.
Generally yes, but restoring a volume completely from scratch is almost always going to take longer than just fixing what's broken unless it's _really_ broken.  Would you really want to nuke a system and rebuild it from scratch just because you accidentally pulled out the wrong disk when hot-swapping drives to rebuild an array?
Absolutely not, but in this case I would not even want to use a rescue disk in the first place.


* You shouldn't be mounting writable and degraded for any reason other than fixing the volume (or converting it to a single profile until you can fix it), even aside from the other issues.
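
If you do have to run on what is left for a while, the conversion is roughly as follows (paths are examples; -f may be required because this reduces metadata redundancy):

    # Drop to profiles that work on a single device (dup metadata keeps two
    # copies on the same disk), then drop the dead device from the volume:
    btrfs balance start -f -dconvert=single -mconvert=dup /mnt
    btrfs device remove missing /mnt
    # Later, once a replacement has been added, convert back to raid1:
    btrfs device add /dev/sdc /mnt
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt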

Well, in my opinion the degraded mount option is counter-intuitive. Unless asked otherwise, the system should mount and work as long as it can guarantee the data can be read and written somehow (regardless of whether any redundancy guarantee is met). If the user is willing to accept more or less risk, they should configure that!
Again, BTRFS mounting degraded is significantly riskier than LVM or MD doing the same thing.  Most users don't properly research things (when was the last time you did a complete cost/benefit analysis before deciding to use a particular piece of software on a system?), and would not know they were taking on significantly higher risk by using BTRFS without configuring it to behave safely, until it actually caused them problems, at which point most people would complain about the resulting data loss instead of trying to figure out why it happened and how to prevent it.  I don't know about you, but I for one would rather BTRFS have a reputation for being over-aggressively safe by default than risk users' data by default.
Well, I don't do cost/benefit analyses since I run free software. I do, however, try my best to ensure that whatever software I install doesn't cause more drawbacks than benefits.
Which is essentially a CBA.  The cost doesn't have to be money; it could be time, or even limitations on what you can do with the system.

I would also like BTRFS to be over-aggressively safe, but I also want it to be over-aggressively available - always running, or even limping along if that is what it needs to do.
And you can have it do that, we just prefer not to by default.
Got it!
