Re: btrfs as / filesystem in RAID1

waxhead Fri, 08 Feb 2019 10:10:45 -0800

Austin S. Hemmelgarn wrote:

On 2019-02-07 13:53, waxhead wrote:
Austin S. Hemmelgarn wrote:
On 2019-02-07 06:04, Stefan K wrote:
Thanks, with degraded as kernel parameter and also in the fstab it works like expected
That should be the normal behaviour, because a server must be up and running, and I don't care about a device loss; that's why I use a RAID1. The device-loss problem I can fix later, but it's important that a server is up and running. I get informed at boot time and also in the log files that a device is missing, and you also see that if you use a monitoring program.
No, it shouldn't be the default, because:
* Normal desktop users _never_ look at the log files or boot info, and rarely run monitoring programs, so they as a general rule won't notice until it's already too late. BTRFS isn't just a server filesystem, so it needs to be safe for regular users too.
I am willing to argue that whatever you refer to as normal users don't have a clue how to make a raid1 filesystem, nor do they care about what underlying filesystem their computer runs. I can't quite see how a limping system would be worse than a failing system in this case. Besides "normal" desktop users use Windows anyway, people that run on penguin powered stuff generally have at least some technical knowledge.
Once you get into stuff like Arch or Gentoo, yeah, people tend to have enough technical knowledge to handle this type of thing, but if you're talking about the big distros like Ubuntu or Fedora, not so much. Yes, I might be a bit pessimistic here, but that pessimism is based on personal experience over many years of providing technical support for people.
Put differently, human nature is to ignore things that aren't immediately relevant. Kernel logs don't matter until you see something wrong. Boot messages don't matter unless you happen to see them while the system is booting (and most people don't). Monitoring is the only way here, but most people won't invest the time in proper monitoring until they have problems. Even as a seasoned sysadmin, I never look at kernel logs until I see any problem, I rarely see boot messages on most of the systems I manage (because I'm rarely sitting at the console when they boot up, and when I am I'm usually handling startup of a dozen or so systems simultaneously after a network-wide outage), and I only monitor things that I know for certain need to be monitored.

So what you are saying here is that distros that use btrfs by default should be responsible enough to make some monitoring solution if they allow non-technical users to create a "raid"1 like btrfs filesystem in the first place. I don't think that many distros install some S.M.A.R.T. monitoring solution either... in which case you are worse off with a non-checksumming filesystem. Since the users you refer to basically ignore the filesystem anyway I can't see why this would be an argument at all...

* It's easily possible to end up mounting degraded by accident if one of the constituent devices is slow to enumerate, and this can easily result in a split-brain scenario where all devices have diverged and the volume can only be repaired by recreating it from scratch.
Am I wrong or would not the remaining disk have the generation number bumped on every commit? Would it not make sense to ignore (previously) stale disks and require a manual "re-add" of the failed disks? From a user's perspective with some C coding knowledge this sounds to me (in principle) like something quite simple. E.g. if the superblock UUID matches for all devices and one (or more) devices have a lower generation number than the other(s), then the disk(s) with the newest generation number should be considered good and the other disks with a lower generation number should be marked as failed.
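To make the proposal above concrete, here is a minimal Python sketch of the "newest generation wins" rule. It is only an illustration of the idea as stated in this mail, not how BTRFS actually scans devices: the `DeviceSuperblock` record, the `classify_devices` helper, the device paths and the generation values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceSuperblock:
    """Hypothetical, simplified view of one member device's superblock."""
    path: str
    fsid: str        # filesystem UUID shared by all member devices
    generation: int  # assumed to be bumped on every commit


def classify_devices(devices):
    """Split the devices of one filesystem into (good, stale) lists.

    Devices carrying the highest generation are treated as good; any
    device with a lower generation is treated as stale (to be marked
    failed until it is manually re-added).
    """
    if not devices:
        return [], []
    if len({d.fsid for d in devices}) != 1:
        raise ValueError("devices belong to different filesystems")
    newest = max(d.generation for d in devices)
    good = [d for d in devices if d.generation == newest]
    stale = [d for d in devices if d.generation < newest]
    return good, stale


# Made-up example: the second device missed a few commits.
devices = [
    DeviceSuperblock("/dev/sda2", "fs-uuid-1", 1042),
    DeviceSuperblock("/dev/sdb2", "fs-uuid-1", 1037),
]
good, stale = classify_devices(devices)
```

Note that this rule only looks at the generation number, so it is only sound if every stale device really is an earlier state of the newest one.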
The problem is that if you're defaulting to this behavior, you can have multiple disks diverge from the base. Imagine, for example, a system with two devices in a raid1 setup with degraded mounts enabled by default, and either device randomly taking longer than normal to enumerate. It's very possible for one device to be delayed during enumeration on one boot, then the other on the next boot, and if not handled _exactly_ right by the user, this will result in both devices having a higher generation number than they started with, but neither one being 'wrong'. It's like trying to merge branches in git that both have different changes to a binary file, there's no sane way to handle it without user input.
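The two-boot scenario described above can be walked through with a toy model. This is a sketch under loud assumptions: each device is modelled as a plain list of commit labels and its "generation" is just the list length, which is not how BTRFS stores anything. The helpers `degraded_boot` and `is_prior_state` are hypothetical names made up for this illustration.

```python
def degraded_boot(devices, present, boot_label, commits):
    """Return a copy of `devices` after a degraded boot.

    Only the device named `present` is visible during this boot, so only
    that device receives the new commits.
    """
    updated = {name: list(history) for name, history in devices.items()}
    for i in range(commits):
        updated[present].append(f"{boot_label}-commit{i}")
    return updated


def is_prior_state(older, newer):
    """True if `older` is an exact prefix (earlier state) of `newer`."""
    return newer[:len(older)] == older


# Both devices start from the same base state.
base_history = [f"base-commit{i}" for i in range(3)]
devices = {"A": list(base_history), "B": list(base_history)}

# Boot 1: B is slow to enumerate, so only A is mounted and written.
devices = degraded_boot(devices, present="A", boot_label="boot1", commits=2)
# Boot 2: A is slow to enumerate, so only B is mounted and written.
devices = degraded_boot(devices, present="B", boot_label="boot2", commits=4)

generations = {name: len(history) for name, history in devices.items()}
newest = max(generations, key=generations.get)
a_is_prior_of_b = is_prior_state(devices["A"], devices["B"])
```

In this model both generations have moved past the base, and the device with the lower generation is not an earlier state of the other one, so picking the highest generation would silently drop the commits made during the first boot.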

So why does BTRFS hurry to mount itself even if devices are missing? And if BTRFS can still mount, why would it blindly accept a non-existing disk to take part in the pool?!

Realistically, we can only safely recover from divergence correctly if we can prove that all devices are true prior states of the current highest generation, which is not currently possible to do reliably because of how BTRFS operates.

So what you are saying is that the generation number does not represent a true frozen state of the filesystem at that point?

Also, LVM and MD have the exact same issue, it's just not as significant because they re-add and re-sync missing devices automatically when they reappear, which makes such split-brain scenarios much less likely.

Which means marking the entire device as invalid, then re-adding it from scratch more or less...

* We have _ZERO_ automatic recovery from this situation. This makes both of the above mentioned issues far more dangerous.
See above, would this not be as simple as auto-deleting disks from the pool that have a matching UUID and a mismatch for the superblock generation number? Not exactly a recovery, but the system should be able to limp along.
* It just plain does not work with most systemd setups, because systemd will hang waiting on all the devices to appear due to the fact that they refuse to acknowledge that the only way to correctly know if a BTRFS volume will mount is to just try and mount it.
As far as I have understood this, BTRFS refuses to mount even in redundant setups without the degraded flag. Why?! This is just plain useless. If anything the degraded mount option should be replaced with something like failif=X where X would be anything from 'never', which should get a 2 disk system up with exclusively raid1 profiles even if only one device is working, to 'always' in case any device is failed, or even 'atrisk' when loss of one more device would keep any raid chunk profile guarantee. (This gets admittedly complex in a multi disk raid1 setup or when subvolumes perhaps can be mounted with different "raid" profiles....)
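The failif=X idea above could be sketched roughly as follows. Note that failif is only the option proposed in this mail and is not an existing BTRFS mount option. The `FailIf` enum and `should_refuse_mount` helper are hypothetical, and reading 'atrisk' as "refuse when losing one more device would make data unreadable" is an assumption about the intent. The model also reduces a filesystem to two numbers, which ignores the per-chunk and multi-profile complexity mentioned in the parenthesis.

```python
from enum import Enum


class FailIf(Enum):
    """Hypothetical values for the proposed failif= option."""
    NEVER = "never"
    ATRISK = "atrisk"
    ALWAYS = "always"


def should_refuse_mount(policy, missing, tolerated):
    """Decide whether a mount should be refused.

    missing:   number of member devices currently missing
    tolerated: number of device losses the profile survives
               (assumed 1 for a two-copy raid1, 2 for three copies)
    """
    if missing > tolerated:
        # Data can no longer be read in full, whatever the policy says.
        return True
    if policy is FailIf.ALWAYS:
        return missing > 0
    if policy is FailIf.ATRISK:
        # Assumed reading: all redundancy has already been used up.
        return missing > 0 and missing == tolerated
    return False  # FailIf.NEVER: limp along while data is readable
```

For a two-device raid1 with one device gone, `should_refuse_mount(FailIf.NEVER, 1, 1)` lets the mount proceed, while the other two policies refuse it. With three copies and one device gone, only `FailIf.ALWAYS` refuses.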
The issue with systemd is that if you pass 'degraded' on most systemd systems, and devices are missing when the system tries to mount the volume, systemd won't mount it because it doesn't see all the devices. It doesn't even _try_ to mount it because it doesn't see all the devices. Changing to degraded by default won't fix this, because it's a systemd problem.
The same issue also makes it a serious pain in the arse to recover degraded BTRFS volumes on systemd systems, because if the volume is supposed to mount normally on that system, systemd will unmount it if it doesn't see all the devices, regardless of how it got mounted in the first place.

Why does systemd concern itself about what devices btrfs consists of? Please educate me, I am curious.

IOW, there's a special case with systemd that makes even mounting BTRFS volumes that have missing devices degraded not work.

Well I use systemd on Debian and have not had that issue. In what situation does this fail?

* Given that new kernels still don't properly generate half-raid1 chunks when a device is missing in a two-device raid1 setup, there's a very real possibility that users will have trouble recovering filesystems with old recovery media (IOW, any recovery environment running a kernel before 4.14 will not mount the volume correctly).
Sometimes you have to break a few eggs to make an omelette right? If people want to recover their data they should have backups, and if they are really interested in recovering their data (and don't have backups) then they will probably find this on the web by searching anyway...
Backups aren't the type of recovery I'm talking about. I'm talking about people booting to things like SystemRescueCD to fix system configuration or do offline maintenance without having to nuke the system and restore from backups. Such recovery environments often don't get updated for a _long_ time, and such usage is not atypical as a first step in trying to fix a broken system in situations where downtime really is a serious issue.

I would say that if downtime is such a serious issue you have a failover and a working tested backup.

* You shouldn't be mounting writable and degraded for any reason other than fixing the volume (or converting it to a single profile until you can fix it), even aside from the other issues.
Well in my opinion the degraded mount option is counter intuitive. Unless otherwise asked for the system should mount and work as long as it can guarantee the data can be read and written somehow (regardless if any redundancy guarantee is not met). If the user is willing to accept more or less risk they should configure it!
Again, BTRFS mounting degraded is significantly riskier than LVM or MD doing the same thing. Most users don't properly research things (When's the last time you did a complete cost/benefit analysis before deciding to use a particular piece of software on a system?), and would not know they were taking on significantly higher risk by using BTRFS without configuring it to behave safely until it actually caused them problems, at which point most people would then complain about the resulting data loss instead of trying to figure out why it happened and prevent it in the first place. I don't know about you, but I for one would rather BTRFS have a reputation for being over-aggressively safe by default than risking users' data by default.

Well I don't do cost/benefit analysis since I run free software. I do however try my best to ensure that whatever software I install doesn't cause more drawbacks than benefits. I would also like for BTRFS to be over-aggressively safe, but I also want it to be over-aggressively always running or even limping if that is what it needs to do.
