> * Normal desktop users _never_ look at the log files or boot info, and rarely run monitoring programs, so they as a general rule won't notice until it's already too late. BTRFS isn't just a server filesystem, so it needs to be safe for regular users too.

I guess a normal desktop user wouldn't create a RAID1 or any other RAID setup, right? So an admin takes care of a RAID and monitors it (it doesn't matter whether it's a hardware RAID, mdraid, a ZFS RAID, or whatever), and degraded only matters for RAID setups; it's not relevant for single-disk usage, right?
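(As an aside: the monitoring being referred to here does not have to be anything elaborate. A minimal periodic check could look something like the lines below; /mnt stands in for the actual mount point, and this is only a sketch, not a complete monitoring setup.)

    # show per-device I/O and corruption error counters (non-zero means trouble)
    btrfs device stats /mnt

    # list the member devices of the filesystem; a missing device shows up here
    btrfs filesystem show /mnt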
> Also, LVM and MD have the exact same issue, it's just not as significant because they re-add and re-sync missing devices automatically when they reappear, which makes such split-brain scenarios much less likely.

Why doesn't btrfs do that?

On Thursday, February 7, 2019 2:39:34 PM CET Austin S. Hemmelgarn wrote:
> On 2019-02-07 13:53, waxhead wrote:
> >
> > Austin S. Hemmelgarn wrote:
> >> On 2019-02-07 06:04, Stefan K wrote:
> >>> Thanks, with degraded as kernel parameter and also in the fstab it works as expected.
> >>>
> >>> That should be the normal behaviour, because a server must be up and running, and I don't care about a device loss, that's why I use a RAID1. The device-loss problem I can fix later, but it's important that the server is up and running. I get informed at boot time and also in the log files that a device is missing, and I also see it if I use a monitoring program.
> >> No, it shouldn't be the default, because:
> >>
> >> * Normal desktop users _never_ look at the log files or boot info, and rarely run monitoring programs, so they as a general rule won't notice until it's already too late. BTRFS isn't just a server filesystem, so it needs to be safe for regular users too.
> >
> > I am willing to argue that whatever you refer to as normal users don't have a clue how to make a raid1 filesystem, nor do they care about what underlying filesystem their computer runs. I can't quite see how a limping system would be worse than a failing system in this case. Besides, "normal" desktop users use Windows anyway; people that run on penguin-powered stuff generally have at least some technical knowledge.
> Once you get into stuff like Arch or Gentoo, yeah, people tend to have enough technical knowledge to handle this type of thing, but if you're talking about the big distros like Ubuntu or Fedora, not so much. Yes, I might be a bit pessimistic here, but that pessimism is based on personal experience over many years of providing technical support for people.
>
> Put differently, human nature is to ignore things that aren't immediately relevant. Kernel logs don't matter until you see something wrong. Boot messages don't matter unless you happen to see them while the system is booting (and most people don't). Monitoring is the only way here, but most people won't invest the time in proper monitoring until they have problems. Even as a seasoned sysadmin, I never look at kernel logs until I see a problem, I rarely see boot messages on most of the systems I manage (because I'm rarely sitting at the console when they boot up, and when I am I'm usually handling startup of a dozen or so systems simultaneously after a network-wide outage), and I only monitor things that I know for certain need to be monitored.
> >
> >> * It's easily possible to end up mounting degraded by accident if one of the constituent devices is slow to enumerate, and this can easily result in a split-brain scenario where all devices have diverged and the volume can only be repaired by recreating it from scratch.
> >
> > Am I wrong, or would not the remaining disk have the generation number bumped on every commit? Would it not make sense to ignore (previously) stale disks and require a manual "re-add" of the failed disks? From a user's perspective with some C coding knowledge this sounds (in principle) like something quite simple. E.g. if the superblock UUIDs match for all devices and one (or more) devices has a lower generation number than the other(s), then the disk(s) with the newest generation number should be considered good and the other disks with a lower generation number should be marked as failed.
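(For anyone who wants to see what such a comparison would look at in practice: the per-device superblocks can be dumped with btrfs-progs, and the fsid and generation fields are the ones in question. A rough manual check, with /dev/sda1 and /dev/sdb1 as placeholder member devices, might be:)

    # dump each member device's superblock and compare fsid and generation;
    # a matching fsid but a lower generation would mark that device as stale
    btrfs inspect-internal dump-super /dev/sda1 | grep -E '^(fsid|generation)'
    btrfs inspect-internal dump-super /dev/sdb1 | grep -E '^(fsid|generation)'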
> The problem is that if you're defaulting to this behavior, you can have multiple disks diverge from the base. Imagine, for example, a system with two devices in a raid1 setup with degraded mounts enabled by default, and either device randomly taking longer than normal to enumerate. It's very possible for one device to delay during enumeration on one boot, then the other on the next boot, and if not handled _exactly_ right by the user, this will result in both devices having a higher generation number than they started with, but neither one being 'wrong'. It's like trying to merge branches in git that both have different changes to a binary file; there's no sane way to handle it without user input.
>
> Realistically, we can only safely recover from divergence correctly if we can prove that all devices are true prior states of the current highest generation, which is not currently possible to do reliably because of how BTRFS operates.
>
> Also, LVM and MD have the exact same issue, it's just not as significant because they re-add and re-sync missing devices automatically when they reappear, which makes such split-brain scenarios much less likely.
> >
> >> * We have _ZERO_ automatic recovery from this situation. This makes both of the above mentioned issues far more dangerous.
> >
> > See above, would this not be as simple as auto-deleting disks from the pool that have a matching UUID and a mismatch for the superblock generation number? Not exactly a recovery, but the system should be able to limp along.
> >
> >> * It just plain does not work with most systemd setups, because systemd will hang waiting on all the devices to appear due to the fact that they refuse to acknowledge that the only way to correctly know if a BTRFS volume will mount is to just try and mount it.
> >
> > As far as I have understood this, BTRFS refuses to mount even in redundant setups without the degraded flag. Why?! This is just plain useless. If anything, the degraded mount option should be replaced with something like failif=X, where X would be anything from 'never', which should get a 2-disk system up with exclusively raid1 profiles even if only one device is working, to 'always' in case any device has failed, or even 'atrisk' when loss of one more device would break any raid chunk profile guarantee. (This admittedly gets complex in a multi-disk raid1 setup, or when subvolumes can perhaps be mounted with different "raid" profiles....)
> The issue with systemd is that if you pass 'degraded' on most systemd systems, and devices are missing when the system tries to mount the volume, systemd won't mount it because it doesn't see all the devices. It doesn't even _try_ to mount it because it doesn't see all the devices. Changing to degraded by default won't fix this, because it's a systemd problem.
>
> The same issue also makes it a serious pain in the arse to recover degraded BTRFS volumes on systemd systems, because if the volume is supposed to mount normally on that system, systemd will unmount it if it doesn't see all the devices, regardless of how it got mounted in the first place.
>
> IOW, there's a special case with systemd that makes even mounting BTRFS volumes that have missing devices degraded not work.
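(For reference, 'degraded as kernel parameter and also in the fstab' from the top of the thread boils down to something like the following, with the UUID as a placeholder; the systemd caveats above still apply:)

    # /etc/fstab: add 'degraded' to the btrfs mount options
    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /  btrfs  defaults,degraded  0  0

    # kernel command line: pass the same option for the root filesystem
    rootflags=degraded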
> >> * Given that new kernels still don't properly generate half-raid1 chunks when a device is missing in a two-device raid1 setup, there's a very real possibility that users will have trouble recovering filesystems with old recovery media (IOW, any recovery environment running a kernel before 4.14 will not mount the volume correctly).
> > Sometimes you have to break a few eggs to make an omelette, right? If people want to recover their data they should have backups, and if they are really interested in recovering their data (and don't have backups) then they will probably find this on the web by searching anyway...
> Backups aren't the type of recovery I'm talking about. I'm talking about people booting to things like SystemRescueCD to fix system configuration or do offline maintenance without having to nuke the system and restore from backups. Such recovery environments often don't get updated for a _long_ time, and such usage is not atypical as a first step in trying to fix a broken system in situations where downtime really is a serious issue.
> >
> >> * You shouldn't be mounting writable and degraded for any reason other than fixing the volume (or converting it to a single profile until you can fix it), even aside from the other issues.
> >
> > Well, in my opinion the degraded mount option is counter-intuitive. Unless otherwise asked, the system should mount and work as long as it can guarantee the data can be read and written somehow (regardless of whether any redundancy guarantee is met). If the user is willing to accept more or less risk, they should configure it!
> Again, BTRFS mounting degraded is significantly riskier than LVM or MD doing the same thing. Most users don't properly research things (when's the last time you did a complete cost/benefit analysis before deciding to use a particular piece of software on a system?), and would not know they were taking on significantly higher risk by using BTRFS without configuring it to behave safely until it actually caused them problems, at which point most people would then complain about the resulting data loss instead of trying to figure out why it happened and prevent it in the first place. I don't know about you, but I for one would rather BTRFS have a reputation for being over-aggressively safe by default than risk users' data by default.
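(For completeness, the 'fix the volume or convert it to a single profile' path mentioned above is commonly described as roughly the following sequence; this is only a sketch, the device names, devid and mount point are placeholders, and the current btrfs documentation should be consulted before doing this on a real system:)

    # mount the surviving device writable and degraded, purely to repair the volume
    mount -o degraded /dev/sda1 /mnt

    # either replace the dead device with a new one (the devid of the missing
    # device is shown by 'btrfs filesystem show')...
    btrfs replace start 2 /dev/sdc1 /mnt

    # ...or convert to non-redundant profiles and drop the missing device
    btrfs balance start -dconvert=single -mconvert=dup /mnt
    btrfs device remove missing /mnt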