> * Normal desktop users _never_ look at the log files or boot info, and rarely run monitoring programs, so they as a general rule won't notice until it's already too late. BTRFS isn't just a server filesystem, so it needs to be safe for regular users too.

I guess a normal desktop user wouldn't create a RAID1 or any other RAID setup, right? So an admin takes care of a RAID and monitors it (it doesn't matter whether it's a hardware RAID, mdraid, a ZFS RAID, or whatever), and degraded only matters for RAID setups; it's not relevant for single-disk usage, right?
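(As an aside: the monitoring being referred to here does not have to be anything elaborate. A minimal periodic check could look something like the lines below; /mnt stands in for the actual mount point, and this is only a sketch, not a complete monitoring setup.)

    # show per-device I/O and corruption error counters (non-zero means trouble)
    btrfs device stats /mnt

    # list the member devices of the filesystem; a missing device shows up here
    btrfs filesystem show /mnt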
> Also, LVM and MD have the exact same issue, it's just not as significant because they re-add and re-sync missing devices automatically when they reappear, which makes such split-brain scenarios much less likely.

Why doesn't btrfs do that?

On Thursday, February 7, 2019 2:39:34 PM CET Austin S. Hemmelgarn wrote:
> On 2019-02-07 13:53, waxhead wrote:
> >
> > Austin S. Hemmelgarn wrote:
> >> On 2019-02-07 06:04, Stefan K wrote:
> >>> Thanks, with degraded as kernel parameter and also in the fstab it works as expected.
> >>>
> >>> That should be the normal behaviour, because a server must be up and running, and I don't care about a device loss, that's why I use a RAID1. The device-loss problem I can fix later, but it's important that the server is up and running. I get informed at boot time and also in the log files that a device is missing, and I also see it if I use a monitoring program.
> >> No, it shouldn't be the default, because:
> >>
> >> * Normal desktop users _never_ look at the log files or boot info, and rarely run monitoring programs, so they as a general rule won't notice until it's already too late. BTRFS isn't just a server filesystem, so it needs to be safe for regular users too.
> >
> > I am willing to argue that whatever you refer to as normal users don't have a clue how to make a raid1 filesystem, nor do they care about what underlying filesystem their computer runs. I can't quite see how a limping system would be worse than a failing system in this case. Besides, "normal" desktop users use Windows anyway; people that run on penguin-powered stuff generally have at least some technical knowledge.
> Once you get into stuff like Arch or Gentoo, yeah, people tend to have enough technical knowledge to handle this type of thing, but if you're talking about the big distros like Ubuntu or Fedora, not so much. Yes, I might be a bit pessimistic here, but that pessimism is based on personal experience over many years of providing technical support for people.
>
> Put differently, human nature is to ignore things that aren't immediately relevant. Kernel logs don't matter until you see something wrong. Boot messages don't matter unless you happen to see them while the system is booting (and most people don't). Monitoring is the only way here, but most people won't invest the time in proper monitoring until they have problems. Even as a seasoned sysadmin, I never look at kernel logs until I see a problem, I rarely see boot messages on most of the systems I manage (because I'm rarely sitting at the console when they boot up, and when I am I'm usually handling startup of a dozen or so systems simultaneously after a network-wide outage), and I only monitor things that I know for certain need to be monitored.
> >
> >> * It's easily possible to end up mounting degraded by accident if one of the constituent devices is slow to enumerate, and this can easily result in a split-brain scenario where all devices have diverged and the volume can only be repaired by recreating it from scratch.
> >
> > Am I wrong, or would not the remaining disk have the generation number bumped on every commit? Would it not make sense to ignore (previously) stale disks and require a manual "re-add" of the failed disks? From a user's perspective with some C coding knowledge this sounds (in principle) like something quite simple. E.g. if the superblock UUIDs match for all devices and one (or more) devices has a lower generation number than the other(s), then the disk(s) with the newest generation number should be considered good and the other disks with a lower generation number should be marked as failed.
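(For anyone who wants to see what such a comparison would look at in practice: the per-device superblocks can be dumped with btrfs-progs, and the fsid and generation fields are the ones in question. A rough manual check, with /dev/sda1 and /dev/sdb1 as placeholder member devices, might be:)

    # dump each member device's superblock and compare fsid and generation;
    # a matching fsid but a lower generation would mark that device as stale
    btrfs inspect-internal dump-super /dev/sda1 | grep -E '^(fsid|generation)'
    btrfs inspect-internal dump-super /dev/sdb1 | grep -E '^(fsid|generation)'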
> The problem is that if you're defaulting to this behavior, you can have multiple disks diverge from the base. Imagine, for example, a system with two devices in a raid1 setup with degraded mounts enabled by default, and either device randomly taking longer than normal to enumerate. It's very possible for one device to delay during enumeration on one boot, then the other on the next boot, and if not handled _exactly_ right by the user, this will result in both devices having a higher generation number than they started with, but neither one being 'wrong'. It's like trying to merge branches in git that both have different changes to a binary file; there's no sane way to handle it without user input.
>
> Realistically, we can only safely recover from divergence correctly if we can prove that all devices are true prior states of the current highest generation, which is not currently possible to do reliably because of how BTRFS operates.
>
> Also, LVM and MD have the exact same issue, it's just not as significant because they re-add and re-sync missing devices automatically when they reappear, which makes such split-brain scenarios much less likely.
> >
> >> * We have _ZERO_ automatic recovery from this situation. This makes both of the above mentioned issues far more dangerous.
> >
> > See above, would this not be as simple as auto-deleting disks from the pool that have a matching UUID and a mismatch for the superblock generation number? Not exactly a recovery, but the system should be able to limp along.
> >
> >> * It just plain does not work with most systemd setups, because systemd will hang waiting on all the devices to appear due to the fact that they refuse to acknowledge that the only way to correctly know if a BTRFS volume will mount is to just try and mount it.
> >
> > As far as I have understood this, BTRFS refuses to mount even in redundant setups without the degraded flag. Why?! This is just plain useless. If anything, the degraded mount option should be replaced with something like failif=X, where X would be anything from 'never', which should get a 2-disk system up with exclusively raid1 profiles even if only one device is working, to 'always' in case any device has failed, or even 'atrisk' when loss of one more device would break any raid chunk profile guarantee. (This admittedly gets complex in a multi-disk raid1 setup, or when subvolumes can perhaps be mounted with different "raid" profiles....)
> The issue with systemd is that if you pass 'degraded' on most systemd systems, and devices are missing when the system tries to mount the volume, systemd won't mount it because it doesn't see all the devices. It doesn't even _try_ to mount it because it doesn't see all the devices. Changing to degraded by default won't fix this, because it's a systemd problem.
>
> The same issue also makes it a serious pain in the arse to recover degraded BTRFS volumes on systemd systems, because if the volume is supposed to mount normally on that system, systemd will unmount it if it doesn't see all the devices, regardless of how it got mounted in the first place.
>
> IOW, there's a special case with systemd that makes even mounting BTRFS volumes that have missing devices degraded not work.
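(For reference, 'degraded as kernel parameter and also in the fstab' from the top of the thread boils down to something like the following, with the UUID as a placeholder; the systemd caveats above still apply:)

    # /etc/fstab: add 'degraded' to the btrfs mount options
    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /  btrfs  defaults,degraded  0  0

    # kernel command line: pass the same option for the root filesystem
    rootflags=degraded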
> >> * Given that new kernels still don't properly generate half-raid1 chunks when a device is missing in a two-device raid1 setup, there's a very real possibility that users will have trouble recovering filesystems with old recovery media (IOW, any recovery environment running a kernel before 4.14 will not mount the volume correctly).
> > Sometimes you have to break a few eggs to make an omelette, right? If people want to recover their data they should have backups, and if they are really interested in recovering their data (and don't have backups) then they will probably find this on the web by searching anyway...
> Backups aren't the type of recovery I'm talking about. I'm talking about people booting to things like SystemRescueCD to fix system configuration or do offline maintenance without having to nuke the system and restore from backups. Such recovery environments often don't get updated for a _long_ time, and such usage is not atypical as a first step in trying to fix a broken system in situations where downtime really is a serious issue.
> >
> >> * You shouldn't be mounting writable and degraded for any reason other than fixing the volume (or converting it to a single profile until you can fix it), even aside from the other issues.
> >
> > Well, in my opinion the degraded mount option is counter-intuitive. Unless otherwise asked, the system should mount and work as long as it can guarantee the data can be read and written somehow (regardless of whether any redundancy guarantee is met). If the user is willing to accept more or less risk, they should configure it!
> Again, BTRFS mounting degraded is significantly riskier than LVM or MD doing the same thing. Most users don't properly research things (when's the last time you did a complete cost/benefit analysis before deciding to use a particular piece of software on a system?), and would not know they were taking on significantly higher risk by using BTRFS without configuring it to behave safely until it actually caused them problems, at which point most people would then complain about the resulting data loss instead of trying to figure out why it happened and prevent it in the first place. I don't know about you, but I for one would rather BTRFS have a reputation for being over-aggressively safe by default than risk users' data by default.
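(For completeness, the 'fix the volume or convert it to a single profile' path mentioned above is commonly described as roughly the following sequence; this is only a sketch, the device names, devid and mount point are placeholders, and the current btrfs documentation should be consulted before doing this on a real system:)

    # mount the surviving device writable and degraded, purely to repair the volume
    mount -o degraded /dev/sda1 /mnt

    # either replace the dead device with a new one (the devid of the missing
    # device is shown by 'btrfs filesystem show')...
    btrfs replace start 2 /dev/sdc1 /mnt

    # ...or convert to non-redundant profiles and drop the missing device
    btrfs balance start -dconvert=single -mconvert=dup /mnt
    btrfs device remove missing /mnt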