On 2017-12-19 12:56, Tomasz Pala wrote:
On Tue, Dec 19, 2017 at 11:35:02 -0500, Austin S. Hemmelgarn wrote:

2. printed on screen when creating/converting "RAID1" profile (by btrfs tools),
I don't agree on this one.  It is in no way unreasonable to expect that
someone has read the documentation _before_ trying to use something.

Provided there are:
- a decent documentation AND
- appropriate[*] level of "common knowledge" AND
- stable behaviour and mature code (kernel, tools etc.)

BTRFS lacks all of these - there are major functional changes in current
kernels, reaching far beyond what the LTS kernels carry. All the knowledge YOU
have here, on this mailing list, should be 'engraved' into btrfs-progs, as
there are people still using kernels with serious malfunctions. btrfs-progs
could easily check the kernel version and print an appropriate warning -
consider it a kind of "software quirks" handling.
Except the systems running on those ancient kernel versions are not necessarily using a recent version of btrfs-progs.

It might be possible to write up a script to check the kernel version and report known issues with it, but I don't think having it tightly integrated will be much help, at least not for quite some time.
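To illustrate the kind of thing I mean, a minimal standalone sketch could look roughly like this (the 4.14 cutoff just reuses, as an example, the degraded-RAID1 fix mentioned later in this thread; the issue list and wording are placeholders, not a vetted database):

    #!/bin/sh
    # Hypothetical standalone check, NOT part of btrfs-progs:
    # warn if the running kernel predates a known btrfs fix.
    kver=$(uname -r | cut -d- -f1)

    # true if $1 sorts strictly before $2 in version order (GNU sort -V)
    version_lt() {
        [ "$1" != "$2" ] && \
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
    }

    if version_lt "$kver" "4.14"; then
        echo "WARNING: kernel $kver may force a degraded RAID1 read-only" >&2
        echo "         on subsequent mounts; consider upgrading." >&2
    fi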

[*] by 'appropriate' I mean knowledge as common as the real-world usage
itself.

Moreover, the fact that I've read the documentation and done
comprehensive[**] research today doesn't mean I should have to do it all
again after a kernel change, for example.

[**] apparently what I thought was comprehensive wasn't at all. Most of
the btrfs quirks I've found HERE. As a regular user, not an fs developer, I
shouldn't even be looking at this list.
That last bit is debatable. BTRFS doesn't have separate developer and user lists, so this list serves both purposes (though IRC also serves some of the function of a user list). I'll agree that searching the archives shouldn't be needed to get a baseline of knowledge.

BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
this distro to research every component used?
SuSE also provides very good support by themselves.

[*] yes, I know the recent kernels handle this, but the last LTS (4.14)
is just too young.
4.14 should have gotten that patch last I checked.

I meant too young to be widely adopted yet. This requires some
countermeasures in the part of the toolkit that is easier to upgrade, i.e. userspace.
So in other words, spend the time to write up code for btrfs-progs that will then be run by only a small minority of users (because people using old kernels usually use old userspace, and people using new kernels won't have to care), instead of working on other bugs that are still affecting people?

Regarding handling of degraded mounts, BTRFS _is_ working just fine, we
just chose a different default behavior from MD and LVM (we make certain
the user knows about the issue without having to look through syslog).

I'm not arguing about the behaviour - apparently there were some
technical reasons. But IF the reasons are not technical but
philosophical, I'd like to have either a mount option (allow_degraded) or
even a kernel-level configuration knob to get RAID-style behaviour here.

Now, if current kernels won't force a degraded RAID1 read-only, can I
safely add "degraded" to the mount options? My primary concern is the
machine UPTIME. I care less about the data, as they are backed up to
some remote location and losing a day or a week of changes is acceptable,
split-brain as well, while every hour of downtime costs me real money.
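To be concrete, I mean something along these lines in fstab (the UUID and mount point are just placeholders):

    # example only - allow the volume to come up with a device missing
    UUID=<volume-uuid>  /srv/data  btrfs  defaults,degraded  0  0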
In which case you shouldn't be relying on _ANY_ kind of RAID by itself, let alone BTRFS. If you care that much about uptime, you should be investing in a HA setup and going from there. If downtime costs you money, you need to be accounting for kernel updates and similar things, and therefore should have things set up such that you can reboot a system with no issues.


Meanwhile I can't fix a broken server using 'remote hands' - mounting a degraded
volume means using a physical keyboard or KVM, which might not be available
at a site. Current btrfs behaviour requires physical presence AND downtime
(if the machine rebooted) for fixing things that could otherwise be fixed
remotely and on-line.
Assuming you have a sensibly designed system and are able to do remote management, physical presence should only be required for handling an issue with the root filesystem, and downtime should only be needed long enough to get the other filesystems into a sensible enough state that you can repair them the rest of the way online. There's not really anything you can do about the root filesystem, but sensible organization of application data can mitigate the issues for the other filesystems.

Anyway, users shouldn't look through syslog, device status should be
reported by some monitoring tool.
This is a common complaint, and based on developer response, I think the consensus is that it's out of scope for the time being. There have been some people starting work on such things, but nobody really got anywhere because most of the users who care enough about monitoring to be interested are already using some external monitoring tool that is easy to hook into.

TBH, you essentially need external monitoring in most RAID situations anyway unless you've got some pre-built purpose specific system that already includes it (see FreeNAS for an example).
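Hooking BTRFS into whatever you already run is usually trivial anyway; a rough example (mount point is just an example) that flags any non-zero per-device error counter:

    # exits non-zero and prints the offending lines if any counter is non-zero
    btrfs device stats /mnt | awk '$NF != 0 { bad = 1; print } END { exit bad }'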

A deviation this big (relative to common RAID1 scenarios) deserves to be
documented.
Or renamed...
Really? Some examples of where MD and LVM provide direct monitoring without needing third party software, please. LVM technically has the ability to handle it through dmeventd, but it's decidedly non-trivial to monitor state with that directly, and as a result almost everyone uses third party software there. MD I don't have as much background with (I prefer the flexibility LVM offers), but anything I've seen regarding that requires manual setup of some external software as well.
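For what it's worth, even the monitoring that mdadm itself ships has to be wired up by hand before anyone gets notified - from memory, something along these lines:

    # /etc/mdadm.conf
    MAILADDR admin@example.com

    # and the monitor has to actually be running
    # (usually via a distro-provided mdmonitor service):
    mdadm --monitor --scan --daemonise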

reliability, and all the discussion of reliability assumes that either:
1. Disks fail catastrophically.
or:
2. Disks return read or write errors when there is a problem.

Following just those constraints, RAID is not designed to handle devices
that randomly drop off the bus and reappear

If it drops, there would be I/O errors eventually. Without the errors - agreed.
Classical hardware RAID will kick the device when it drops, and then never re-add it, just like BTRFS functionally does. The only difference is how they then treat the 'failed' disk. Hardware RAID will stop using it, BTRFS will keep trying to use it.

implementations.  As people are quick to point out BTRFS _IS NOT_ RAID,
the devs just made a poor choice in the original naming of the 2-way
replication implementation, and it stuck.

Well, the question is: either it is not raid YET, or maybe it's time to 
consider renaming?
Again, the naming is too ingrained. At a minimum, you will have to keep the old naming, and at that point you're just wasting time and making things _more_ confusing because some documentation will use the old naming and some will use the new (keep in mind that third-party documentation rarely gets updated).

3. if the sysadmin doesn't request any kind of device autobinding, a
device that has already failed doesn't matter anymore - regardless of
its current state or reappearances.
You have to explicitly disable automatic binding of drivers to
hot-plugged devices though, so that's rather irrelevant.  Yes, you can

Ha! I have this disabled on every bus (although for different reasons)
after boot completes. Lucky me :)
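(For the record, disabling it on every bus is roughly a matter of:)

    # stop the kernel from automatically binding drivers to newly
    # hot-plugged devices, on every bus that exposes the knob
    for f in /sys/bus/*/drivers_autoprobe; do
        echo 0 > "$f"
    done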
Security I'm guessing (my laptop behaves like that for USB devices for that exact reason)? It's a viable option on systems that are tightly controlled. Once you look at consumer devices though, it's just impractical. People expect hardware to just work when they plug it in these days.

1. "known" only to the people that already stepped into it, meaning too
     late - it should be "COMMONLY known", i.e. documented,
And also known to people who have done proper research.

All the OpenSUSE userbase? ;)
I don't think you quite understand what the SuSE business model is. SuSE does the research, and then provides support for customers so they don't have to. Red Hat has a similar model. Most normal distros however, do not, and those people using them need to be doing proper research.

2. "older kernels" are not so old, the newest mature LTS (4.9) is still
     affected,
I really don't see this as a valid excuse.  It's pretty well documented
that you absolutely should be running the most recent kernel if you're
using BTRFS.

Good point.

4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE,
     as long as you accept "no more redundancy"...
This is a matter of opinion.

Sure! And the particular opinion depends on the system being affected. I'd
rather not have any split-brain scenario under my database servers, but
I also won't mind data loss on a BGP router as long as it keeps running and
is fully operational.

I still contend that running half a two
device array for an extended period of time without reshaping it to be a
single device is a bad idea for cases other than BTRFS.  The fewer
layers of code you're going through, the safer you are.

I create a single-device degraded MD RAID1 when I attach one disk for
deployment (usually test machines); these are going to be converted into
dual-disk (production) setups in the future - attaching a second disk to the
array is much easier and faster than messing with device nodes (or labels or
anything). The same applies to LVM: it's better to have it in place even when
not used at the moment. In case of btrfs there is no need for such
preparations, as devices are added without renaming.
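(Roughly like this, with the device names being just examples:)

    # create a two-slot RAID1 with the second member deliberately missing
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 missing

    # ...later, when the machine goes to production:
    mdadm --manage /dev/md0 --add /dev/sdb2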
Unless you're pulling some complex black magic, you're not running degraded, you're running both in single device mode (which is not the same as a degraded two device RAID1 array) and converting to two device RAID1 later, which is a perfectly normal use case I have absolutely no issues with.
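(For reference, the BTRFS side of "attaching the second disk later" is roughly this - device names and mount point are just examples:)

    btrfs device add /dev/sdb2 /mnt
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt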

However, sometimes the systems end up without the second disk attached -
either due to their low importance, sometimes power usage; others
need to be quiet.

One might ask why I don't attach the second disk before the initial system
creation - the answer is simple: I usually use the same drive models in
RAID1, and it happens that drives bought from the same production lot
fail simultaneously, so this approach mitigates the problem and gives
more time to react.
You appear to be misunderstanding me here. I'm not saying I think running with a single disk is bad, I'm saying that I feel that running with a single disk and not telling the storage stack that the other one isn't coming back any time soon is bad.

IOW, if I lose a disk in a two device BTRFS volume set up for replication, I'll mount it degraded, convert it from the raid1 profile to the single profile, and then remove the missing disk from the volume. Similarly, for a 2 device LVM RAID1 LV, I would use lvconvert to get a regular linear LV. Going through the multi-device code in BTRFS or the DM-RAID code in LVM when you've only got one actual device is a waste of processing power, and adds another layer where things can go wrong.
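Off the top of my head, the BTRFS side of that looks roughly like this (device name and mount point are just examples):

    mount -o degraded /dev/sda3 /mnt
    # -f is needed because the metadata conversion reduces redundancy
    btrfs balance start -f -dconvert=single -mconvert=single /mnt
    btrfs device remove missing /mnt

and the LVM side is essentially just:

    # drop the RAID1 LV down to a plain linear LV (names are examples)
    lvconvert -m0 vg00/data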

Patches would be gratefully accepted.  It's really not hard to update
the documentation, it's just that nobody has had the time to do it.

Writing accurate documentation requires a deep understanding of the internals.
Me, for example - I know some of the results: "don't do this", "if X happens, Y
should be done", "Z doesn't work yet, but there were some patches", "V
was fixed in some recent kernel, but no idea which commit it was
exactly", "W was severely broken in kernel I.J.K" etc. Not the hard data
that could be posted without creating the impression that it's all
about compiling a list of complaints. Not to mention I'm absolutely not familiar
with current patches, WIP and many many other corner cases or usage
scenarios. In fact, not only the internals, but the motivation and design
principles must be well understood to write a piece of documentation.
Writing up something like that is near useless. It would only be valid for upstream kernels (and if you're using upstream kernels and following the advice of keeping up to date, what does it matter anyway? The moment a new btrfs-progs gets released, you're already going to be on a kernel that fixes the issues it reports.), because distros do whatever the hell they want with version numbers (RHEL for example is notorious for using _ancient_ version numbers but having bunches of stuff back-ported, and most other big distros that aren't Arch, Gentoo, or Slackware derived do so too to a lesser degree), and it would require constant curation to keep up to date. It only makes sense for long-term known issues, but those absolutely should be documented in the regular documentation, and doing that really isn't that hard if you just go for current issues.

Otherwise some "fake news" propaganda gets created, just like
https://suckless.org/sucks/systemd or other systemd-haters who haven't
spent a day of their life writing SysV init scripts or managing a
bunch of mission-critical machines with handcrafted supervisors.
I hate to tell you that:
1. This type of thing happens regardless. Systemd has just garnered a lot of hatred because it redesigned everything from the ground up and was then functionally forced on most of the Linux community.
2. There are quite a few of us who dislike systemd who have had to handle actual systems administration before (and quite a few such individuals are primarily complaining about other aspects of systemd, like the journal crap or how it handles manually mounted filesystems for which mount units exist (namely, if it thinks the underlying device isn't ready, it will unmount them immediately, even if the user just manually mounted them), not the service files replacing init scripts).