On 2017-12-19 12:56, Tomasz Pala wrote:
On Tue, Dec 19, 2017 at 11:35:02 -0500, Austin S. Hemmelgarn wrote:

2. printed on screen when creating/converting "RAID1" profile (by btrfs tools),
I don't agree on this one.  It is in no way unreasonable to expect that
someone has read the documentation _before_ trying to use something.

Provided there are:
- a decent documentation AND
- appropriate[*] level of "common knowledge" AND
- stable behaviour and mature code (kernel, tools etc.)

BTRFS lacks all of these - there are major functional changes in current
kernels, reaching far beyond what the LTS kernels carry. All the knowledge YOU
have here, on this mailing list, should be 'engraved' into btrfs-progs, as
there are people still using kernels with serious malfunctions. btrfs-progs
could easily check the kernel version and print an appropriate warning -
consider it a kind of "software quirks" handling.
Except the systems running on those ancient kernel versions are not necessarily using a recent version of btrfs-progs.

It might be possible to write up a script to check the kernel version and report known issues with it, but I don't think having it tightly integrated will be much help, at least not for quite some time.
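To illustrate the kind of thing I mean, a minimal standalone sketch could look roughly like this (the 4.14 cutoff just reuses, as an example, the degraded-RAID1 fix mentioned later in this thread; the issue list and wording are placeholders, not a vetted database):

    #!/bin/sh
    # Hypothetical standalone check, NOT part of btrfs-progs:
    # warn if the running kernel predates a known btrfs fix.
    kver=$(uname -r | cut -d- -f1)

    # true if $1 sorts strictly before $2 in version order (GNU sort -V)
    version_lt() {
        [ "$1" != "$2" ] && \
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
    }

    if version_lt "$kver" "4.14"; then
        echo "WARNING: kernel $kver may force a degraded RAID1 read-only" >&2
        echo "         on subsequent mounts; consider upgrading." >&2
    fi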

[*] by 'appropriate' I mean knowledge as common as the real-world usage
itself.

Moreover, the fact that I've read the documentation and done
comprehensive[**] research today doesn't mean I should have to do it all
again after a kernel change, for example.

[**] apparently what I thought was comprehensive wasn't at all. Most of
the btrfs quirks I've found HERE. As a regular user, not an fs developer, I
shouldn't even be looking at this list.
That last bit is debatable. BTRFS doesn't have separate developer and user lists, so this list serves both purposes (though IRC also serves some of the function of a user list). I'll agree that searching the archives shouldn't be needed to get a baseline of knowledge.

BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
this distro to research every component used?
SuSE also provides very good support by themselves.

[*] yes, I know the recent kernels handle this, but the last LTS (4.14)
is just too young.
4.14 should have gotten that patch last I checked.

I meant too young to be widely adopted yet. This requires some
countermeasures in the part of the toolkit that is easier to upgrade, i.e. userspace.
So in other words, spend the time to write up code for btrfs-progs that will then be run by only a small minority of users (because people using old kernels usually use old userspace, and people using new kernels won't have to care), instead of working on other bugs that are still affecting people?

Regarding handling of degraded mounts, BTRFS _is_ working just fine, we
just chose a different default behavior from MD and LVM (we make certain
the user knows about the issue without having to look through syslog).

I'm not arguing about the behaviour - apparently there were some
technical reasons. But IF the reasons are not technical but
philosophical, I'd like to have either a mount option (allow_degraded) or
even a kernel-level configuration knob to get RAID-style behaviour here.

Now, if current kernels won't force a degraded RAID1 read-only, can I
safely add "degraded" to the mount options? My primary concern is the
machine UPTIME. I care less about the data, as they are backed up to
some remote location and losing a day or a week of changes is acceptable,
split-brain as well, while every hour of downtime costs me real money.
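To be concrete, I mean something along these lines in fstab (the UUID and mount point are just placeholders):

    # example only - allow the volume to come up with a device missing
    UUID=<volume-uuid>  /srv/data  btrfs  defaults,degraded  0  0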
In which case you shouldn't be relying on _ANY_ kind of RAID by itself, let alone BTRFS. If you care that much about uptime, you should be investing in a HA setup and going from there. If downtime costs you money, you need to be accounting for kernel updates and similar things, and therefore should have things set up such that you can reboot a system with no issues.


Meanwhile I can't fix a broken server using 'remote hands' - mounting a degraded
volume means using a physical keyboard or KVM, which might not be available
at a site. Current btrfs behaviour requires physical presence AND downtime
(if the machine rebooted) for fixing things that could otherwise be fixed
remotely and on-line.
Assuming you have a sensibly designed system and are able to do remote management, physical presence should only be required for handling an issue with the root filesystem, and downtime should only be needed long enough to get the other filesystems into a sensible enough state that you can repair them the rest of the way online. There's not really anything you can do about the root filesystem, but sensible organization of application data can mitigate the issues for the other filesystems.

Anyway, users shouldn't look through syslog, device status should be
reported by some monitoring tool.
This is a common complaint, and based on developer response, I think the consensus is that it's out of scope for the time being. There have been some people starting work on such things, but nobody really got anywhere because most of the users who care enough about monitoring to be interested are already using some external monitoring tool that is easy to hook into.

TBH, you essentially need external monitoring in most RAID situations anyway unless you've got some pre-built purpose specific system that already includes it (see FreeNAS for an example).
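Hooking BTRFS into whatever you already run is usually trivial anyway; a rough example (mount point is just an example) that flags any non-zero per-device error counter:

    # exits non-zero and prints the offending lines if any counter is non-zero
    btrfs device stats /mnt | awk '$NF != 0 { bad = 1; print } END { exit bad }'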

A deviation this big (relative to common RAID1 scenarios) deserves to be
documented.
Or renamed...
Really? Some examples of where MD and LVM provide direct monitoring without needing third party software, please. LVM technically has the ability to handle it through dmeventd, but it's decidedly non-trivial to monitor state with that directly, and as a result almost everyone uses third party software there. MD I don't have as much background with (I prefer the flexibility LVM offers), but anything I've seen regarding that requires manual setup of some external software as well.
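For what it's worth, even the monitoring that mdadm itself ships has to be wired up by hand before anyone gets notified - from memory, something along these lines:

    # /etc/mdadm.conf
    MAILADDR admin@example.com

    # and the monitor has to actually be running
    # (usually via a distro-provided mdmonitor service):
    mdadm --monitor --scan --daemonise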

reliability, and all the discussion of reliability assumes that either:
1. Disks fail catastrophically.
or:
2. Disks return read or write errors when there is a problem.

Following just those constraints, RAID is not designed to handle devices
that randomly drop off the bus and reappear

If it drops, there would be I/O errors eventually. Without the errors - agreed.
Classical hardware RAID will kick the device when it drops, and then never re-add it, just like BTRFS functionally does. The only difference is how they then treat the 'failed' disk. Hardware RAID will stop using it, BTRFS will keep trying to use it.

implementations.  As people are quick to point out BTRFS _IS NOT_ RAID,
the devs just made a poor choice in the original naming of the 2-way
replication implementation, and it stuck.

Well, the question is: either it is not raid YET, or maybe it's time to 
consider renaming?
Again, the naming is too ingrained. At a minimum, you will have to keep the old naming, and at that point you're just wasting time and making things _more_ confusing because some documentation will use the old naming and some will use the new (keep in mind that third-party documentation rarely gets updated).

3. if the sysadmin doesn't request any kind of device autobinding, a
device that has already failed doesn't matter anymore - regardless of
its current state or reappearances.
You have to explicitly disable automatic binding of drivers to
hot-plugged devices though, so that's rather irrelevant.  Yes, you can

Ha! I have this disabled on every bus (although for different reasons)
after boot completes. Lucky me :)
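(For the record, disabling it on every bus is roughly a matter of:)

    # stop the kernel from automatically binding drivers to newly
    # hot-plugged devices, on every bus that exposes the knob
    for f in /sys/bus/*/drivers_autoprobe; do
        echo 0 > "$f"
    done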
Security I'm guessing (my laptop behaves like that for USB devices for that exact reason)? It's a viable option on systems that are tightly controlled. Once you look at consumer devices though, it's just impractical. People expect hardware to just work when they plug it in these days.

1. "known" only to the people that already stepped into it, meaning too
     late - it should be "COMMONLY known", i.e. documented,
And also known to people who have done proper research.

All the OpenSUSE userbase? ;)
I don't think you quite understand what the SuSE business model is. SuSE does the research, and then provides support for customers so they don't have to. Red Hat has a similar model. Most normal distros however, do not, and those people using them need to be doing proper research.

2. "older kernels" are not so old, the newest mature LTS (4.9) is still
     affected,
I really don't see this as a valid excuse.  It's pretty well documented
that you absolutely should be running the most recent kernel if you're
using BTRFS.

Good point.

4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE,
     as long as you accept "no more redundancy"...
This is a matter of opinion.

Sure! And the particular opinion depends on the system being affected. I'd
rather not have any split-brain scenario under my database servers, but
I also won't mind data loss on a BGP router as long as it keeps running and
is fully operational.

I still contend that running half a two
device array for an extended period of time without reshaping it to be a
single device is a bad idea for cases other than BTRFS.  The fewer
layers of code you're going through, the safer you are.

I create a single-device degraded MD RAID1 when I attach one disk for
deployment (usually test machines); these are going to be converted into
dual-disk (production) setups in the future - attaching a second disk to the
array is much easier and faster than messing with device nodes (or labels or
anything). The same applies to LVM: it's better to have it in place even when
not used at the moment. In case of btrfs there is no need for such
preparations, as devices are added without renaming.
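(Roughly like this, with the device names being just examples:)

    # create a two-slot RAID1 with the second member deliberately missing
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 missing

    # ...later, when the machine goes to production:
    mdadm --manage /dev/md0 --add /dev/sdb2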
Unless you're pulling some complex black magic, you're not running degraded, you're running both in single device mode (which is not the same as a degraded two device RAID1 array) and converting to two device RAID1 later, which is a perfectly normal use case I have absolutely no issues with.
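(For reference, the BTRFS side of "attaching the second disk later" is roughly this - device names and mount point are just examples:)

    btrfs device add /dev/sdb2 /mnt
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt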

However, sometimes the systems end up without the second disk attached -
either due to their low importance, sometimes power usage; others
need to be quiet.

One might ask why I don't attach the second disk before the initial system
creation - the answer is simple: I usually use the same drive models in
RAID1, and it happens that drives bought from the same production lot
fail simultaneously, so this approach mitigates the problem and gives
more time to react.
You appear to be misunderstanding me here. I'm not saying I think running with a single disk is bad, I'm saying that I feel that running with a single disk and not telling the storage stack that the other one isn't coming back any time soon is bad.

IOW, if I lose a disk in a two device BTRFS volume set up for replication, I'll mount it degraded, convert it from the raid1 profile to the single profile, and then remove the missing disk from the volume. Similarly, for a 2 device LVM RAID1 LV, I would use lvconvert to get a regular linear LV. Going through the multi-device code in BTRFS or the DM-RAID code in LVM when you've only got one actual device is a waste of processing power, and adds another layer where things can go wrong.
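Off the top of my head, the BTRFS side of that looks roughly like this (device name and mount point are just examples):

    mount -o degraded /dev/sda3 /mnt
    # -f is needed because the metadata conversion reduces redundancy
    btrfs balance start -f -dconvert=single -mconvert=single /mnt
    btrfs device remove missing /mnt

and the LVM side is essentially just:

    # drop the RAID1 LV down to a plain linear LV (names are examples)
    lvconvert -m0 vg00/data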

Patches would be gratefully accepted.  It's really not hard to update
the documentation, it's just that nobody has had the time to do it.

Writing accurate documentation requires a deep understanding of the internals.
Me, for example - I know some of the results: "don't do this", "if X happens, Y
should be done", "Z doesn't work yet, but there were some patches", "V
was fixed in some recent kernel, but no idea which commit it was
exactly", "W was severely broken in kernel I.J.K" etc. Not the hard data
that could be posted without creating the impression that it's all
about compiling a list of complaints. Not to mention I'm absolutely not familiar
with current patches, WIP and many many other corner cases or usage
scenarios. In fact, not only the internals, but the motivation and design
principles must be well understood to write a piece of documentation.
Writing up something like that is near useless. It would only be valid for upstream kernels (and if you're using upstream kernels and following the advice of keeping up to date, what does it matter anyway? The moment a new btrfs-progs gets released, you're already going to be on a kernel that fixes the issues it reports.), because distros do whatever the hell they want with version numbers (RHEL for example is notorious for using _ancient_ version numbers but having bunches of stuff back-ported, and most other big distros that aren't Arch, Gentoo, or Slackware derived do so too to a lesser degree), and it would require constant curation to keep up to date. It only makes sense for long-term known issues, but those absolutely should be documented in the regular documentation, and doing that really isn't that hard if you just go for current issues.

Otherwise some "fake news" propaganda gets created, just like
https://suckless.org/sucks/systemd or other systemd-haters who haven't
spent a day of their life writing SysV init scripts or managing a
bunch of mission-critical machines with handcrafted supervisors.
I hate to tell you that:
1. This type of thing happens regardless. Systemd has just garnered a lot of hatred because it redesigned everything from the ground up and was then functionally forced on most of the Linux community.
2. There are quite a few of us who dislike systemd who have had to handle actual systems administration before (and quite a few such individuals are primarily complaining about other aspects of systemd, like the journal crap or how it handles manually mounted filesystems for which mount units exist (namely, if it thinks the underlying device isn't ready, it will unmount them immediately, even if the user just manually mounted them), not the service files replacing init scripts).