Dark Penguin posted on Sat, 16 Dec 2017 22:50:33 +0300 as excerpted:

> Could someone please point me towards some read about how btrfs handles
> multiple devices? Namely, kicking faulty devices and re-adding them.
> 
> I've been using btrfs on single devices for a while, but now I want to
> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
> tried to see how it handles various situations. The experience left
> me very surprised; I've tried a number of things, all of which produced
> unexpected results.
> 
> I create a btrfs raid1 filesystem on two hard drives and mount it.
> 
> - When I pull one of the drives out (simulating a simple cable failure,
> which happens pretty often to me), the filesystem sometimes goes
> read-only. ???
> - But only after a while, and not always. ???
> - When I fix the cable problem (plug the device back), it's immediately
> "re-added" back. But I see no replication of the data I've written onto
> a degraded filesystem... Nothing shows any problems, so "my filesystem
> must be ok". ???
> - If I unmount the filesystem and then mount it back, I see all my
> recent changes lost (everything I wrote during the "degraded" period).
> - If I continue working with a degraded raid1 filesystem (even without
> damaging it further by re-adding the faulty device), after a while it
> won't mount at all, even with "-o degraded".
> 
> I can't wrap my head around all this. Either the kicked device should not
> be re-added, or it should be re-added "properly", or it should at least
> show some errors and not pretend nothing happened, right?..
> 
> I must be missing something. Is there an explanation somewhere about
> what's really going on during those situations? Also, do I understand
> correctly that upon detecting a faulty device (a write error), nothing
> is done about it except logging an error into the 'btrfs device stats'
> report? No device kicking, no notification?.. And what about degraded
> filesystems - is it absolutely forbidden to work with them without
> converting them to a "single" filesystem first?..
> 
> On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1 .

Btrfs device handling at this point is still "development level" and very 
rough, but there's a patch set in active review ATM that should improve 
things dramatically, perhaps as soon as 4.16 (4.15 is already well on the 
way).

Basically, at this point btrfs doesn't have "dynamic" device handling.  
That is, if a device disappears, btrfs doesn't notice.  It just keeps 
attempting to write to (and read from, but the reads are redirected) the 
missing device until things go bad enough that it kicks the filesystem 
to read-only for safety.
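
You can actually watch those failed writes pile up in the error 
counters.  Something like this, where /dev/sdb is the device that 
dropped out (device names, mountpoint and numbers purely illustrative):

  $ btrfs device stats /mnt
  [/dev/sdb].write_io_errs   1842
  [/dev/sdb].read_io_errs    12
  [/dev/sdb].flush_io_errs   97
  [/dev/sdb].corruption_errs 0
  [/dev/sdb].generation_errs 0
  [/dev/sda].write_io_errs   0
  [/dev/sda].read_io_errs    0
  [/dev/sda].flush_io_errs   0
  [/dev/sda].corruption_errs 0
  [/dev/sda].generation_errs 0

Those counters are persistent (reset them with -z), so they're worth 
checking periodically even when nothing seems wrong.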

If a device is added back, the kernel normally shuffles device names and 
assigns a new one.  Btrfs will see it and list the new device, but it's 
still trying to use the old one internally.  =:^(

Thus, if a device disappears, to get it back you really have to reboot, 
or at least unload/reload the btrfs kernel module, in order to clear the 
stale device state and have btrfs rescan and reassociate devices with 
the matching filesystems.
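
Something like this, after unmounting the filesystem (paths are just 
examples, and the module unload of course only works if no other btrfs 
filesystem is mounted or otherwise in use):

  umount /mnt
  modprobe -r btrfs       # drops the stale in-kernel device state
  modprobe btrfs
  btrfs device scan       # rescan and reassociate devices
  mount /dev/sda /mnt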

Meanwhile, once a device has gone stale -- that is, the other devices in 
the filesystem carry data that should also have been written to it, but 
couldn't be while it was gone -- then as soon as the module unload/reload 
or reboot cycle is done and btrfs picks the device up again, you should 
immediately do a btrfs scrub, which will detect and "catch up" the 
differences.
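
For example (mountpoint illustrative):

  btrfs scrub start -Bd /mnt   # -B stays in the foreground, -d per-device stats
  btrfs scrub status /mnt      # progress/results, if you background it instead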

Btrfs tracks atomic filesystem updates via a monotonically increasing 
generation number, aka transaction-id (transid).  When a device goes 
offline, its generation number of course gets stuck at the point it went 
offline, while the other devices continue to update their generation 
numbers.
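
If you're curious, each device's superblock records the last generation 
it committed, so you can compare them directly (device paths are just 
examples):

  btrfs inspect-internal dump-super /dev/sda | grep '^generation'
  btrfs inspect-internal dump-super /dev/sdb | grep '^generation'

The device reporting the lower generation is the stale one.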

When a stale device is re-added, btrfs should automatically find and use 
the copy with the latest generation, but the stale device isn't 
automatically caught up -- a scrub, as above, is the mechanism by which 
you do that.

One thing you do **NOT** want to do is degraded-writable mount one 
device, then the other device, of a raid1 pair, because that'll diverge 
the two with new data on each, and that's no longer simple to correct.  
If you /have/ to degraded-writable mount a raid1, always make sure it's 
the same device you mount writable if you want to recombine them later.  
If you /do/ need to recombine two diverged raid1 devices, the only safe 
way to do so is to wipe one of them, so btrfs has only a single copy of 
the data to go on, and then add the wiped device back as a new device.
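
A rough sketch of that, assuming /dev/sdb holds the copy being 
sacrificed (device names and mountpoint illustrative):

  wipefs -a /dev/sdb                  # destroy the stale btrfs signature
  mount -o degraded /dev/sda /mnt     # mount the surviving copy
  btrfs device add /dev/sdb /mnt      # add the wiped disk back as a new device
  btrfs device delete missing /mnt    # drop the old, now-absent member entry

Follow that with the balance-convert discussed further down, plus a 
scrub, to get back to two fully synced raid1 copies.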

Meanwhile, until /very/ recently... 4.13 may not be current enough... if 
you mounted a two-device raid1 degraded-writable, btrfs would try to 
write and note that it couldn't do raid1 because there wasn't a second 
device, so it would create single chunks to write into.
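
Those show up quite plainly in the chunk allocation listing.  Something 
like this (output abridged and purely illustrative):

  $ btrfs filesystem df /mnt
  Data, RAID1: total=10.00GiB, used=9.11GiB
  Data, single: total=1.00GiB, used=612.00MiB
  Metadata, RAID1: total=1.00GiB, used=456.08MiB
  Metadata, single: total=256.00MiB, used=35.14MiB
  System, RAID1: total=32.00MiB, used=16.00KiB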

And the older filesystem safe-mount mechanism would see those single 
chunks on a raid1 and decide it wasn't safe to mount the filesystem 
writable at all after that, even if all the single chunks were actually 
present on the remaining device.

The effect was that if a device died, you had exactly one degraded-
writable mount in which to complete the replace.  If you didn't finish 
it in that single chance, the filesystem would refuse to mount writable 
again, and repair became impossible, since repair requires a writable 
mount!  Fortunately the filesystem could still be mounted degraded-
readonly (unless there was some other problem), allowing people to at 
least get at the data read-only and copy it elsewhere.
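
For reference, the replace in that one writable chance looks roughly 
like this; devid 2 and the device names are illustrative, and "btrfs 
filesystem show" tells you which devid is the missing one:

  mount -o degraded /dev/sda /mnt
  btrfs replace start -B 2 /dev/sdc /mnt   # 2 = devid of the missing device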

With a new enough btrfs, while btrfs will still create those single 
chunks on a degraded-writable mount of a raid1, it's at least smart 
enough to do per-chunk checks to see if they're all available on existing 
devices (none only on the missing device), and will continue to allow 
degraded-writable mounting if so.

But once the filesystem is back to multi-device (with writable space on 
at least two devices), a balance-convert of those single chunks to raid1 
should be done, because otherwise, if the device holding those single 
chunks goes, that data goes with it.
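
Something along these lines (mountpoint illustrative; the "soft" filter 
only rewrites chunks that aren't already in the target profile, so it's 
much cheaper than a full balance):

  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
  btrfs filesystem df /mnt    # verify the single chunks are gone afterwards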

And there's work on allowing it to do single-copy (thus incomplete-
raid1) chunk writes as well.  This should prevent the single-mode chunks 
entirely, thus eliminating the need for the balance-convert, tho a scrub 
would still be needed to fully sync back up.  But I'm not sure what the 
status is on that.

Meanwhile, as mentioned above, there's active work on proper dynamic 
btrfs device tracking and management.  It may or may not be ready for 
4.16, but once it goes in, btrfs should properly detect a device going 
away and react accordingly, and it should detect a device coming back as 
a different device too.  As I write this it occurs to me that I've not 
read closely enough to know if it actually initiates scrub/resync on its 
own in the current patch set, but that's obviously an eventual goal if 
not.

Longer term, there are further patches that will provide hot-spare 
functionality, automatically bringing in a device pre-configured as a 
hot-spare when a device disappears.  That of course requires that btrfs 
properly recognize devices disappearing and coming back first, so one 
thing at a time.  Tho as originally presented, that hot-spare 
functionality was a bit limited -- it was a global hot-spare list, and 
with multiple btrfs filesystems of different sizes and multiple hot-spare 
devices also of different sizes, it would always just hand the first 
spare on the list to the first filesystem needing one, regardless of 
whether the size was appropriate for that filesystem or not.  By the 
time the feature actually gets merged it may have changed some, and 
regardless, it should eventually get less limited, but that's 
_eventually_, with a target time likely still years out, so don't hold 
your breath.


I think that answers most of your questions.  Basically, you have to be 
quite careful with btrfs raid1 today, as btrfs simply doesn't have the 
automated functionality to handle it yet.  It's still possible to do two-
device-only raid1 and replace a failed device when you're down to one, 
but it's not as easy or automated as more mature raid options such as 
mdraid, and you do have to keep on top of it as a result.  But it can and 
does work reasonably well for those (like me) who use btrfs raid1 as 
their "daily driver", as long as you /do/ keep on top of it... and don't 
try to use raid1 as a replacement for real backups, because it's *not* a 
backup! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
