On 2017-12-16 14:50, Dark Penguin wrote:
> Could someone please point me towards some reading about how btrfs handles
> multiple devices? Namely, kicking faulty devices and re-adding them.
> I've been using btrfs on single devices for a while, but now I want to
> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
> tried to see how it handles various situations. The experience left
> me very surprised; I've tried a number of things, all of which produced
> unexpected results.
Expounding a bit on Duncan's answer with some more specific info.
> I create a btrfs raid1 filesystem on two hard drives and mount it.
> - When I pull one of the drives out (simulating a simple cable failure,
> which happens pretty often to me), the filesystem sometimes goes
> read-only. ???
> - But only after a while, and not always. ???
The filesystem won't go read-only until it hits an I/O error, and on an
idle filesystem that only sees read access it's non-deterministic how
long that will take (if all the files being read are already in the page
cache, reads never touch the device, so nothing triggers the error).
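If you want to see when that transition actually happens, watching the
error counters and the kernel log is enough; a quick sketch (the mount
point is just an example):

    btrfs device stats /mnt       # per-device error counters
    dmesg | grep -i btrfs         # the I/O errors and the forced read-only switch should show up here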
> - When I fix the cable problem (plug the device back), it's immediately
> "re-added" back. But I see no replication of the data I've written onto
> a degraded filesystem... Nothing shows any problems, so "my filesystem
> must be ok". ???
One of two things happens in this case, and why there is no re-sync
depends on which one, but both ultimately come down to the fact that
BTRFS assumes I/O errors are at worst transient, not evidence that the
device has failed. Either:
1. The device reappears with the same name. This happens if the time it
was disconnected is less than the kernel's command timeout (30 seconds
by default). BTRFS may not even notice that the device was gone (and if
it doesn't, then a re-sync isn't necessary, since it will retry all the
writes it needs to). If it does notice, it assumes the I/O errors were
temporary and keeps using the device after logging the errors, in which
case you need to manually re-sync things by scrubbing the filesystem
(or balancing, but scrubbing is preferred as it runs faster and will
only re-write what is actually needed). See the commands sketched after
this list.
2. The device reappears with a different name. In this case, the device
was gone long enough that the block layer is certain it was
disconnected, and thus when it reappears and BTRFS still holds open
references to the old device node, it gets a new device node. In this
case, if the 'new' device is scanned, BTRFS will recognize it as part of
the FS, but will keep using the old device node. The correct fix here
is to unmount the filesystem, re-scan all devices, and then remount the
filesystem and manually re-sync with a scrub.
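To make that concrete, here is a rough sketch of both recovery paths
(the mount point and device name are just examples):

    # Case 1: device came back under the same name; just re-sync.
    btrfs scrub start -B /mnt          # -B runs the scrub in the foreground

    # Case 2: device came back under a new name; cycle the mount first.
    umount /mnt
    btrfs device scan                  # pick up the new device node
    mount /dev/sdb1 /mnt
    btrfs scrub start -B /mnt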
> - If I unmount the filesystem and then mount it back, I see all my
> recent changes lost (everything I wrote during the "degraded" period).
I'm not quite sure about this, but I think BTRFS is rolling back to the
last common generation number for some reason.
> - If I continue working with a degraded raid1 filesystem (even without
> damaging it further by re-adding the faulty device), after a while it
> won't mount at all, even with "-o degraded".
This is (probably) a known bug relating to chunk handling. In a two
device volume using a raid1 profile with a missing device, older kernels
(I don't remember when the fix went in, but I could have sworn it was in
4.13) will (erroneously) generate single-profile chunks when they need
to allocate new chunks. When you then go to mount the filesystem, the
check for whether the FS can be mounted degraded fails, because
single-profile chunks are present and they cannot tolerate the missing
device.
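If you run into this, it's easy to check whether stray single chunks
are the culprit once you can get the filesystem mounted writable again;
a rough sketch (the mount point is an example):

    btrfs filesystem df /mnt
    # look for 'Data, single' / 'Metadata, single' lines next to the RAID1
    # ones, then convert them back; 'soft' skips chunks that already have
    # the target profile
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt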
Now, even without that bug, it's never a good idea to run a storage
array degraded for any extended period of time, regardless of what type
of array it is (BTRFS, ZFS, MD, LVM, or even hardware RAID). By keeping
it in 'degraded' mode, you're essentially telling the system that the
array will be fixed in a reasonably short time-frame, which impacts how
it handles the array. If you're not going to fix it almost immediately,
you should almost always reshape the array to account for the missing
device if at all possible, as that will improve relative data safety and
generally get you better performance than running degraded will.
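As an illustration of what 'reshaping' means for a two-device raid1
with one member permanently gone (the device name and mount point are
examples, and this assumes no replacement disk is available; if you do
have one, `btrfs replace start` is usually the better route):

    mount -o degraded /dev/sda1 /mnt
    # convert to profiles a single device can satisfy, then drop the dead member
    btrfs balance start -dconvert=single -mconvert=dup /mnt
    btrfs device remove missing /mnt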
> I can't wrap my head around all this. Either the kicked device should not
> be re-added, or it should be re-added "properly", or it should at least
> show some errors and not pretend nothing happened, right?..
BTRFS is not the best at error reporting at the moment. If you check
the output of `btrfs device stats` for that filesystem though, it should
show non-zero values in the error counters (note that these counters
are cumulative, so they are counts since the last time they were reset,
or since the FS was created if they have never been reset). Similarly,
scrub should report errors, there should be error messages in the kernel
log, and switching the FS to read-only mode _is_ technically reporting
an error, as that's standard error behavior for most sensible
filesystems (ext[234] being the notable exceptions; they just continue
as if nothing happened).
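For reference, reading and resetting those counters looks like this
(the mount point is an example):

    btrfs device stats /mnt       # per-device write_io_errs, read_io_errs, flush_io_errs, corruption_errs, generation_errs
    btrfs device stats -z /mnt    # print the counters and then reset them to zero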
> I must be missing something. Is there an explanation somewhere about
> what's really going on during those situations? Also, do I understand
> correctly that upon detecting a faulty device (a write error), nothing
> is done about it except logging an error into the 'btrfs device stats'
> report? No device kicking, no notification?.. And what about degraded
> filesystems - is it absolutely forbidden to work with them without
> converting them to a "single" filesystem first?..
As mentioned above, going read-only _is_ a notification that something
is wrong. Translating that (and the error counter increase, and the
kernel log messages) into a user visible notification is not really the
job of BTRFS, especially considering that no other filesystem or device
manager does so either (yes, you can get nice notifications from LVM,
but they aren't _from_ LVM itself, they're from other software that
watches for errors, and the same type of software works just fine for
BTRFS too). If you're this worried about it and don't want to keep on
top of it yourself by monitoring things manually, you really need to
look into a tool like monit [1] that can handle this for you.
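For the monit route, a minimal sketch (the script path, program name,
and mount point are all made up for illustration) is a small wrapper
script plus a `check program` stanza:

    #!/bin/sh
    # hypothetical helper, e.g. saved as /usr/local/bin/check-btrfs-errors.sh
    # exit non-zero if any error counter on /mnt is non-zero
    btrfs device stats /mnt | awk '$2 != 0 { bad = 1 } END { exit bad }'

and in monitrc:

    check program btrfs-errors with path "/usr/local/bin/check-btrfs-errors.sh"
        if status != 0 then alert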
[1] https://mmonit.com/monit/