On 2016-04-05 14:36, Yauhen Kharuzhy wrote:
2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn <ahferro...@gmail.com>:
On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
Hello,
I am trying to understand the btrfs logic for mounting a multi-device filesystem
when the device generations differ. All of my questions concern the case of
RAID5/6 for system, metadata, and data.
The kernel can currently mount an FS whose devices have different generations
(for example, if a drive was physically removed before the last unmount and
reattached afterwards), but scrub then reports uncorrectable errors (although
a second run shows no errors). Does any documentation exist on the algorithm
for handling multiple devices in such a case? Is the case of different device
generations allowed at all, and what are the worst cases here?
In general it isn't allowed, but we don't explicitly disallow it either.
The worst case here is that both devices get written to separately, and
you end up with data that doesn't match for matching generation IDs. The
second scrub in this case shows no errors because the first one corrects
them (even though they are reported as uncorrectable, which is a bug as far
as I can tell); from what I can tell from reading the code, it does this
by simply picking the highest generation ID and dropping the data from the
lower generation.
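
Just to illustrate the idea (this is only a conceptual sketch in Python, not
the actual kernel code; the "copies" structure here is made up):

# Conceptual sketch only -- not the real btrfs repair path.
# Each "copy" stands in for the same block read from a different device.
def pick_good_copy(copies):
    """Keep the copy written last (highest generation) and treat the
    stale copies as candidates for being rewritten by scrub."""
    best = max(copies, key=lambda c: c["generation"])
    stale = [c for c in copies if c["generation"] < best["generation"]]
    return best, stale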
Hmm... sounds reasonable, but how do we detect whether the filesystem should
be checked by scrub after mounting? As far as I understand, the only way is
to check the kernel logs for btrfs errors after mount, and that is not a
good approach for any kind of automatic management.
There really isn't any way that I know of. Personally, I just scrub all
my filesystems shortly after mount, but I also have pretty small
filesystems (the biggest are 64G) on relatively fast storage. In
theory, it might be possible to parse the filesystem before mounting and
check the device generation numbers, but that may be just as expensive
as scrubbing the filesystem (and you really should be scrubbing
somewhat regularly anyway).
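
For the narrow case of only comparing the superblock generation fields,
something along these lines could do that check from userspace before
mounting. It's an untested sketch; it assumes a btrfs-progs that provides
"btrfs inspect-internal dump-super" (older versions ship the same output via
btrfs-show-super), and the devices are passed on the command line:

#!/usr/bin/env python3
# Untested sketch: compare the superblock generation of each device of a
# multi-device btrfs filesystem before mounting it.
import re
import subprocess
import sys

def superblock_generation(device):
    out = subprocess.check_output(
        ["btrfs", "inspect-internal", "dump-super", device],
        universal_newlines=True)
    # Match the plain "generation" line, not chunk_root_generation etc.
    m = re.search(r"^generation\s+(\d+)", out, re.MULTILINE)
    if not m:
        raise RuntimeError("no generation field found for " + device)
    return int(m.group(1))

def main(devices):
    gens = {dev: superblock_generation(dev) for dev in devices}
    for dev, gen in sorted(gens.items()):
        print("%s: generation %d" % (dev, gen))
    if len(set(gens.values())) > 1:
        print("WARNING: generations differ; scrub after mounting")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))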
What should happen if a device is removed and comes back some time later
while the filesystem is online? Should some kind of device reopening be
possible, or is the only way to guarantee FS consistency to mark such a
device as missing and replace it?
In this case, the device being removed (or some component between the device
and the processor failing, or the device itself erroneously reporting
failure) will force the FS read-only. If the device reappears while the FS
is still online, one of three things happens: it may just start working
again (this is _really_ rare; it requires the device to come back with the
same device node it had before, which usually only happens when it
disappears for a very short time); it may not work until the FS gets
remounted (this is usually the case); or the system may crash (thankfully
this almost never happens, and when it does it's usually not because of
BTRFS). Regardless of which happens, you may still have to run a scrub to
make sure everything is consistent.
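
If you want to automate noticing this kind of event on a mounted filesystem,
the only signal I know of is still the kernel log. Something along these
lines could watch for it; this is untested, and the message patterns are
just what current kernels happen to print, so they may change between
releases:

#!/usr/bin/env python3
# Untested sketch: follow the kernel log and flag btrfs messages that
# probably warrant a scrub (or at least a closer look).  The patterns
# below reflect what current kernels print and may change.
import re
import subprocess

PATTERNS = [
    re.compile(r"BTRFS error \(device \S+\)"),
    re.compile(r"BTRFS.*forced readonly"),
]

def watch():
    # "journalctl -k -f -n 0" follows new kernel messages; "dmesg -w"
    # would also work on recent util-linux.
    proc = subprocess.Popen(["journalctl", "-k", "-f", "-n", "0"],
                            stdout=subprocess.PIPE,
                            universal_newlines=True)
    for line in proc.stdout:
        if any(p.search(line) for p in PATTERNS):
            print("btrfs problem seen, consider a scrub: " + line.strip())

if __name__ == "__main__":
    watch()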
So, if we see the device reconnected as a new block device, the right way is
to reject it and not include it in the device list again, am I right? The
existing code tries to 'reconnect' it under the new device name, but this
works completely wrong for a mounted FS (the btrfs device is only renamed;
no real device reopening is performed), and I intend to propose a patch
based on Anand's 'global spare' patch series to handle this properly.
In an ideal situation, you have nothing using the FS and can unmount,
run a device scan, and then remount. In most cases this won't work, and
being able to re-add the device via a hot-spare type setup (or even just
running device replace on it, which I've done myself when dealing
with filesystems on USB devices, and it works well) would be useful.
Ideally, we should have the option to auto-detect such a situation and
handle it, but that _really_ needs to be optional (there are just too
many things that could go wrong).
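
To give a rough picture of the replace route, here is an untested sketch
with made-up device names and mount point: "btrfs replace start" does the
actual rebuild onto the node the disk reappeared as, and -f should be needed
because that disk still carries the stale signature of this very filesystem:

#!/usr/bin/env python3
# Untested sketch: rebuild onto a disk that dropped out and came back as
# a different block device.  devid is the btrfs devid of the missing/stale
# device, new_dev is the node it reappeared as, mnt is the mount point.
import subprocess
import sys

def replace_reappeared(devid, new_dev, mnt):
    # -f overwrites the existing (stale) filesystem signature on the target.
    subprocess.check_call(
        ["btrfs", "replace", "start", "-f", str(devid), new_dev, mnt])
    # Without -1, this keeps printing progress until the replace finishes.
    subprocess.check_call(["btrfs", "replace", "status", mnt])

if __name__ == "__main__":
    replace_reappeared(sys.argv[1], sys.argv[2], sys.argv[3])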