Hi,

Some time ago I noticed that our device replace implementation is
unreliable. Under several scenarios it ends up not copying extents
(both data and metadata) from the old device to the new device (I
briefly talked about some of the problems with Josef at LSF).

Essentially it operates by iterating all the device extents of the old
device using the device tree's commit root and, for each device extent
found, doing a direct copy of it into the new device (it does not use
the regular write paths). While it's doing this no extents can be
allocated from the new device; that only becomes possible once it
finishes copying all device extents and replaces the old device with
the new device in the list of devices available for allocations.

There are 4 main flaws with the implementation:

1) It processes each device extent only once. After it processes
(copies) a device extent, future data and metadata writes can still
allocate extents from the corresponding block group, and those new
extents are never copied into the new device.

2) Before it copies a device extent, it sets the corresponding block
group to RO mode (as of commit [1], which sadly does not really fix an
issue nor does it explain how the problem it claims to fix happens) to
prevent new allocations from allocating extents from the block group
while we are processing it, and once we finish processing it, it sets
the block group back to RW mode. This is flawed too: when it sets the
block group to RO mode, some concurrent task might have just allocated
an extent from the block group and end up writing to it while or after
we process the device extent.

3) Since it iterates over the device tree in ascending order, it never
considers the possibility that new device extents, with an offset
lower than the current search offset, are allocated for new block
groups (just take a look at the while loop in
scrub_enumerate_chunks() to see how flawed it is). So it totally
ignores that device holes can be reused while the replace is running,
and never looks back to check for any such device extent and copy it
into the new device. This is much more likely to happen nowadays,
since empty block groups are automatically removed by the cleaner
kthread, while in the past this could only happen through balance
operations triggered from userspace.

4) When we finish iterating "all" device extents, we call
btrfs_dev_replace_finishing(), which flushes all delalloc, commits the
current transaction, replaces the old device with the new device in
all extent mappings, removes the old device from the list of devices
available for allocation and adds the new device to that list. When
flushing delalloc, more extents are allocated from existing block
groups and new block groups might be created, with all the
corresponding bios writing to the old device and not the new one, and
after this flushing we never go back and copy these new or changed
extents to the new device.

These are the 4 major issues I see with device replace. Issue number 2
is trivial to solve but it's pointless to do so without addressing the
other issues.

Solving this isn't easy, at least not without some trade-offs. We
could keep in-memory structures to track which new block groups are
allocated while the replace is running and which new extents are
allocated from existing block groups. Besides the synchronization
needed between the allocator and the device replace code (which would
bring yet more complexity), this can imply using a lot of memory if
there's a significant number of writes happening all the time, and we
could end up in an endless copy loop. Maybe some sort of write
throttling would work. There's also a bit of a chicken-and-egg
problem: while the replace operation is ongoing we update a progress
item in the devices tree (btrfs_run_dev_replace()), which in turn
results in allocating new metadata extents.

Anyway, I'm just hoping someone tells me that I'm completely wrong and
misunderstood the current implementation we have. If not, and given
how serious this is (affecting both data and metadata, and defeating
the main purpose of raid), shouldn't we disable this feature, like we
did for snapshot aware defrag a couple years ago, until we have these
issues solved?

thanks


[1] - 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=55e3a601c81cdca4497bf855fa4d331f8e830744