On Wed, May 11, 2016 at 10:29 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
> Filipe Manana wrote on 2016/05/11 10:09 +0100:
>>
>> Hi,
>>
>> I've noticed some time ago that our device replace implementation is
>> unreliable. Basically under several situations it ends up not copying
>> extents (both data and metadata) from the old device to the new device
>> (I briefly talked about some of the problems with Josef at LSF).
>>
>> Essentially it operates by iterating over all the extents of the old
>> device using the device tree's commit root, and for each device extent
>> found, does a direct copy of it into the new device (it does not use
>> the regular write paths). While it's doing this no extents can be
>> allocated from the new device; this only becomes possible once it
>> finishes copying all device extents and replaces the old device with
>> the new device in the list of devices available for allocations.
>>
>> There are 4 main flaws with the implementation:
>>
>> 1) It processes each device extent only once. Clearly after it
>> processes (copies) a device extent it's possible for future data and
>> metadata writes to allocate extents from the corresponding block
>> group, so any new extents are not copied into the new device.
>
> I'm not quite familiar with dev replace, but IIRC the current dev
> replace will commit the current transaction and mark that block group
> readonly to prevent any incoming write.
>
> So that's to say, the readonly setup and commit_transaction() have a
> race window that allows new writes into the RO block group.
>
> Or did I miss something?
You misunderstood my statement. What I said is that before we process a
device extent, we set the block group to RO, and once we finish
processing it, we turn it back to RW. After that we never process that
device extent/block group again, which is a problem because new extents
might have been allocated from it in the meantime.

>>
>> 2) Before it copies a device extent, it sets the corresponding block
>> group to RO mode (as of commit [1], which sadly does not really fix an
>> issue nor does it explain how the problem it claims to fix happens) to
>> prevent new allocations from allocating extents from the block group
>> while we are processing it, and once we finish processing it, it sets
>> the block group back to RW mode. This is flawed too, as when it sets
>> the block group to RO mode, some concurrent task might have just
>> allocated an extent from the block group and ends up writing to it
>> while or after we process the device extent.
>
> So the problem is the timing of commit_transaction() and setting the
> block group RO.
>
> Or is there any possibility to allocate an extent that still escapes
> the control of commit_transaction()?

The transaction commit is irrelevant here. That's not the problem. See
the answer above.

>
>> 3) Since it iterates over the device tree in an ascending fashion, it
>> never considers the possibility of new device extents with an offset
>> lower than the current search offset being allocated for new block
>> groups (just take a look at the while loop at
>> scrub_enumerate_chunks() to see how flawed it is). So it totally
>> ignores that device holes can be used while we are doing the replace
>> and never looks back to check for any and copy the respective device
>> extents into the new device. This is much more likely to happen
>> nowadays since empty block groups are automatically removed by the
>> cleaner kthread, while in the past this could only happen through
>> balance operations triggered from userspace.
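To make the ascending-search problem in (3) concrete, here is a small Python model. This is not the kernel code; the function name, the dict-based device tree, and the callback are all invented for illustration. It just mimics a walk that searches for the next device extent at or after the current offset and never revisits lower offsets:

```python
# Toy model of an ascending device-extent walk (NOT the real
# scrub_enumerate_chunks(); names and structures are invented).
# dev_extents maps device offset -> extent length, and may grow while
# we iterate, just like new block groups appearing during a replace.

def walk_device_extents(dev_extents, copy):
    search_offset = 0
    while True:
        # find the first extent at or after the current search offset
        later = sorted(off for off in dev_extents if off >= search_offset)
        if not later:
            break
        off = later[0]
        copy(off)
        # advance past this extent and never look back -- this models flaw 3
        search_offset = off + dev_extents[off]

copied = []
extents = {100: 50, 200: 50}  # two device extents, hole at offset 0

def copy(off):
    copied.append(off)
    if off == 100:
        # a new block group gets allocated in the hole below us
        extents[0] = 50

walk_device_extents(extents, copy)
# the extent at offset 0 appeared after the walk had passed it, so it
# is never copied: copied == [100, 200], yet extents now has 3 entries
```

Under this model, no matter how many extents appear below the search offset, the walk terminates without ever copying them, which is exactly the hole-reuse scenario described above.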
>
> It seems that the problem is still a result of problems 1 and 2.
> Still the commit_trans() and set-block-group-RO problem.

No, this is different from problems 1 and 2. This is about space that
was previously not allocated becoming allocated to a new block group,
and the loop at scrub_enumerate_chunks() totally ignoring that
possibility. The loop keeps searching for extents using a key whose
offset is incremented to "offset of current extent + length of current
extent" and never goes back.

>
> Just as stated, I'm still not quite familiar with dev replace, so it
> would be quite nice if you could point out any misunderstanding on my
> part.

Well, I gave a high level overview of the problems in the algorithm.
Now you'll have to read it yourself and confirm.

thanks

>
> Thanks,
> Qu
>
>>
>> 4) When we finish iterating "all" device extents, we call
>> btrfs_dev_replace_finishing(), which flushes all delalloc, commits the
>> current transaction, replaces the old device in all extent mappings
>> with the new device, removes the old device from the list of available
>> devices for allocation and adds the new device to that list. So when
>> flushing delalloc more extents are allocated from existing block
>> groups and new block groups might be allocated, with all the
>> corresponding bios writing to the old device and not the new one, and
>> after this flushing we surely don't copy all new or changed extents to
>> the new device.
>>
>> These are the 4 major issues I see with device replace. Issue number 2
>> is trivial to solve, but it's pointless to do so without addressing
>> the other issues.
>>
>> Solving this isn't easy, not without some trade-offs at least. We can
>> keep in-memory structures to track which new block groups are
>> allocated while we are doing the replace and which new extents are
>> allocated from existing block groups.
>> Besides the needed
>> synchronization between the allocator and the device replace code
>> (which would bring yet more complexity), this can imply using a lot
>> of memory if there's a significant number of writes happening all the
>> time, and we could end up in an endless copy loop. Maybe some sort of
>> write throttling would work out. There's also a bit of a
>> chicken-and-egg problem, as while the replace operation is ongoing we
>> update a progress item in the devices tree (btrfs_run_dev_replace())
>> which in turn results in allocating new metadata extents.
>>
>> Anyway, I'm just hoping someone tells me that I'm completely wrong
>> and misunderstood the current implementation we have. If not, and
>> given how serious this is (affecting both data and metadata, and
>> defeating the main purpose of raid), shouldn't we disable this
>> feature, like we did for snapshot-aware defrag a couple years ago,
>> until we have these issues solved?
>>
>> thanks
>>
>>
>> [1] -
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=55e3a601c81cdca4497bf855fa4d331f8e830744
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
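For the record, flaws 1 and 2 boil down to a copy-once window, which can be sketched with another hedged Python model. Again, this is not the kernel implementation; the dict-based block group, `replace_device`, and `new_dev` are invented names standing in for the real structures:

```python
# Toy model of the copy-once behavior behind flaws 1 and 2 (NOT the
# kernel code). Each block group is set RO, copied a single time, then
# set back to RW; anything written to it afterwards never reaches the
# new device.

def replace_device(block_groups, new_dev):
    for bg in block_groups:
        bg["ro"] = True                     # flaw 2: a racing writer may
                                            # already hold an allocation here
        new_dev[bg["offset"]] = list(bg["extents"])  # one-shot copy
        bg["ro"] = False                    # flaw 1: this block group is
                                            # never visited again

bg = {"offset": 0, "ro": False, "extents": ["A"]}
new_dev = {}
replace_device([bg], new_dev)
bg["extents"].append("B")  # extent allocated after the copy
# new_dev[0] is still ["A"]; extent "B" exists only on the old device
```

Any fix would have to close both ends of that window: either keep the block group RO until the whole replace finishes, or track and re-copy extents allocated after the first pass.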