On Fri, May 20, 2016 at 12:44 AM, <[email protected]> wrote:
> From: Filipe Manana <[email protected]>
>
> When we do a device replace, for each device extent we find from the
> source device, we set the corresponding block group to readonly mode to
> prevent writes into it from happening while we are copying the device
> extent from the source to the target device. However just before we set
> the block group to readonly mode some concurrent task might have already
> allocated an extent from it or decided it could perform a nocow write
> into one of its extents, which can make the device replace process to
> miss copying an extent since it uses the extent tree's commit root to
> search for extents and only once it finishes searching for all extents
> belonging to the block group it does set the left cursor to the logical
> end address of the block group - this is a problem if the respective
> ordered extents finish while we are searching for extents using the
> extent tree's commit root and no transaction commit happens while we
> are iterating the tree, since it's the delayed references created by the
> ordered extents (when they complete) that insert the extent items into
> the extent tree (using the non-commit root of course).
>
> Example:
>
>          CPU 1                                CPU 2
>
>   btrfs_dev_replace_start()
>     btrfs_scrub_dev()
>       scrub_enumerate_chunks()
>         --> finds device extent belonging
>             to block group X
>
>                      <transaction N starts>
>
>                                   starts buffered write
>                                   against some inode
>
>                                   writepages is run against
>                                   that inode forcing delalloc
>                                   to run
>
>                                   btrfs_writepages()
>                                     extent_writepages()
>                                       extent_write_cache_pages()
>                                         __extent_writepage()
>                                           writepage_delalloc()
>                                             run_delalloc_range()
>                                               cow_file_range()
>                                                 btrfs_reserve_extent()
>                                                   --> allocates an extent
>                                                       from block group X
>                                                       (which is not yet
>                                                        in RO mode)
>                                                 btrfs_add_ordered_extent()
>                                                   --> creates ordered extent Y
>                                         flush_epd_write_bio()
>                                           --> bio against the extent from
>                                               block group X is submitted
>
>         btrfs_inc_block_group_ro(bg X)
>           --> sets block group X to readonly
>
>         scrub_chunk(bg X)
>           scrub_stripe(device extent from srcdev)
>             --> keeps searching for extent items
>                 belonging to the block group using
>                 the extent tree's commit root
>             --> it never blocks due to
>                 fs_info->scrub_pause_req as no
>                 one tries to commit transaction N
>             --> copies all extents found from the
>                 source device into the target device
>             --> finishes search loop
>
>                                   bio completes
>
>                                   ordered extent Y completes
>                                   and creates delayed data
>                                   reference which will add an
>                                   extent item to the extent
>                                   tree when run (typically
>                                   at transaction commit time)
>
>                                     --> so the task doing the
>                                         scrub/device replace
>                                         at CPU 1 misses this
>                                         and does not copy this
>                                         extent into the new/target
>                                         device
>
>         btrfs_dec_block_group_ro(bg X)
>           --> turns block group X back to RW mode
>
>         dev_replace->cursor_left is set to the
>         logical end offset of block group X
>
> So fix this by waiting for all cow and nocow writes after setting a block
> group to readonly mode.
>
> Signed-off-by: Filipe Manana <[email protected]>
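
For anyone following along, the ordering that the changelog asks for (set the block group readonly first, then wait for writes that had already started) can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: it is not the patch and not kernel code, and every name in it (struct block_group as laid out here, bg_init, bg_write_begin, bg_write_end, bg_set_ro_and_wait, replace_demo) is hypothetical, with a pthread mutex and condition variable standing in for whatever the real implementation uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of a block group; not the btrfs struct. */
struct block_group {
	pthread_mutex_t lock;
	pthread_cond_t drained;
	bool ro;        /* set by the replace task */
	int inflight;   /* writes that began before ro was set */
	int completed;  /* writes whose extent is now visible to a search */
};

void bg_init(struct block_group *bg)
{
	pthread_mutex_init(&bg->lock, NULL);
	pthread_cond_init(&bg->drained, NULL);
	bg->ro = false;
	bg->inflight = 0;
	bg->completed = 0;
}

/* Writer side: refuse to start a write once the group is readonly. */
bool bg_write_begin(struct block_group *bg)
{
	bool ok;

	pthread_mutex_lock(&bg->lock);
	ok = !bg->ro;
	if (ok)
		bg->inflight++;
	pthread_mutex_unlock(&bg->lock);
	return ok;
}

/* Writer side: the write finished, so its extent becomes visible. */
void bg_write_end(struct block_group *bg)
{
	pthread_mutex_lock(&bg->lock);
	bg->inflight--;
	bg->completed++;
	if (bg->inflight == 0)
		pthread_cond_broadcast(&bg->drained);
	pthread_mutex_unlock(&bg->lock);
}

/*
 * Replace side: set readonly, then wait until every write that started
 * earlier has finished. Returns how many completed writes the copy
 * that follows would see.
 */
int bg_set_ro_and_wait(struct block_group *bg)
{
	int visible;

	pthread_mutex_lock(&bg->lock);
	bg->ro = true;
	while (bg->inflight > 0)
		pthread_cond_wait(&bg->drained, &bg->lock);
	visible = bg->completed;
	pthread_mutex_unlock(&bg->lock);
	return visible;
}

static void *writer_finish(void *arg)
{
	bg_write_end(arg);
	return NULL;
}

/* One write starts before readonly is set and completes concurrently. */
int replace_demo(void)
{
	struct block_group bg;
	pthread_t writer;
	bool started;
	int visible;

	bg_init(&bg);
	started = bg_write_begin(&bg);
	assert(started);
	pthread_create(&writer, NULL, writer_finish, &bg);
	visible = bg_set_ro_and_wait(&bg);
	pthread_join(writer, NULL);
	return visible;
}
```

In this model the write that began before the readonly flag was set is always counted by the time bg_set_ro_and_wait() returns, whichever thread runs first, which is the property the diagram shows to be missing when the copy starts right after the flag is set.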

Reviewed-by: Josef Bacik <[email protected]>

Thanks!

Josef
