On Fri, May 20, 2016 at 12:44 AM,  <[email protected]> wrote:
> From: Filipe Manana <[email protected]>
>
> When we do a device replace, for each device extent we find from the
> source device, we set the corresponding block group to readonly mode to
> prevent writes into it from happening while we are copying the device
> extent from the source to the target device. However just before we set
> the block group to readonly mode some concurrent task might have already
> allocated an extent from it or decided it could perform a nocow write
> into one of its extents, which can make the device replace process to
> miss copying an extent since it uses the extent tree's commit root to
> search for extents and only once it finishes searching for all extents
> belonging to the block group it does set the left cursor to the logical
> end address of the block group - this is a problem if the respective
> ordered extents finish while we are searching for extents using the
> extent tree's commit root and no transaction commit happens while we
> are iterating the tree, since it's the delayed references created by the
> ordered extents (when they complete) that insert the extent items into
> the extent tree (using the non-commit root of course).
> Example:
>
>           CPU 1                                            CPU 2
>
>  btrfs_dev_replace_start()
>    btrfs_scrub_dev()
>      scrub_enumerate_chunks()
>        --> finds device extent belonging
>            to block group X
>
>                                <transaction N starts>
>
>                                                       starts buffered write
>                                                       against some inode
>
>                                                       writepages is run 
> against
>                                                       that inode forcing 
> dellaloc
>                                                       to run
>
>                                                       btrfs_writepages()
>                                                         extent_writepages()
>                                                           
> extent_write_cache_pages()
>                                                             
> __extent_writepage()
>                                                               
> writepage_delalloc()
>                                                                 
> run_delalloc_range()
>                                                                   
> cow_file_range()
>                                                                     
> btrfs_reserve_extent()
>                                                                       --> 
> allocates an extent
>                                                                           
> from block group X
>                                                                           
> (which is not yet
>                                                                            in 
> RO mode)
>                                                                     
> btrfs_add_ordered_extent()
>                                                                       --> 
> creates ordered extent Y
>                                                         flush_epd_write_bio()
>                                                           --> bio against the 
> extent from
>                                                               block group X 
> is submitted
>
>        btrfs_inc_block_group_ro(bg X)
>          --> sets block group X to readonly
>
>        scrub_chunk(bg X)
>          scrub_stripe(device extent from srcdev)
>            --> keeps searching for extent items
>                belonging to the block group using
>                the extent tree's commit root
>            --> it never blocks due to
>                fs_info->scrub_pause_req as no
>                one tries to commit transaction N
>            --> copies all extents found from the
>                source device into the target device
>            --> finishes search loop
>
>                                                         bio completes
>
>                                                         ordered extent Y 
> completes
>                                                         and creates delayed 
> data
>                                                         reference which will 
> add an
>                                                         extent item to the 
> extent
>                                                         tree when run 
> (typically
>                                                         at transaction commit 
> time)
>
>                                                           --> so the task 
> doing the
>                                                               scrub/device 
> replace
>                                                               at CPU 1 misses 
> this
>                                                               and does not 
> copy this
>                                                               extent into the 
> new/target
>                                                               device
>
>        btrfs_dec_block_group_ro(bg X)
>          --> turns block group X back to RW mode
>
>        dev_replace->cursor_left is set to the
>        logical end offset of block group X
>
> So fix this by waiting for all cow and nocow writes after setting a block
> group to readonly mode.
>
> Signed-off-by: Filipe Manana <[email protected]>

Reviewed-by: Josef Bacik <[email protected]>

Thanks!

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to