On Fri, May 20, 2016 at 4:30 PM, Josef Bacik <[email protected]> wrote:
> On Fri, May 20, 2016 at 12:45 AM,  <[email protected]> wrote:
>> From: Filipe Manana <[email protected]>
>>
>> While iterating and copying extents from the source device, the device
>> replace code keeps adjusting a left cursor that is used to make sure that
>> once we finish processing a device extent, any future writes to extents
>> from the corresponding block group will get into both the source and
>> target devices. This left cursor is also used for resuming the device
>> replace operation at mount time.
>>
>> However using this left cursor to decide whether writes go into both
>> devices or only the source device is not enough to guarantee we don't
>> miss copying extents into the target device. There are two cases where
>> the current approach fails. The first one is related to when there are
>> holes in the device and they get allocated for new block groups while
>> the device replace operation is iterating the device extents (more on
>> this explained below). The second one is that when that loop over the
>> device extents finishes, we start dellaloc, wait for all ordered extents
>> and then commit the current transaction, we might have got new block
>> groups allocated that are now using a device extent that has an offset
>> greater then or equals to the value of the left cursor, in which case
>> writes to extents belonging to these new block groups will get issued
>> only to the source device.
>>
>> For the first case where the current approach of using a left cursor
>> fails, consider the source device currently has the following layout:
>>
>>   [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
>>   3Gb             4Gb                         5Gb
>>
>> While we are iterating the device extents from the source device using
>> the commit root of the device tree, the following happens:
>>
>>         CPU 1                                            CPU 2
>>
>>                       <we are at transaction N>
>>
>>   scrub_enumerate_chunks()
>>     --> searches the device tree for
>>         extents belonging to the source
>>         device using the device tree's
>>         commit root
>>     --> 1st iteration finds extent belonging to
>>         block group A
>>
>>         --> sets block group A to RO mode
>>             (btrfs_inc_block_group_ro)
>>
>>         --> sets cursor left to found_key.offset
>>             which is 3Gb
>>
>>         --> scrub_chunk() starts
>>             copies all allocated extents from
>>             block group's A stripe at source
>>             device into target device
>>
>>                                                            
>> btrfs_alloc_chunk()
>>                                                              --> allocates 
>> device extent
>>                                                                  in the 
>> range [4Gb, 5Gb[
>>                                                                  from the 
>> source device for
>>                                                                  a new block 
>> group C
>>
>>                                                            extent allocated 
>> from block
>>                                                            group C for a 
>> direct IO,
>>                                                            buffered write or 
>> btree node/leaf
>>
>>                                                            extent is written 
>> to, perhaps
>>                                                            in response to a 
>> writepages()
>>                                                            call from the VM 
>> or directly
>>                                                            through direct IO
>>
>>                                                            the write is made 
>> only against
>>                                                            the source device 
>> and not against
>>                                                            the target device 
>> because the
>>                                                            extent's offset 
>> is in the interval
>>                                                            [4Gb, 5Gb[ which 
>> is larger then
>>                                                            the value of 
>> cursor_left (3Gb)
>>
>>         --> scrub_chunks() finishes
>>
>>         --> updates left cursor from 3Gb to
>>             4Gb
>>
>>         --> btrfs_dec_block_group_ro() sets
>>             block group A back to RW mode
>>
>>                              <we are still at transaction N>
>>
>>     --> 2nd iteration finds extent belonging to
>>         block group B - it did not find the new
>>         extent in the range [4Gb, 5Gb[ for block
>>         group C because we are using the device
>>         tree's commit root or even because the
>>         block group's items are not all yet
>>         inserted in the respective btrees, that is,
>>         the block group is still attached to some
>>         transaction handle's new_bgs list and
>>         btrfs_create_pending_block_groups() was
>>         not called yet against that transaction
>>         handle, so the device extent items were
>>         not yet inserted into the devices tree
>>
>>                              <we are still at transaction N>
>>
>>         --> so we end not copying anything from the newly
>>             allocated device extent from the source device
>>             to the target device
>>
>> So fix this by making __btrfs_map_block() always redirect writes to the
>> target device as well, independently of the left cursor's value. With
>> this change the left cursor is now used only for the purpose of tracking
>> progress and allow a mount operation to resume a device replace.
>>
>> Signed-off-by: Filipe Manana <[email protected]>
>
> So if we commit the transaction and the left cursor points at the 4gib
> chunk and then we immediately crash, when we start back up we will
> re-copy the new extents onto the target device.  Or we don't even have
> to crash, if we commit the transaction before the 2nd iteration begins
> then we will find the new extent and copy it again.

Indeed, while I thought about that second case and the result of
double copying the same extent, I forgot to mention it in the change
log.
I went for the dead simple solution as I couldn't find any solution
that was also simple (and equally correct, not racy) but avoided
redirecting all writes to both devices regardless of the left cursor's
current value.

Thanks Josef!

> I guess this
> isn't a problem per-se, but maybe it would be good to keep track of
> when we created the new device extent and see if we already have it on
> the target device so we can avoid copying stuff we already have.
> Anyway this seems ok
>
> Reviewed-by: Josef Bacik <[email protected]>
>
> Thanks,
>
> Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to