On Tue, Aug 2, 2016 at 2:20 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> At 08/02/2016 02:00 AM, Filipe Manana wrote:
>>
>> On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>>>
>>> Hi Filipe, and maintainers,
>>>
>>> I'm recently working on the root fix to free send from calling backref
>>> walk.
>>>
>>> My current idea is to send data and metadata separately, and only do
>>> clone
>>> detection inside the send subvolume.
>>>
>>> This method needs two new send commands:
>>> (And new send attribute, A_DATA_BYTENR)
>>> 1) SEND_C_DATA
>>>    much like SEND_C_WRITE, with a little change in the 1st TLV.
>>>
>>>    TLVs:
>>>    A_DATA_BYTENR:        bytenr of the data extent
>>>    A_FILE_OFFSET:        offset inside the data extent
>>>    A_DATA:               real data
>>>
>>> 2) SEND_C_CLONE_DATA
>>>    A little like SEND_C_CLONE, with unneeded parameters striped
>>>
>>>    TLVs:
>>>    A_PATH:               filename
>>>    A_DATA_BYTENR:        disk_bytenr of the EXTENT_DATA
>>>    A_FILE_OFFSET:        file offset
>>>    A_FILE_OFFSET:        offset inside the EXTENT_DATA
>>>    A_CLONE_LEN:          num_bytes of the EXTENT_DATA
>>>
>>>
>>> The send part is different in how to sending out a EXTENT_DATA.
>>> The send work follow is:
>>>
>>> 1) Found a EXTENT_DATA to send.
>>>    Check rb_tree of "disk_bytenr".
>>>    if "disk_bytenr" in rb_tree
>>>      goto 2) Reflink data
>>>    /* Initiate a SEND_C_DATA */
>>>    Send out the *whole* *uncompressed* extent of "disk_bytenr".
>>>    Adds "disk_bytenr" into rb_tree
>>>
>>>
>>> 2) Reflink data
>>>    /* Initiate a SEND_C_CLONE_DATA */
>>>    Filling disk_bytenr, offset and num_bytes, and send out the command.
>>>
>>> That's to say, send will send out extent data and referencer separately.
>>>
>>> So for kernel part, it's quite easy and *NO* time consuming backref walk
>>> ever.
>>> And no other part is modified.
>>>
>>>
>>> The main trick happens in the receive part.
>>>
>>> Receive will do the following thing first before recovering the
>>> subvolume/snapshot:
>>>
>>> 0) Create temporary dir for data extents
>>>    Create a new dir with temporary name($data_extent), to put data
>>> extents
>>> into it.
>>>
>>> Then for SEND_C_DATA command:
>>> 1) Create file with file name $filename under $data_extent dir
>>>    filename = $(printf "0x%x" $disk_bytenr)
>>>    $disk_bytenr is the first u64 TLV of SEND_A_DATA command.
>>> 2) Write data into $data_extent/$filename
>>>
>>> Then handle the SEND_C_CLONE_DATA command
>>> It would be like
>>>   xfs_io -f -c "reflink $data_extent/$disk_bytenr $extent_offset
>>>                 $file_offset $num_bytes" $filename
>>> disk_bytenr=2nd TLV (string converted to u64, with "0x%x")
>>> extent_offset=3rd TLV, u64
>>> file_offset=4th TLV, u64
>>> num_bytes=5th TLV, u64
>>> filename=1th TLV, string
>>>
>>> Finally, after the snapshot/subvolume is recovered, remove the
>>> $data_extent
>>> directory.
>>>
>>>
>>> The whole idea is to completely remove the time consuming backref walk in
>>> send.
>>>
>>> So pros:
>>> 1) No backref walk, no soft lockup, no super long execution time
>>>    Under worst case O(N^2), best case O(N)
>>>    Memory usage worst case O(N), best case O(1)
>>>    Where N is the number of reference to extents.
>>>
>>> 2) Almost the same metadata layout
>>>    Including the overlap extents
>>>
>>> Cons:
>>> 1) Not full fs clone detection
>>>    Such clone detection is only inside the send snapshot.
>>>
>>>    For case that one extent is referred only once in the send snapshot,
>>>    but also referred by source subvolume, then in the received
>>>    subvolume, it will be a new extent, but not a clone.
>>>
>>>    Only extent that is referred twice by send snapshot, that extent
>>>    will be shared.
>>>
>>>    (Although much better than disabling the whole clone detection)
>>> 2) Extra space usage
>>>    Since it completely recovers the overlap extents
>>> 3) As many fragments as source subvolume
>>> 4) Possible slow recovery due to reflink speed.
>>>
>>>
>>> I am still concerned about the following problems:
>>>
>>> 1) Is it OK to add not only 1, but 2 new send commands?
>>> 2) Is such clone detection range change OK?
>>>
>>> Any ideas and suggestion is welcomed.
>>
>>
>>
>
> Hi Filipe,
>
> Thanks for your comment, it helps to refine the idea to fix the problem.
>
> New idea is stated at the ending of the mail, hopes it will address all your
> conern.
>
>> Qu,
>>
>> I don't like the idea at all, for several reasons:
>>
>> 1) Too complex to implement. We should really avoid making things more
>> complex than they are already.
>
>
> Yes, new command new TLV new receiver behavior, the whole idea itself is
> complex.
>
> But the core function for clone detection is simple.
> Rb_tree for sent extent bytenr, and avoid sending sent extents.
> At least it avoids doing expensive backref walk at all.
>
> My new idea will keep the core function, while stripe all the new
> command/TLV/receiver behavior.
>
>>    Your earlier suggestion to cache backref lookups is much simpler
>> and solves the problem for the vast majority of cases (assuming a
>> bounded cache of course).
>
>
> In fact, my earlier suggestion is not to cache backref walk result, but just
> like this one, implement a internal, simpler backref mapper.
>
> The biggest problem of backref cache is, it conflicts with snapshot.
> Any snapshot will easily trash backrefs of a tree.
>
> It means either we do a full tree walk to trash all backref cache, making
> snapshot much slower, or a broken cache.
> (And it adds more complexity to the already complex backref walk)
>
>>     There's really no need for such high complexity.
>>
>> 2) By adding new commands to the stream, you break backwards
>> compatibility.
>>    Think about all the tools out there that interpret send streams and
>> not just the receive command (for example snapper).
>>
>> 3) By requiring a new different behaviour for the receiver, suddenly
>> older versions of it will no longer be able to receive from new
>> kernels.
>
>
> That's the real problem, I'd try to get rid of these new commands.
> My original plan is to introduce new send flag, and make "btrfs send"
> command to try to use new flag if possible.
>
> But the incompatibility is still here.
>
>>
>> 4) By keeping temporary files on the receiver end that contains whole
>> extents, you're creating periods of time where stale data is exposed.
>
>
>
> At least, I didn't see direct objection against the changed clone range.
> (From original parent/send subvol range to send subvol range only)
> That's a good news, the core idea behind the fix is still OK.
>
>
> Then the new idea, plan B.
>
> Plan B is much like the original plan A, with same clone detection
> range(inside the send subvolume, not cross subvolumes).
>
> The modification is, this time, we don't only record extent disk bytenr into
> rb_tree at send time, but more things:
> 1) extent disk bytenr
> 2) extent offset
> 3) extent nr bytes
> 4) file offset
> 5) path
>
> So that, we can reuse original write and clone command, no new command/TLV
> is needed at all.
>
> But at the cost of keeping the rb_tree sync with other operations, like
> unlink.

I don't get it. The snapshots used by a send operation are read-only -
it's not possible to unlink files/extents while send is in progress.

> And it causes higher memory usage.
>
> The new commands in plan A is just used to avoid such complex rb_tree nor
> complex sync.
>
> What do you think about this Plan B?

Lets not rush until things are simple.
A (bounded) cache of backref lookups, on the send side and per inode,
is simple and solves your original use case.

>
> Thanks,
> Qu
>
>
>
>
>>
>> Thanks.
>>
>>>
>>> Thanks,
>>> Qu
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>>
>
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to