Re: Btrfs send to send out metadata and data separately

Qu Wenruo Mon, 01 Aug 2016 18:25:42 -0700


At 08/02/2016 02:00 AM, Filipe Manana wrote:

On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:

Hi Filipe, and maintainers,

I'm recently working on the root fix to free send from calling backref walk.

My current idea is to send data and metadata separately, and only do clone
detection inside the send subvolume.

This method needs two new send commands:
(And new send attribute, A_DATA_BYTENR)
1) SEND_C_DATA
   much like SEND_C_WRITE, with a little change in the 1st TLV.

   TLVs:
   A_DATA_BYTENR:        bytenr of the data extent
   A_FILE_OFFSET:        offset inside the data extent
   A_DATA:               real data

2) SEND_C_CLONE_DATA
   A little like SEND_C_CLONE, with unneeded parameters striped

   TLVs:
   A_PATH:               filename
   A_DATA_BYTENR:        disk_bytenr of the EXTENT_DATA
   A_FILE_OFFSET:        file offset
   A_FILE_OFFSET:        offset inside the EXTENT_DATA
   A_CLONE_LEN:          num_bytes of the EXTENT_DATA


The send part is different in how to sending out a EXTENT_DATA.
The send work follow is:

1) Found a EXTENT_DATA to send.
   Check rb_tree of "disk_bytenr".
   if "disk_bytenr" in rb_tree
     goto 2) Reflink data
   /* Initiate a SEND_C_DATA */
   Send out the *whole* *uncompressed* extent of "disk_bytenr".
   Adds "disk_bytenr" into rb_tree


2) Reflink data
   /* Initiate a SEND_C_CLONE_DATA */
   Filling disk_bytenr, offset and num_bytes, and send out the command.

That's to say, send will send out extent data and referencer separately.

So for kernel part, it's quite easy and *NO* time consuming backref walk
ever.
And no other part is modified.


The main trick happens in the receive part.

Receive will do the following thing first before recovering the
subvolume/snapshot:

0) Create temporary dir for data extents
   Create a new dir with temporary name($data_extent), to put data extents
into it.

Then for SEND_C_DATA command:
1) Create file with file name $filename under $data_extent dir
   filename = $(printf "0x%x" $disk_bytenr)
   $disk_bytenr is the first u64 TLV of SEND_A_DATA command.
2) Write data into $data_extent/$filename

Then handle the SEND_C_CLONE_DATA command
It would be like
  xfs_io -f -c "reflink $data_extent/$disk_bytenr $extent_offset
                $file_offset $num_bytes" $filename
disk_bytenr=2nd TLV (string converted to u64, with "0x%x")
extent_offset=3rd TLV, u64
file_offset=4th TLV, u64
num_bytes=5th TLV, u64
filename=1th TLV, string

Finally, after the snapshot/subvolume is recovered, remove the $data_extent
directory.


The whole idea is to completely remove the time consuming backref walk in
send.

So pros:
1) No backref walk, no soft lockup, no super long execution time
   Under worst case O(N^2), best case O(N)
   Memory usage worst case O(N), best case O(1)
   Where N is the number of reference to extents.

2) Almost the same metadata layout
   Including the overlap extents

Cons:
1) Not full fs clone detection
   Such clone detection is only inside the send snapshot.

   For case that one extent is referred only once in the send snapshot,
   but also referred by source subvolume, then in the received
   subvolume, it will be a new extent, but not a clone.

   Only extent that is referred twice by send snapshot, that extent
   will be shared.

   (Although much better than disabling the whole clone detection)
2) Extra space usage
   Since it completely recovers the overlap extents
3) As many fragments as source subvolume
4) Possible slow recovery due to reflink speed.


I am still concerned about the following problems:

1) Is it OK to add not only 1, but 2 new send commands?
2) Is such clone detection range change OK?

Any ideas and suggestion is welcomed.


Hi Filipe,

Thanks for your comment, it helps to refine the idea to fix the problem.

New idea is stated at the ending of the mail, hopes it will address allyour conern.

Qu,

I don't like the idea at all, for several reasons:

1) Too complex to implement. We should really avoid making things more
complex than they are already.

Yes, new command new TLV new receiver behavior, the whole idea itself iscomplex.


But the core function for clone detection is simple.
Rb_tree for sent extent bytenr, and avoid sending sent extents.
At least it avoids doing expensive backref walk at all.

My new idea will keep the core function, while stripe all the newcommand/TLV/receiver behavior.

   Your earlier suggestion to cache backref lookups is much simpler
and solves the problem for the vast majority of cases (assuming a
bounded cache of course).

In fact, my earlier suggestion is not to cache backref walk result, butjust like this one, implement a internal, simpler backref mapper.


The biggest problem of backref cache is, it conflicts with snapshot.
Any snapshot will easily trash backrefs of a tree.

It means either we do a full tree walk to trash all backref cache,making snapshot much slower, or a broken cache.

(And it adds more complexity to the already complex backref walk)

    There's really no need for such high complexity.

2) By adding new commands to the stream, you break backwards compatibility.
   Think about all the tools out there that interpret send streams and
not just the receive command (for example snapper).

3) By requiring a new different behaviour for the receiver, suddenly
older versions of it will no longer be able to receive from new
kernels.


That's the real problem, I'd try to get rid of these new commands.

My original plan is to introduce new send flag, and make "btrfs send"command to try to use new flag if possible.


But the incompatibility is still here.


4) By keeping temporary files on the receiver end that contains whole
extents, you're creating periods of time where stale data is exposed.



At least, I didn't see direct objection against the changed clone range.
(From original parent/send subvol range to send subvol range only)
That's a good news, the core idea behind the fix is still OK.


Then the new idea, plan B.

Plan B is much like the original plan A, with same clone detectionrange(inside the send subvolume, not cross subvolumes).

The modification is, this time, we don't only record extent disk bytenrinto rb_tree at send time, but more things:

1) extent disk bytenr
2) extent offset
3) extent nr bytes
4) file offset
5) path

So that, we can reuse original write and clone command, no newcommand/TLV is needed at all.

But at the cost of keeping the rb_tree sync with other operations, likeunlink.

And it causes higher memory usage.

The new commands in plan A is just used to avoid such complex rb_treenor complex sync.


What do you think about this Plan B?

Thanks,
Qu


Thanks.


Thanks,
Qu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Btrfs send to send out metadata and data separately

Reply via email to