On Wed, Aug 24, 2016 at 3:36 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
> At 08/04/2016 09:52 AM, Qu Wenruo wrote:
>>
>> At 08/03/2016 05:05 PM, Filipe Manana wrote:
>>>
>>> On Tue, Aug 2, 2016 at 2:20 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>>>>
>>>> At 08/02/2016 02:00 AM, Filipe Manana wrote:
>>>>>
>>>>> On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>>
>> [snipped]
>>>>
>>>> Hi Filipe,
>>>>
>>>> Thanks for your comments; they helped refine the idea to fix the problem.
>>>>
>>>> The new idea is stated at the end of the mail; I hope it will address
>>>> all your concerns.
>>>>
>>>>> Qu,
>>>>>
>>>>> I don't like the idea at all, for several reasons:
>>>>>
>>>>> 1) Too complex to implement. We should really avoid making things more
>>>>> complex than they already are.
>>>>
>>>> Yes: a new command, a new TLV and new receiver behavior; the whole idea
>>>> itself is complex.
>>>>
>>>> But the core function for clone detection is simple: an rb_tree of sent
>>>> extent bytenrs, so we avoid re-sending extents that were already sent.
>>>> At the very least it avoids doing the expensive backref walk at all.
>>>>
>>>> My new idea keeps that core function, while stripping out all the new
>>>> command/TLV/receiver behavior.
>>>>
>>>>> Your earlier suggestion to cache backref lookups is much simpler
>>>>> and solves the problem for the vast majority of cases (assuming a
>>>>> bounded cache of course).
>>>>
>>>> In fact, my earlier suggestion was not to cache backref walk results
>>>> but, just like this one, to implement an internal, simpler backref
>>>> mapper.
>>>>
>>>> The biggest problem with a backref cache is that it conflicts with
>>>> snapshots: any snapshot will easily trash the backrefs of a tree.
>>>>
>>>> That means either we do a full tree walk to invalidate the backref
>>>> cache, making snapshot creation much slower, or we accept a broken
>>>> cache.
>>>> (And it adds more complexity to the already complex backref walk.)
>>>>
>>>>> There's really no need for such high complexity.
>>>>>
>>>>> 2) By adding new commands to the stream, you break backwards
>>>>> compatibility.
>>>>> Think about all the tools out there that interpret send streams and
>>>>> not just the receive command (for example snapper).
>>>>>
>>>>> 3) By requiring new, different behaviour from the receiver, suddenly
>>>>> older versions of it will no longer be able to receive from new
>>>>> kernels.
>>>>
>>>> That's the real problem; I'd like to get rid of these new commands.
>>>> My original plan was to introduce a new send flag, and make the
>>>> "btrfs send" command try to use the new flag when possible.
>>>>
>>>> But the incompatibility would still be there.
>>>>
>>>>> 4) By keeping temporary files on the receiver end that contain whole
>>>>> extents, you're creating periods of time where stale data is exposed.
>>>>
>>>> At least I didn't see a direct objection against the changed clone
>>>> range (from the original parent/send subvolume range to the send
>>>> subvolume range only).
>>>> That's good news; the core idea behind the fix is still OK.
>>>>
>>>> So here is the new idea, plan B.
>>>>
>>>> Plan B is much like the original plan A, with the same clone detection
>>>> range (inside the send subvolume, not across subvolumes).
>>>>
>>>> The modification is that this time we don't record only the extent
>>>> disk bytenr in the rb_tree at send time, but more:
>>>> 1) extent disk bytenr
>>>> 2) extent offset
>>>> 3) extent nr bytes
>>>> 4) file offset
>>>> 5) path
>>>>
>>>> That way we can reuse the original write and clone commands; no new
>>>> command/TLV is needed at all.
>>>>
>>>> But it comes at the cost of keeping the rb_tree in sync with other
>>>> operations, like unlink.
>>>
>>> I don't get it. The snapshots used by a send operation are read-only;
>>> it's not possible to unlink files/extents while send is in progress.
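[As a side note, the record plan B describes could be sketched roughly as below. This is a hypothetical userspace sketch: all names (`sent_extent`, `cache_record`, `cache_lookup`) are invented, and a plain linked list stands in for the kernel's rb_tree just to keep the example self-contained.]

```c
/* Hypothetical sketch of plan B's send-side cache: one record per file
 * extent already emitted, keyed by disk bytenr.  The real implementation
 * would use an rb_tree; a linked list keeps this sketch self-contained. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct sent_extent {
	uint64_t disk_bytenr;   /* 1) extent disk bytenr                 */
	uint64_t extent_offset; /* 2) offset into the on-disk extent     */
	uint64_t nr_bytes;      /* 3) length of the file extent          */
	uint64_t file_offset;   /* 4) offset in the file it was sent to  */
	char path[256];         /* 5) path of that file                  */
	struct sent_extent *next;
};

static struct sent_extent *cache;

/* Record an extent after emitting a regular write command for it. */
static void cache_record(uint64_t bytenr, uint64_t eoff, uint64_t len,
			 uint64_t foff, const char *path)
{
	struct sent_extent *se = calloc(1, sizeof(*se));

	se->disk_bytenr = bytenr;
	se->extent_offset = eoff;
	se->nr_bytes = len;
	se->file_offset = foff;
	strncpy(se->path, path, sizeof(se->path) - 1);
	se->next = cache;
	cache = se;
}

/* Return a previously sent record for the same disk extent, or NULL;
 * a hit means we can emit a clone command instead of another write. */
static struct sent_extent *cache_lookup(uint64_t bytenr)
{
	struct sent_extent *se;

	for (se = cache; se; se = se->next)
		if (se->disk_bytenr == bytenr)
			return se;
	return NULL;
}
```

[The point of recording items 2-5 alongside the bytenr is exactly what plan B states: a hit gives the sender everything it needs to build a clone command, so no new command or TLV is required.]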
>>
>> My fault; "unlink" in send only means unlinking an inode which is in
>> the parent subvolume but not in the send subvolume anymore.
>>
>> So no syncing with the cache is needed.
>>>
>>>> And it causes higher memory usage.
>>>>
>>>> The new commands in plan A were only there to avoid such a complex
>>>> rb_tree and such complex syncing.
>>>>
>>>> What do you think about this plan B?
>>>
>>> Let's not rush things; let's keep things simple.
>>> A (bounded) cache of backref lookups, on the send side and per inode,
>>> is simple and solves your original use case.
>>
>> I'm still not very clear on your cache idea.
>>
>> 1) About "per-inode"
>>
>> Did you mean the lifetime of the cache is the sending of one inode?
>>
>> If so, it may solve the problem of the submitted test case, but it
>> won't help in the following case, where different files all share the
>> same extent:
>>
>> Inode 1000, file offset 0 len 128K: disk bytenr X, extent offset 0
>> Inode 1001, file offset 0 len 128K: disk bytenr X, extent offset 0
>> ...
>> Inode 5096, file offset 0 len 128K: disk bytenr X, extent offset 0
>>
>> And that's why my first idea was to make the cache per-subvolume.
>>
>> 2) About "cache of backref lookups"
>>
>> IMHO, we have a lot of things to consider here.
>> For example, in the following case:
>>
>> Extent X: len 16M
>>
>> Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
>> Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
>> ...
>> Inode 4095: file offset 0 len 4K: disk bytenr X, extent offset 16M-4K
>>
>> For the first referencer, we must do the full backref walk.
>> And cache what? The complex tuple of
>> (extent disk bytenr, extent offset, file offset, file path)?
>>
>> That tuple already seems complex to me.
>>
>> And not to mention that such a cache doesn't have any effect in the
>> above case: every file extent will go through the backref walk, and it
>> won't return for a long time.
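[The cache-lifetime question Qu raises can be made concrete with a toy model: N inodes each reference the same disk extent X. With a cache discarded after each inode, every inode pays a full backref walk; with a cache that lives for the whole subvolume send, only the first one does. Everything below (the enum, `walks_needed`) is invented purely for illustration.]

```c
/* Toy model of per-inode vs per-subvolume cache lifetime for N inodes
 * that all reference the same disk extent.  Counts how many full
 * backref walks each scope would cost. */
#include <assert.h>

enum cache_scope { PER_INODE, PER_SUBVOL };

static unsigned walks_needed(enum cache_scope scope, unsigned nr_inodes)
{
	unsigned walks = 0;
	int cached = 0;
	unsigned i;

	for (i = 0; i < nr_inodes; i++) {
		if (scope == PER_INODE)
			cached = 0;	/* cache discarded between inodes */
		if (!cached) {
			walks++;	/* full backref walk on extent X */
			cached = 1;	/* result cached for later hits  */
		}
	}
	return walks;
}
```

[Under this model, sending the 1000..5096 range above costs one walk per inode with a per-inode cache, versus a single walk for the whole send with a per-subvolume one, which is the motivation for Qu's per-subvolume suggestion.]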
>>
>> Furthermore, the above case can be made even more complex:
>>
>> Inode 999: file offset 0 len 16K: disk bytenr X, extent offset 16K
>> Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
>> Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
>> ...
>>
>> For inode 999, if we cache its result as (extent disk bytenr, extent
>> offset, file offset, file path), then when we send inode 1000 we will
>> have to do an extent overlap check to find that inode 1000 can reflink
>> to the first 4K of inode 999.
>>
>> IMHO, such a cache is far from simple.
>>
>> AFAIK, only my original plan A can handle the above cases with ease.
>>
>> Or did I misunderstand your idea of what to cache?
>
> Hi Filipe,
>
> Any update on the send problem?
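[The extent overlap check mentioned above could look roughly like this. A hypothetical sketch only: `struct range` and `clone_source` are invented names, the offsets are illustrative, and it handles just the simple fully-contained case, which is part of why Qu argues such a cache is "far from simple".]

```c
/* Given a cached record for a range of disk extent X that was already
 * sent, decide whether a new file extent referencing X can be emitted
 * as a clone, and from which offset in the already-sent file. */
#include <assert.h>
#include <stdint.h>

struct range {
	uint64_t eoff;     /* offset into the on-disk extent */
	uint64_t len;      /* length of the file extent      */
	uint64_t file_off; /* file offset it was sent to     */
};

/* Return 1 and set *src_off if the new extent lies entirely inside the
 * cached range; return 0 to fall back to a plain write command. */
static int clone_source(const struct range *cached,
			const struct range *new_ext, uint64_t *src_off)
{
	if (new_ext->eoff < cached->eoff)
		return 0;
	if (new_ext->eoff + new_ext->len > cached->eoff + cached->len)
		return 0;
	*src_off = cached->file_off + (new_ext->eoff - cached->eoff);
	return 1;
}
```

[Note that a real implementation would also need to handle partial overlaps, where only part of the new extent can be cloned and the rest still needs a write, which further supports the complexity argument.]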
Sorry, I have no time to go through this right now.

> Thanks,
> Qu
>
>> Thanks,
>> Qu
>>
>>>>> Thanks.
>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>> linux-btrfs" in
>>>>>> the body of a message to majord...@vger.kernel.org
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."