Mark Fasheh wrote on 2016/05/10 15:11 -0700:
On Tue, May 10, 2016 at 03:19:52PM +0800, Qu Wenruo wrote:
Hi, Chris, Josef and David,
As the merge window for v4.7 is approaching, it would be good to hear your
thoughts on the in-band dedupe.
We are addressing the ENOSPC problem which Josef pointed out, and we
believe the final fix patch will be out at the beginning of the
merge window (next week).
How about the fiemap performance problem you referenced before? My guess is
that it happens because you don't coalesce writes into anything larger than
a page so you're stuck deduping at some silly size like 4k. This in turn
fragments the files so much that fiemap has a hard time walking backrefs.
Nope. The default dedupe size is 128K, and the dedupe size is limited to a
minimum of 64K and a maximum of 8M.
Yes, it's going to cause fragments, but just check the test case I
submitted: it doesn't even need dedupe to trigger the bug.
Clone range will also trigger it.
The problem is quite obvious: btrfs_check_shared() is making the wrong judgment.
It uses find_parent_nodes() to do the shared check, while it's entirely
possible for a file extent to be shared with other inodes inside the same root.
So our fix changes btrfs_check_shared(): if a file extent is shared
even within the same inode at different offsets, it's a shared
extent, and btrfs_check_shared() will exit early.
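The idea of the fix can be sketched in userspace C. This is NOT the actual
btrfs code; the struct layout and function names here are illustrative
assumptions. The point is only the decision rule: an extent is shared as soon
as two backrefs differ in root, inode, *or* offset, so even one inode
referencing the extent at two file offsets counts as shared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backref identity: (root, inode, offset). */
struct backref {
	unsigned long long root;
	unsigned long long inode;
	unsigned long long offset;
};

/*
 * Sketch of the corrected check: bail out as "shared" on the first
 * backref that differs from the first one in ANY field, including
 * offset within the same inode.
 */
static int extent_is_shared(const struct backref *refs, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++) {
		if (refs[i].root   != refs[0].root ||
		    refs[i].inode  != refs[0].inode ||
		    refs[i].offset != refs[0].offset)
			return 1;	/* shared: exit early */
	}
	return 0;			/* exactly one logical reference */
}
```

With this rule, a clone-range within a single file (same root, same inode,
different offsets) is correctly reported as shared, which is the case the
old find_parent_nodes()-based check missed.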
The fix has shown pretty positive results so far, and we are just doing the
final check.
It should be out in the coming days.
Yes, there is still a root cause hidden behind the whole thing:
the slow (non-cached) backref walk, which not only affects the old
check_shared() function but has an even bigger impact on qgroup.
(Walking through all 8192 backrefs of a fully deduped 1G file takes
about 5 full seconds.)
However, a cache for the backref walk is almost impossible to implement, as
any snapshot can easily invalidate all the cache we have built up.
For that case, we are going to add a TTL (time to live) for the in-memory
cache after v4.7, to ensure a dedupe hash is discarded after a specified
number of cache hits (64 or 128 seems quite reasonable to me; of course it's
tunable).
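A minimal userspace sketch of that hit-count "TTL" idea follows. This is an
assumption about the design, not btrfs code: every name and the constant are
illustrative. The entry serves a bounded number of lookups and then
invalidates itself, so a stale cached result cannot survive indefinitely even
if snapshots have changed the backref picture underneath it.

```c
#include <assert.h>

/* Assumed tunable: hits served before the cached hash is discarded. */
#define CACHE_TTL_HITS 64

/* Hypothetical in-memory dedupe-hash cache entry. */
struct hash_entry {
	unsigned long long hash;
	unsigned int hits;	/* lookups served since (re)insertion */
	int valid;
};

/*
 * Returns 1 on a usable cache hit, 0 on a miss or an expired entry
 * (in which case the caller would redo the full backref walk and
 * reinsert a fresh entry).
 */
static int cache_lookup(struct hash_entry *e, unsigned long long hash)
{
	if (!e->valid || e->hash != hash)
		return 0;
	if (++e->hits >= CACHE_TTL_HITS)
		e->valid = 0;	/* expire: force a fresh walk next time */
	return 1;		/* this hit is still served */
}
```

The design trade-off is that correctness never depends on invalidating the
cache on snapshot creation; staleness is simply bounded by the hit budget.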
Oh, so now you're going to think I'm not fixing the thing?
It's an *EXISTING* defect, and it's dedupe that made *you* aware of the bug.
I have to check the patches to be sure but perhaps you can tell me whether
my hunch is correct or not.
In fact, I actually asked privately for time to review your dedupe patches,
but I've been literally so busy cleaning up after the mess you left in your
last qgroups rewrite I haven't had time.
You literally broke qgroups in almost every spot that matters. In some cases
(drop_snapshot) you tore out working code and left in a /* TODO */ comment
for someone else to complete. Snapshot create was so trivially and
completely broken by your changes that weeks later, I'm still hunting a
solution which doesn't involve adding an extra _commit_ to our commit. This
is a MASSIVE regression from where we were before.
If you think my rework is a mess, then before it, qgroup was just
rubbish: it couldn't even handle reflink between subvolumes (btrfs/091), not
to mention the hell of leaked reserved space (btrfs/099).
You were just overlooking the old qgroup problems, assuming the old code
worked, and using the corner cases to express your disappointment.
The rework is mainly focused on fixing the main and real part of
qgroup: the normal accounting routine.
If you think the already-broken qgroup is better just because it can
handle snapshot creation and subvolume deletion while every other part is broken,
then feel free to revert all my qgroup rework and provide a better
solution.
IMHO, you should not be trusted with large features or rewrites until you
can demonstrate:
- A willingness to *completely* solve the problem you are trying to 'fix',
not do half the job which someone else will have to complete for you.
OK, so just let qgroup stay broken forever?
Did you ever notice the problem before the rework? What were you doing
before that?
Did you ever try to fix it before the rework?
Yes, good corner cases plus good overall accounting would be perfect.
But it's super weird for you to think that broken overall accounting with good
corner cases is better than good overall accounting with broken corner cases.
Any sane person would fix the main accounting first, and then the corner cases.
- Actual testing. The snapshot bug I reference above exists purely because
nobody created a snapshot inside of one and checked the qgroup numbers!
Just check the qgroup test cases.
The same words can be said of you.
Why didn't you spot the leaked reserved space, the reflink problem, or even
the long-failing test cases in the qgroup test group?
It's always easier to complain about others' work than to provide a good one.
Thanks,
Qu
Sorry to be so harsh.
--Mark
--
Mark Fasheh