Chris Mason wrote on 2016/03/28 10:09 -0400:
On Sat, Mar 26, 2016 at 09:11:53PM +0800, Qu Wenruo wrote:


On 03/25/2016 11:11 PM, Chris Mason wrote:
On Fri, Mar 25, 2016 at 09:59:39AM +0800, Qu Wenruo wrote:


Chris Mason wrote on 2016/03/24 16:58 -0400:
Are you storing the entire hash, or just the parts not represented in
the key?  I'd like to keep the on-disk part as compact as possible for
this part.

Currently, it's the entire hash.

More details can be found in another mail.

Although it's OK for me to truncate the duplicated last 8 bytes (64 bits),
I still quite like the current implementation, as a single memcpy() is simpler.

[ sorry, FB makes urls look ugly, so I delete them from replies ;) ]

Right, I saw that but wanted to reply to the specific patch.  One of the
lessons learned from the extent allocation tree and file extent items is
that they are just too big.  Let's save those bytes; it'll add up.

OK, I'll reduce the duplicated last 8 bytes.

And I'll also remove the "length" member, as it can always be fetched from
dedupe_info->block_size.

This would mean dedup_info->block_size is a write once field.  I'm ok
with that (just like metadata blocksize) but we should make sure the
ioctls etc don't allow changing it.

Not a problem. Currently a block_size change is done by completely disabling dedupe (which implies a sync_fs), then re-enabling with the new block_size.

So it would be OK.



The length itself was used to verify whether we were in the transaction that
switched to a new dedupe size, but since we later moved to a full sync_fs(),
such behavior is not needed any more.
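
For illustration, the compacted item could end up like the following. This is only a sketch: the struct and constant names are mine, not the actual patch, and it assumes the hash->bytenr index mirrors the bytenr item quoted below (last 64 bits of the hash stored in the key):

#include <linux/types.h>

#define BTRFS_DEDUPE_HASH_LEN	32	/* SHA-256 digest size */

/*
 * Hypothetical compacted hash item: the last 8 bytes of the hash
 * already live in the btree key, and "length" is gone since it
 * always equals dedupe_info->block_size.
 */
struct btrfs_dedupe_hash_item {
	/* First 24 bytes of the SHA-256 hash; the tail is in the key. */
	__u8 hash[BTRFS_DEDUPE_HASH_LEN - sizeof(__u64)];
} __attribute__((packed));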





+
+/*
+ * Objectid: bytenr
+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
+ * Offset: last 64 bits of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent).
+ * All its content is the hash, so no special item struct is needed.
+ */
+

Can we do this instead with a backref from the extent?  It'll save us a
huge amount of IO as we delete things.

That's the original implementation from Liu Bo.

The problem is that it changes the data backref rules (originally, only an
EXTENT_DATA item can cause a data backref), and it would make dedupe INCOMPAT
instead of the current RO_COMPAT.
So I really don't want to change the data backref rules.

Let me reread this part; the cost of maintaining the second index is
dramatically higher than adding a backref.  I do agree that it's nice
to be able to delete the dedupe trees without impacting the rest, but
over the long term I think we'll regret the added balances.

Thanks for pointing out the problem. Yes, I hadn't even considered this fact.

But on the other hand, such a removal only happens when we remove the *last*
reference to the extent.
So for medium to high dedupe rates, this routine is not that frequent,
which reduces the impact.
(This is quite different from the non-dedupe case.)
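
For illustration, the free_extent side of this second index would look roughly like the following. Only the key layout comes from the comment quoted above; the function name and flow are hypothetical pseudo-kernel code, not the actual patch:

static int btrfs_dedupe_del_sketch(struct btrfs_trans_handle *trans,
				   struct btrfs_root *dedupe_root,
				   u64 bytenr)
{
	struct btrfs_key key;

	/* 1) Locate the bytenr item to recover the full hash. */
	key.objectid = bytenr;
	key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
	key.offset = 0;		/* hash tail unknown, range-search from 0 */

	/*
	 * 2) btrfs_search_slot() and read the hash from the item body.
	 * 3) Delete the bytenr item.
	 * 4) Build (last 64 bits of hash, HASH_ITEM, bytenr) and delete
	 *    the hash item too.
	 *
	 * That is two deletions in a dedicated tree per freed extent,
	 * whereas an extent backref would be dropped together with the
	 * extent item itself.
	 */
	return 0;
}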

It's both addition and removal, and the efficiency hit does depend on
what level of sharing you're able to achieve.  But what we don't want is
for metadata usage to explode as people make small non-duplicate changes
to their FS.  If that happens, we'll only end up using dedupe in backup
farms and other highly limited use cases.

Right, the current dedupe-specific backref brings unavoidable metadata overhead.

[[People accept trade-offs when using a non-default feature]]
Although IMHO dedupe is not a generic feature; just like compression and possibly encryption, people choose such features with the trade-offs in mind.

For example, compression can achieve quite high performance for easily compressible data, but quite low performance for poorly compressible data, like ISO files or videos. (In my test with a 2-core VM and virtio-blk on HDD, dd'ing an ISO into a btrfs file reaches about 90MB/s with the default mount options, while with compression it's only about 40~50MB/s.)

If we combine all the overhead together (not only metadata overhead), almost every current transparent data processing method benefits only specific use cases while reducing generic performance.

So the increased metadata overhead is acceptable to me, especially since the main overhead is the CPU time spent on SHA256.
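
To put that CPU cost in perspective, here is a minimal userspace sketch of the per-block hashing work (plain OpenSSL, not btrfs code; the 128K block size is just an example):

/* Build: gcc sha_block.c -lcrypto */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>

#define DEDUPE_BLOCK_SIZE (128 * 1024)	/* example dedupe block size */

int main(void)
{
	static unsigned char block[DEDUPE_BLOCK_SIZE];	/* one data block */
	unsigned char digest[SHA256_DIGEST_LENGTH];
	uint64_t key_tail;

	/* The per-block work dedupe adds to every write. */
	SHA256(block, sizeof(block), digest);

	/* The last 64 bits of the hash are what ends up in the btree key. */
	memcpy(&key_tail, digest + SHA256_DIGEST_LENGTH - sizeof(key_tail),
	       sizeof(key_tail));
	printf("hash tail: %016llx\n", (unsigned long long)key_tail);
	return 0;
}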

And we have workarounds, from setting the dedupe-disable prop to setting a larger dedupe block_size, to keep small non-duplicate writes from filling the dedupe tree.



I do agree that delayed refs are error prone, but that's a good reason
to fix delayed refs, not to recreate the backrefs of the extent
allocation tree in a new dedicated tree.

[[We need an approach generic to both backends]]
Also, I want to mention that dedupe now has 2 different backends, so we'd better choose an approach that won't split the backends across different incompat/ro_compat flags.

If we use the backref method, the ondisk backend will definitely make dedupe INCOMPAT, affecting the in-memory backend even though it's completely backward-compatible.

Or, split the dedupe flag into DEDUPE_ONDISK and DEDUPE_INMEMORY, where the former is INCOMPAT while the latter is at most RO_COMPAT (if using the dedupe tree).
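
Concretely, the split might look like this (both the names and the bit positions are purely illustrative; neither flag exists in btrfs):

/* Ondisk backend changes the data backref rules: full INCOMPAT. */
#define BTRFS_FEATURE_INCOMPAT_DEDUPE_ONDISK	(1ULL << 12)

/* In-memory backend (with an optional dedupe tree): RO_COMPAT at most. */
#define BTRFS_FEATURE_COMPAT_RO_DEDUPE_INMEMORY	(1ULL << 2)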


[[Cleaner layout is less bug-prone]]
The last point in favor of the dedupe-specific backref is that it limits the impact of possible bugs, which for me is more important than performance.

The current implementation confines a dedupe backref bug to dedupe only,
while a new extent backref bug would definitely impact almost all btrfs functionality.

Thanks,
Qu


-chris




