On 2016-11-14 16:10, Zygo Blaxell wrote:
On Mon, Nov 14, 2016 at 02:56:51PM -0500, Austin S. Hemmelgarn wrote:
On 2016-11-14 14:51, Zygo Blaxell wrote:
Deduplicating an extent that might be concurrently modified during the
dedup is a reasonable userspace request.  In the general case there's
no way for userspace to ensure that it's not happening.
I'm not even talking about the locking, I'm talking about the data
comparison the ioctl does to ensure the two ranges really are identical
before deduplicating them, and specifically about how that protects
against userspace passing in two random extents that happen to be the
same size but don't contain the same data (deduplication _should_ reject
that situation; it's what the clone ioctl is for).

If I'm deduping a VM image, and the virtual host is writing to said image
(which is likely since an incremental dedup will be intentionally doing
dedup over recently active data sets), the extent I just compared in
userspace might be different by the time the kernel sees it.

This is an important reason why the whole lock/read/compare/replace step
is an atomic operation from userspace's PoV.

The read also saves having to confirm a short/weak hash isn't a collision.
The RAM savings from using weak hashes (~48 bits) are a huge performance
win.

The locking overhead is very small compared to the reading overhead,
and (in the absence of bugs) it will only block concurrent writes to the
same offset range in the src/dst inodes (based on a read of the code...I
don't know if there's also an inode-level or backref-level barrier that
expands the locking scope).
I'm not arguing that it's a bad thing that the kernel does this; I'm just saying that the locking overhead is minuscule in most cases compared to the data comparison. The whole verification step is absolutely necessary, for exactly the reasons you outline.
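
For anyone who hasn't looked at the interface, here is roughly what a single extent-same call looks like from userspace. This is only a minimal sketch using the FIDEDUPERANGE interface from linux/fs.h (the older BTRFS_IOC_FILE_EXTENT_SAME ioctl takes an equivalent argument structure); error handling is trimmed and the helper name is made up:

#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <linux/fs.h>

/* Minimal sketch: ask the kernel to dedupe one range of src_fd into
 * dst_fd.  The kernel locks both ranges, compares the bytes itself, and
 * only shares the extent if they really match, so userspace never has
 * to re-verify after the call. */
static int dedupe_range(int src_fd, __u64 src_off,
                        int dst_fd, __u64 dst_off, __u64 len)
{
        struct file_dedupe_range *args;
        struct file_dedupe_range_info *info;
        int ret;

        args = calloc(1, sizeof(*args) + sizeof(*info));
        if (!args)
                return -1;

        args->src_offset = src_off;
        args->src_length = len;
        args->dest_count = 1;
        info = &args->info[0];
        info->dest_fd = dst_fd;
        info->dest_offset = dst_off;

        ret = ioctl(src_fd, FIDEDUPERANGE, args);
        if (ret == 0 && info->status == FILE_DEDUPE_RANGE_DIFFERS)
                ret = 1;        /* kernel says the data doesn't match */
        else if (ret == 0 && info->status < 0)
                ret = -1;       /* per-range error reported by the kernel */

        free(args);
        return ret;
}

The point relevant to this discussion is that the locking, the byte comparison, and the extent replacement all happen inside that one call.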

I'm not sure the ioctl is well designed for simply throwing random
data at it, especially not entire files (it can't handle files over
16MB anyway).  It will read more data than it has to compared to a
block-by-block comparison from userspace with prefetches or a pair of
IO threads.  If userspace reads both copies of the data just before
issuing the extent-same call, the kernel will read the data from cache
reasonably quickly.
It still depends on the use case to a certain extent. In the case I was using as an example, I know to a reasonably certain degree (barring tampering, bugs, or hardware failure) that any two candidate files are identical, and I actually don't want to trash the page cache just to deduplicate data faster (the data set in question is large, but most of it is idle at any given point in time). So there's no point in me pre-reading everything in userspace, which in turn makes the script I use much simpler; the most complex part is figuring out how to split files bigger than the ioctl can handle into ranges such that I don't end up with tiny tail extents but still issue a minimal number of calls per file.
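
To give a concrete idea of what that splitting amounts to, the arithmetic boils down to something like the following. This is a hypothetical sketch rather than my actual script: the 16MiB figure matches the per-call limit mentioned above, the 128KiB alignment is just an assumption, and dedupe_range() is the illustrative helper from the earlier sketch:

/* Hypothetical chunking sketch: use the fewest chunks that each fit
 * under the per-call limit, sized roughly equally so the last chunk is
 * not a tiny leftover tail. */
#define MAX_DEDUPE_LEN  (16ULL * 1024 * 1024)  /* per-call limit */
#define ALIGN_LEN       (128ULL * 1024)        /* assumed alignment */

static void dedupe_whole_file(int src_fd, int dst_fd, __u64 file_size)
{
        __u64 nchunks, chunk, off;

        /* Fewest chunks that keep each piece <= MAX_DEDUPE_LEN. */
        nchunks = (file_size + MAX_DEDUPE_LEN - 1) / MAX_DEDUPE_LEN;

        /* Spread the file evenly across the chunks, rounding the chunk
         * size up to the alignment, instead of issuing full-size chunks
         * plus a small remainder at the end. */
        chunk = (file_size + nchunks - 1) / nchunks;
        chunk = (chunk + ALIGN_LEN - 1) / ALIGN_LEN * ALIGN_LEN;

        for (off = 0; off < file_size; off += chunk) {
                __u64 len = file_size - off;
                if (len > chunk)
                        len = chunk;
                dedupe_range(src_fd, off, dst_fd, off, len);
        }
}

For a 33MiB file, for example, that gives three 11MiB calls instead of two 16MiB calls plus a 1MiB tail.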

The locking is perfectly reasonable and shouldn't contribute that much to
the overhead (unless you're being crazy and deduplicating thousands of tiny
blocks of data).

Why is deduplicating thousands of blocks of data crazy?  I already
deduplicate four orders of magnitude more than that per week.
You missed the 'tiny' qualifier. I'm talking about really small blocks, on the order of less than 64k (IOW, stuff that's not much bigger than a few filesystem blocks). That is somewhat crazy because it not only takes _really_ long compared to larger chunks (you end up running far more independent hashes for the same amount of data), but it also tends to split extents unnecessarily and contribute to fragmentation, which leads to all kinds of other performance problems on the FS.
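
Just to put rough numbers on that, with sizes picked purely for illustration: a 1TiB data set at 4KiB granularity is roughly 268 million blocks to hash, track, and issue requests for, versus roughly a million blocks at 1MiB granularity. That's about 256 times the hash computations and ioctl calls for the same data, plus far more opportunities to carve what would otherwise be large contiguous extents into fragments.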