On 07/31/2014 07:54 PM, Timofey Titovets wrote:
> Good time of day.
> I have several questions about data deduplication on btrfs.
> Sorry if I ask stupid questions or waste your time %)
> 
> What about the implementation of offline data deduplication? I don't see
> any activity in this area; maybe I need to ask a particular person?
> Where is the problem? Maybe I can try to help (testing, for example)?
> 
> I could be wrong, but as I understand it btrfs stores one crc32 checksum
> per file. If this is true, maybe it makes sense to create a small worker
> for deduplicating files, like the worker for autodefrag?
> With simple logic like:
> if sum1 == sum2 && file_size1 == file_size2; then
> if (bit_to_bit_identical(file1, file2)); then merge(file1, file2);
> This could be a first attempt at implementing per-file offline dedup.
> What do you think about it? Could I be wrong? Or is this a horrible crutch?
> (as I understand it, this does not change the format of the fs)
> 
> (bedup and other tools are cool, but I have several problems with these
> tools, and I think a kernel implementation could work better.)
> 
I think there may be some misunderstandings here about some of the
internals of BTRFS.  First of all, checksums are stored per block, not
per file, and secondly, deduplication can be done on a much finer scale
than individual files (you can deduplicate individual extents).
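
To illustrate the per-block point, here is a minimal userspace sketch (not
BTRFS code) that finds duplicate data at block granularity rather than per
file.  It assumes a 4096-byte block size and uses zlib's plain CRC-32 as a
stand-in for the filesystem's own block checksum; the function names are
made up for the example.  A checksum match is treated only as a hint and is
confirmed by comparing the bytes.

```python
import zlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_checksums(data, block_size=BLOCK_SIZE):
    """Yield (offset, crc, block) for each fixed-size block of data."""
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # zlib.crc32 is a stand-in for the filesystem's block checksum.
        yield offset, zlib.crc32(block), block


def find_duplicate_blocks(files, block_size=BLOCK_SIZE):
    """Group byte-identical blocks across the given files.

    files maps a name to its contents (bytes).  Returns a list of groups;
    each group is a list of (name, offset) locations with identical data.
    """
    by_crc = defaultdict(list)
    for name, data in files.items():
        for offset, crc, block in block_checksums(data, block_size):
            by_crc[(crc, len(block))].append((name, offset, block))

    groups = []
    for candidates in by_crc.values():
        # Equal checksums do not prove equal data: verify the contents.
        by_content = defaultdict(list)
        for name, offset, block in candidates:
            by_content[block].append((name, offset))
        groups.extend(locs for locs in by_content.values() if len(locs) > 1)
    return groups
```

Two files that differ as a whole can still share blocks this way, which a
one-checksum-per-file comparison would never notice.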

I do think, however, that having the option of a background thread doing
deduplication asynchronously is a good idea, but then you would have to
have some way to trigger it on individual files/trees, and triggering on
writes, as the autodefrag thread does, doesn't make much sense.  Having
some userspace program to tell it to run on a given set of files would
probably be the best approach for a trigger.  I don't remember if this
kind of thing was also included in the online deduplication patches that
got posted a while back or not.
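
As a rough sketch of the userspace half of such a trigger, the following
walks a given set of files or directory trees and returns groups of files
whose contents are identical, using size as a cheap first filter and a full
byte comparison to confirm.  The function name is hypothetical, and how the
resulting groups would be handed to a kernel thread is deliberately left
out, since no such interface is described here.

```python
import filecmp
import os
from collections import defaultdict


def candidate_groups(paths):
    """Return lists of regular files under paths with identical contents."""
    by_size = defaultdict(list)
    for root in paths:
        if os.path.isfile(root):
            walk = [(os.path.dirname(root), [], [os.path.basename(root)])]
        else:
            walk = os.walk(root)
        for dirpath, _dirs, names in walk:
            for name in sorted(names):
                full = os.path.join(dirpath, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.isfile(full) and not os.path.islink(full):
                    by_size[os.path.getsize(full)].append(full)

    groups = []
    for size, same_size in by_size.items():
        if size == 0 or len(same_size) < 2:
            continue
        remaining = same_size
        while remaining:
            first, rest = remaining[0], remaining[1:]
            # shallow=False forces a byte-for-byte comparison.
            equal = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if equal:
                groups.append([first] + equal)
            remaining = [p for p in rest if p not in equal]
    return groups
```

Calling candidate_groups(["/some/tree"]) (path chosen only as an example)
yields the file sets that a deduplication pass would be asked to work on.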
