Re: Data Deduplication with the help of an online filesystem check

Ric Wheeler Mon, 04 May 2009 07:35:22 -0700

On 04/28/2009 01:41 PM, Michael Tharp wrote:

Thomas Glanzmann wrote:
No, I just used the MD5 checksum. And even if I hit a hash collision,
which is highly unlikely, it still gives a good ballpark figure.
I'd start with a crc32 and/or MD5 to find candidate blocks, then do a bytewise comparison before actually merging them. Even the risk of an accidental collision is too high, and considering there are plenty of birthday-style MD5 attacks it would not be extraordinarily difficult to construct a block that collides with e.g. a system library.
Keep in mind that although digests do a fairly good job of making unique identifiers for larger chunks of data, they can only hold so many unique combinations. Considering you're comparing blocks of a few kibibytes in size it's best to just do a foolproof comparison. There's nothing wrong with using a checksum/digest as a screening mechanism though.
-- m. tharp
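
For illustration, the screen-then-verify scheme described above might look roughly like the sketch below. It is only a minimal sketch and not code from any filesystem: the function name find_duplicate_blocks, the screen parameter and the in-memory list of blocks are all assumptions made for the example. The digest is used purely to find candidates, and a block is reported as a duplicate only after a bytewise comparison succeeds.

```python
import hashlib
from collections import defaultdict


def md5_screen(block):
    """Default screening digest (MD5), used only to find candidates."""
    return hashlib.md5(block).digest()


def find_duplicate_blocks(blocks, screen=md5_screen):
    """Return (duplicate_index, original_index) pairs.

    `blocks` is a list of bytes objects (hypothetical in-memory input).
    A pair is reported only when the two blocks are bytewise identical,
    so a digest collision never causes a merge.
    """
    # digest -> indices of distinct blocks that share this digest
    candidates = defaultdict(list)
    duplicates = []
    for i, block in enumerate(blocks):
        digest = screen(block)
        for j in candidates[digest]:
            if blocks[j] == block:  # the foolproof bytewise comparison
                duplicates.append((i, j))
                break
        else:
            # New digest, or a colliding digest with different content.
            candidates[digest].append(i)
    return duplicates
```

Passing a deliberately weak screen (for example, one that returns only the block length) is a quick way to check that colliding but different blocks are never reported as duplicates.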

One thing in the above scheme that would be really interesting for all possible hash functions is maintaining good stats on hash collisions, effectiveness of the hash, etc. There has been a lot of press about MD5 hash collisions for example - it would be really neat to be able to track real world data on those.
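
As a rough sketch of what such bookkeeping could look like (the class name HashStats and its counters are made up for this example, and keeping every distinct block in memory is an assumption that only works for small experiments), one could count, per hash function, how often a matching digest was a true duplicate and how often it was a collision:

```python
import hashlib
import zlib
from collections import Counter


class HashStats:
    """Per-hash-function counters for duplicates and collisions."""

    def __init__(self, name, hash_fn):
        self.name = name
        self.hash_fn = hash_fn
        self.seen = {}           # digest -> list of distinct block contents
        self.counts = Counter()  # "blocks", "true_duplicates", "collisions"

    def observe(self, block):
        self.counts["blocks"] += 1
        bucket = self.seen.setdefault(self.hash_fn(block), [])
        if not bucket:
            bucket.append(block)                # first block with this digest
        elif block in bucket:                   # bytewise-equal block seen before
            self.counts["true_duplicates"] += 1
        else:                                   # same digest, different content
            self.counts["collisions"] += 1
            bucket.append(block)


# Hypothetical usage: compare two screening functions on the same blocks.
trackers = [
    HashStats("crc32", zlib.crc32),
    HashStats("md5", lambda b: hashlib.md5(b).digest()),
]
for block in (b"aa", b"ab", b"aa"):
    for tracker in trackers:
        tracker.observe(block)
```

Comparing the collisions counter across trackers fed with the same data would give the kind of real-world numbers mentioned above.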


Ric

