Bill Sommerfeld wrote:
[This is version 2; the first one escaped early by mistake.]

On Sun, 2007-06-24 at 16:58 -0700, dave johnson wrote:
The most common non-proprietary hash calculation for file-level deduplication seems to be the combination of SHA1 and MD5 together. Collisions have been shown to exist in MD5 and are theorized to exist in SHA1 by extrapolation, but the probability of collisions occurring simultaneously in both is as "small" as the capacity of ZFS is "large" :)

No.  Collisions in *any* hash function with output smaller than its
input are known to exist by a simple counting argument (you can't fit
kilobytes of information into a 128- or 160-bit bucket).  The tricky
part lies in finding collisions faster than a brute-force search would
find them.
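
To put a number on "small", here is a back-of-the-envelope birthday
bound -- a sketch only, where the pool size and record count are
made-up assumptions:

# Birthday-bound estimate: with n random inputs and a b-bit hash, the chance
# of *any* accidental collision is at most roughly n^2 / 2^(b+1).

def collision_bound(n_blocks, hash_bits):
    """Upper bound on the probability that any two distinct blocks collide."""
    return n_blocks ** 2 / 2 ** (hash_bits + 1)

# Hypothetical pool: 2^48 unique 128K records (tens of exabytes),
# checksummed with sha256.
print(collision_bound(2 ** 48, 256))    # ~3.4e-49

That covers accidental collisions only, and only if sha256 keeps
behaving like a random function; deliberately constructed collisions
are a different story, which is the worry below.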

Last I checked, the cryptographers specializing in hash functions were
pessimistic; the breakthroughs in collision-finding by Wang & crew a
couple of years ago had revealed how little the experts actually knew
about building collision-resistant hash functions.  The advice to
those of us who have come to rely on that property was to migrate now
to sha256/sha512 (notably, ZFS uses sha256, not sha1), and then migrate
again once the cryptographers felt they had a better grip on the
problem; the fear was that the newly discovered attacks would
generalize to sha256.

But there's another way -- design the system so correct behavior doesn't
rely on collisions being impossible to find.

I wouldn't de-duplicate without first verifying that the two blocks or
files really are bitwise identical; if you do this, the
collision-resistance of the hash function becomes far less important
to correctness.
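
Concretely, the write path ends up shaped something like this toy
sketch (Python, purely illustrative; the table and names are invented
and bear no resemblance to actual ZFS code):

import hashlib

# Toy in-memory dedup table: digest -> list of stored blocks sharing that
# digest.  Real code would map digests to on-disk block pointers; the point
# here is only the bitwise verify step, which makes correctness independent
# of how collision-resistant the hash turns out to be.
_store = {}

def dedup_write(block):
    digest = hashlib.sha256(block).digest()
    for existing in _store.get(digest, []):
        if existing == block:        # bitwise compare -- the important part
            return existing          # genuine duplicate: share it, don't copy
    # No bitwise match (hash was new, or only a colliding impostor matched):
    # store a fresh copy under this digest.
    _store.setdefault(digest, []).append(block)
    return block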
                                        - Bill
I'm assuming the de-duplication scheme would be run in much the same manner as scrub currently is under ZFS: that is, infrequently, batched, and interruptible. :-)

Long before we look at deduplication, I'd vote for optimizing the low-hanging fruit: the case where we KNOW that two files are identical (i.e. on copying). See the toy model below.
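
For illustration only (invented names, nothing ZFS-specific): if a copy shares the source's blocks by construction, the two files are identical by definition and no checksumming is needed at all.

# Toy block store where "copying" a file just bumps refcounts on the
# source's blocks instead of duplicating the data.
class BlockStore:
    def __init__(self):
        self.blocks = {}          # block id -> [data, refcount]
        self.next_id = 0

    def write(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = [data, 1]
        return bid

    def clone(self, block_ids):
        for bid in block_ids:
            self.blocks[bid][1] += 1      # share, don't copy
        return list(block_ids)

store = BlockStore()
files = {}
files["a"] = [store.write(b"hello "), store.write(b"world")]
files["b"] = store.clone(files["a"])      # "cp a b" with zero data duplicated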

Oh, and last I looked, there was no consensus on how much overlap there is between the inputs that cause collisions under MD5 and those that cause collisions under SHA1. That is, there is no confidence that the datasets which collide under MD5 will not also collide under SHA. They might, they might not; determining the scope of the overlap is still an open question (a bit like P vs. NP right now). So don't make any assumptions about the validity of using two different checksum algorithms. I think (as Casper said) that, should you need to, you should use SHA to weed out the cases where the checksums differ (since that definitively indicates the files are different), then do a bitwise compare on any that produce the same checksum, to see if they really are the same file.
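
In code form, that pipeline looks roughly like the sketch below (Python, illustrative only; the size pre-filter and chunk size are my own arbitrary choices, not anything Casper specified):

import filecmp
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Group files by size, then by sha256; confirm real duplicates bitwise."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_hash = defaultdict(list)
        for p in same_size:
            h = hashlib.sha256()
            with open(p, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.digest()].append(p)
        # A differing checksum definitively means different files; an equal
        # checksum is only a hint, so compare the bytes before believing it.
        for group in by_hash.values():
            first = group[0]
            for other in group[1:]:
                if filecmp.cmp(first, other, shallow=False):
                    duplicates.append((first, other))
    return duplicates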

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

