Bill Sommerfeld wrote:
[This is version 2; the first one escaped early by mistake.]

On Sun, 2007-06-24 at 16:58 -0700, dave johnson wrote:
The most common non-proprietary hash calculation for file-level deduplication seems to be the combination of SHA1 and MD5 together. Collisions have been shown to exist in MD5 and are theorized to exist in SHA1 by extrapolation, but the probability of collisions occurring simultaneously in both is as "small" as the capacity of ZFS is "large" :)

No.  Collisions in *any* hash function with output smaller than its
input are known to exist by a simple counting argument (you can't fit
kilobytes of information into a 128- or 160-bit bucket).  The tricky
part lies in finding collisions faster than a brute-force search would
find them.
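
To put a number on "small", here is a back-of-the-envelope birthday
bound -- a sketch only, where the pool size and record count are
made-up assumptions:

# Birthday-bound estimate: with n random inputs and a b-bit hash, the chance
# of *any* accidental collision is at most roughly n^2 / 2^(b+1).

def collision_bound(n_blocks, hash_bits):
    """Upper bound on the probability that any two distinct blocks collide."""
    return n_blocks ** 2 / 2 ** (hash_bits + 1)

# Hypothetical pool: 2^48 unique 128K records (tens of exabytes),
# checksummed with sha256.
print(collision_bound(2 ** 48, 256))    # ~3.4e-49

That covers accidental collisions only, and only if sha256 keeps
behaving like a random function; deliberately constructed collisions
are a different story, which is the worry below.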

Last I checked, the cryptographers specializing in hash functions were
pessimistic; the breakthroughs in collision-finding by Wang & crew a
couple of years ago had revealed how little the experts actually knew
about building collision-resistant hash functions.  The advice to
those of us who have come to rely on that property was to migrate now
to sha256/sha512 (notably, ZFS uses sha256, not sha1), and then migrate
again once the cryptographers felt they had a better grip on the
problem; the fear was that the newly discovered attacks would
generalize to sha256.

But there's another way -- design the system so correct behavior doesn't
rely on collisions being impossible to find.

I wouldn't de-duplicate without first verifying that the two blocks or
files really are bitwise identical; if you do this, the
collision-resistance of the hash function becomes far less important
to correctness.
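
Concretely, the write path ends up shaped something like this toy
sketch (Python, purely illustrative; the table and names are invented
and bear no resemblance to actual ZFS code):

import hashlib

# Toy in-memory dedup table: digest -> list of stored blocks sharing that
# digest.  Real code would map digests to on-disk block pointers; the point
# here is only the bitwise verify step, which makes correctness independent
# of how collision-resistant the hash turns out to be.
_store = {}

def dedup_write(block):
    digest = hashlib.sha256(block).digest()
    for existing in _store.get(digest, []):
        if existing == block:        # bitwise compare -- the important part
            return existing          # genuine duplicate: share it, don't copy
    # No bitwise match (hash was new, or only a colliding impostor matched):
    # store a fresh copy under this digest.
    _store.setdefault(digest, []).append(block)
    return block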
                                        - Bill
I'm assuming the de-duplication scheme would be run in much the same manner as scrub currently is under ZFS: that is, infrequently, batched, and interruptible. :-)

Long before we look at deduplication, I'd vote for optimizing the low-hanging fruit: the case where we KNOW that two files are identical (i.e. on copying). See the toy model below.
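
For illustration only (invented names, nothing ZFS-specific): if a copy shares the source's blocks by construction, the two files are identical by definition and no checksumming is needed at all.

# Toy block store where "copying" a file just bumps refcounts on the
# source's blocks instead of duplicating the data.
class BlockStore:
    def __init__(self):
        self.blocks = {}          # block id -> [data, refcount]
        self.next_id = 0

    def write(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = [data, 1]
        return bid

    def clone(self, block_ids):
        for bid in block_ids:
            self.blocks[bid][1] += 1      # share, don't copy
        return list(block_ids)

store = BlockStore()
files = {}
files["a"] = [store.write(b"hello "), store.write(b"world")]
files["b"] = store.clone(files["a"])      # "cp a b" with zero data duplicated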

Oh, and last I looked, there was no consensus on how much overlap there is between the inputs that cause collisions under MD5 and those that cause collisions under SHA1. That is, there is no confidence that the datasets which collide under MD5 will not also collide under SHA. They might, they might not; determining the scope of the overlap is still an open question (a bit like P vs. NP right now). So don't make any assumptions about the validity of using two different checksum algorithms. I think (as Casper said) that, should you need to, you should use SHA to weed out the cases where the checksums differ (since that definitively indicates the files are different), then do a bitwise compare on any that produce the same checksum, to see if they really are the same file.
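
In code form, that pipeline looks roughly like the sketch below (Python, illustrative only; the size pre-filter and chunk size are my own arbitrary choices, not anything Casper specified):

import filecmp
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Group files by size, then by sha256; confirm real duplicates bitwise."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_hash = defaultdict(list)
        for p in same_size:
            h = hashlib.sha256()
            with open(p, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.digest()].append(p)
        # A differing checksum definitively means different files; an equal
        # checksum is only a hint, so compare the bytes before believing it.
        for group in by_hash.values():
            first = group[0]
            for other in group[1:]:
                if filecmp.cmp(first, other, shallow=False):
                    duplicates.append((first, other))
    return duplicates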

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

