cpreston <[EMAIL PROTECTED]>:
> As promised, I looked into applying the Birthday Paradox
> logic to de-duplication. I blogged about my results here:
>
> http://www.backupcentral.com/content/view/145/47/
>
> Long and short of it: If you've got less than 95 Exabytes of
> data, I think you'll be OK.

One of us still doesn't understand this. :-)

Your blog raises a red herring by misunderstanding or misrepresenting the applicability of the Birthday Paradox. The number of possible values in BP is 366; there is no data reduction in it, no key values. An algorithm that reduced the 366 possibilities in the same proportion that hashing reduces 8KB to 160 bits would yield keys far smaller than one bit, an absurdity. That absurdity should show that even if the reduction stopped at eight bits, one short of the nine bits required to hold 1-366, there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun 30 all represented by the same code, in which case you can't figure out whether people in the room have the same birthday.

What you must grasp is that it is *impossible* to represent/re-create/look up the 2^65536 possible values of an 8KB (65536-bit) block in fewer than 65536 bits--unless you concede that each checksum/hash/fingerprint will represent many different values of the original data--any more than you can represent three bits of data with two.

Hashing is a technique for saving time in certain circumstances. It is valueless in re-creating (and a lookup is a re-creation) original data when those data can have unlimited arbitrary values. All the blog hand-waving about decimal places, Zettabytes and the specious comparison to undetected write errors will not change that.

What _would_ be a useful exercise for the reader is to discover how many unique values of 8KB are, on average, represented by a given 160-bit checksum/hash/fingerprint.
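
For anyone who wants to try the exercise, here is a rough sketch of the counting argument in Python. It assumes only the figures already mentioned in this thread (8KB blocks, 160-bit fingerprints, 366 birthdays squeezed into eight bits); the function names are made up for illustration, and the result is reported as a power of two because the number itself runs to tens of thousands of decimal digits.

```python
BLOCK_BITS = 8 * 1024 * 8   # an 8KB block is 65536 bits
HASH_BITS = 160             # width of the checksum/hash/fingerprint


def log2_blocks_per_hash(block_bits: int, hash_bits: int) -> int:
    """Return log2 of the average number of distinct blocks per hash value.

    There are 2**block_bits possible blocks and 2**hash_bits possible
    hash values, so on average 2**(block_bits - hash_bits) different
    blocks map to each hash value.
    """
    return block_bits - hash_bits


def min_values_sharing_a_code(values: int, code_bits: int) -> int:
    """Pigeonhole bound: some code must stand for at least this many values."""
    codes = 2 ** code_bits
    return -(-values // codes)  # integer ceiling of values / codes


if __name__ == "__main__":
    exponent = log2_blocks_per_hash(BLOCK_BITS, HASH_BITS)
    print(f"average 8KB values per 160-bit hash: 2^{exponent}")

    # 366 possible birthdays forced into an eight-bit code (256 codes)
    shared = min_values_sharing_a_code(366, 8)
    print(f"366 birthdays in 8 bits: some code covers at least {shared} dates")
```

This only counts possibilities; it says nothing about how likely any two blocks actually stored on a given system are to share a fingerprint.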