> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Peter Taps
>
> I haven't looked at the link that talks about the probability of collision.
> Intuitively, I still wonder how the chances of collision can be so low. We are
> reducing a 4K block to just 256 bits. If the chances of collision are so low,
> *theoretically* it is possible to reconstruct the original block from the 256-bit
> signature by using a simple lookup. Essentially, we would now have world's
> best compression algorithm irrespective of whether the data is text or
> binary. This is hard to digest.
BTW, at work we do a lot of theoretical mathematics, and a few months ago I was given the challenge of exploring exactly the concept you describe: using a hashing algorithm as a form of compression. The conclusion was: you can't reverse-hash to reconstruct unknown original data, but you can do it (theoretically) if you have enough additional information about what constitutes valid original data.

If you have a huge lookup table of all the possible original data blocks, then the hash can only narrow the field to about 2^(N-M) of them as possible candidates, on average, and some additional technique is necessary to figure out precisely which of those is the original data block. (N is the length of the data block in bits, and M is the length of the hash in bits; a toy sketch at the end of this message makes the counting concrete.) Hashing discards some of the original information. In fact, random data is generally incompressible, so if you "compress" random data and end up with something smaller than the original, you can rest assured the original cannot be reconstructed from it.

However, if you know something about the original... For example, if you know the original is a valid text document written in English, then in all likelihood there is only one possible original block fitting that description and yielding the end hash result. Even if more than one original block looks like valid English text and produces the same end hash, it is easy to choose the correct one from context: since you presumably know the previous block and the subsequent block, you just choose the intermediate block which continues to produce valid English grammar at the junctions with the adjacent blocks.

This technique can be applied to most types of clearly structured original data, but it cannot be applied to unstructured or unknown original data. So at best, hashing could be a special-case form of compression. To decompress would require either near-infinite compute hours or a large lookup table to scan all the possible inputs for one which produces the end hash. So besides the fact that hashing is at best a special-case form of compression requiring additional auxiliary information, it's also impractical.

To get this down to something reasonable, I considered using a 48 MB lookup table for a 24-bit block of data (that's 2^24 entries of 24 bits each), or a 16 GB lookup table for a 32-bit block of data (2^32 entries of 32 bits each); the arithmetic is spelled out in the second sketch below. Well, in order to get a compression ratio worth talking about, the hash size would have to be 3 bits or smaller. That's a pretty big lookup table to decompress 3 bits into 24 or 32... And let's face it, 8:1 to 10:1 compression isn't stellar for a text document.

And the final nail in the coffin: in order for this technique to be viable, as mentioned, the original data must be structured. For any set of structured original data, all the information necessary for the reverse-hash to pick the valid block out of the lookup table could instead have been used to build a specialized compression algorithm that is at least as good as the reverse-hash. So reverse-hash decompression is actually the worst-case algorithm for every data type it's capable of working on.

But yes, you're right: it's theoretically possible for specific cases, just not for the general case.
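To make the candidate counting concrete, here's a toy Python sketch. It's purely illustrative (nothing ZFS-specific), and the sizes are scaled way down to N = 16 and M = 8 so the full lookup table fits in memory. It groups every possible block by a truncated SHA-256 value, confirms that each hash value matches roughly 2^(N-M) = 256 candidate blocks, and then applies the kind of crude "printable text" filter the English-text argument relies on:

import hashlib
from collections import defaultdict

N = 16  # block size in bits (2 bytes) -- tiny, so we can enumerate every block
M = 8   # hash size in bits (1 byte)

# The "huge lookup table": group every possible N-bit block by its M-bit hash.
candidates = defaultdict(list)
for value in range(2 ** N):
    block = value.to_bytes(N // 8, "big")
    h = hashlib.sha256(block).digest()[0]  # keep only the first M = 8 bits
    candidates[h].append(block)

sizes = [len(blocks) for blocks in candidates.values()]
print("average candidates per hash value:", sum(sizes) / len(sizes))  # ~2^(N-M) = 256

# Knowing the original is structured (here: printable ASCII) narrows the field...
def looks_like_text(block):
    return all(0x20 <= byte <= 0x7E for byte in block)

target = hashlib.sha256(b"hi").digest()[0]
text_like = [b for b in candidates[target] if looks_like_text(b)]
print(len(text_like), "printable-ASCII candidates share the truncated hash of b'hi'")

...but at this tiny block size the text filter still leaves dozens of candidates. The "only one valid English block" disambiguation only becomes plausible when blocks are long enough that valid English is vanishingly rare among random byte strings.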
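And for completeness, the table-size and compression-ratio arithmetic from above, again as a few purely illustrative lines of Python:

# One entry per possible block, each entry storing the block itself.
def table_bytes(block_bits):
    return (2 ** block_bits) * block_bits // 8

print(table_bytes(24) / 2**20, "MB for 24-bit blocks")  # 48.0
print(table_bytes(32) / 2**30, "GB for 32-bit blocks")  # 16.0

# Compression ratio if the "hash" is only 3 bits:
print(24 / 3, "to 1 for 24-bit blocks")  # 8.0
print(32 / 3, "to 1 for 32-bit blocks")  # ~10.7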