> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Peter Taps
> 
> I haven't looked at the link that talks about the probability of
> collision.  Intuitively, I still wonder how the chances of collision can
> be so low.  We are reducing a 4K block to just 256 bits.  If the chances
> of collision are so low, *theoretically* it is possible to reconstruct
> the original block from the 256-bit signature by using a simple lookup.
> Essentially, we would now have the world's best compression algorithm,
> irrespective of whether the data is text or binary.  This is hard to
> digest.

BTW, at work we do a lot of theoretical mathematics, and one day a few
months ago, I was given the challenge to explore the concept of using a
hashing algorithm as a form of compression, exactly as you said.  The
conclusion was:  You can't reverse-hash in order to reconstruct unknown
original data, but you can do it (theoretically) if you have enough
additional information about what constitutes valid original data.  If you
have a huge lookup table of all 2^N possible original data blocks, then the
hash only narrows the field to roughly 2^(N-M) of them as candidates (on
average), and some additional technique is necessary to figure out precisely
which one of those is the original data block.  (N is the length of the data
block in bits, and M is the length of the hash, in bits.)
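
To make that concrete, here is a minimal Python sketch (toy parameters; a
truncated SHA-256 stands in for a generic M-bit hash) that enumerates every
16-bit block and counts how many land on each 8-bit hash value:

    # Toy demonstration: on average, 2^(N-M) blocks share each M-bit hash value.
    # N and M are deliberately tiny so that full enumeration is feasible.
    import hashlib
    from collections import Counter

    N = 16  # block size in bits
    M = 8   # hash size in bits

    def tiny_hash(block):
        """SHA-256 of an N-bit block, truncated to its first M bits (M = 8 here)."""
        digest = hashlib.sha256(block.to_bytes(N // 8, "big")).digest()
        return digest[0]

    buckets = Counter(tiny_hash(b) for b in range(2 ** N))
    print(sum(buckets.values()) / len(buckets))  # ~256.0, i.e. 2^(16-8) candidates

Every one of the 2^16 blocks hashes to something, so each of the 2^8 hash
values collects about 256 candidate blocks; the hash alone cannot tell you
which candidate was the original.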

Hashing discards some of the original information.  In fact, random data is
generally incompressible, so if you try to compress random data and end up
with something smaller than the original, you can rest assured you're not
able to reconstruct it.  However, if you know something about the original...
For example, if you know the original is a valid text document written in
English, then in all likelihood there is only one possible original block
fitting that description and yielding the end hash result.  Even if there is
more than one original block which looks like valid English text and
produces the same end hash, it is easy to choose which one is correct based
on context...  Since you presumably know the previous block and the
subsequent block, you just choose the intermediate block which seamlessly
continues to produce valid English grammar at the junctions with adjacent
blocks.  This technique can be applied to most types of clearly structured
original data, but it cannot be applied to unstructured or unknown original
data.  So at best, hashing could be a special-case form of compression.
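
As a toy illustration of that special case (Python; the wordlist below is a
hypothetical stand-in for "structured data"):

    import hashlib

    # Hypothetical structured domain: the original is known to be one of these.
    WORDLIST = ["storage", "checksum", "snapshot", "dedup", "mirror", "scrub"]

    def reverse_hash(target):
        """Return every candidate in the structured domain matching the hash."""
        return [w for w in WORDLIST
                if hashlib.sha256(w.encode()).digest() == target]

    target = hashlib.sha256(b"dedup").digest()
    print(reverse_hash(target))  # ['dedup'] -- unique within this tiny domain

The structure (the wordlist) carries most of the information; the hash merely
selects among the valid candidates, which is exactly the limitation described
above.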

To decompress would require either near-infinite compute hours or a large
lookup table, scanning all the possible inputs to find one which produces the
end hash...  So besides the fact that hashing is at best a special-case form
of compression requiring additional auxiliary information, it's also
impractical.  To get the table down to something reasonable, I considered using a
48MB lookup table for a 24-bit block of data (that's 2^24 entries of 24 bits
each), or a 16GB lookup table for a 32-bit block of data (2^32 entries of 32
bits each).  Well, in order to get a compression ratio worth talking about,
the hash size would have to be 3 bits or smaller.  That's a pretty big lookup
table to decompress 3 bits into 24 or 32...  And let's face it ... 8:1 or
~11:1 compression isn't stellar for a text document.
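
For reference, the arithmetic behind those figures (a quick Python check;
table_size_bytes is just an illustrative helper):

    # 2^n entries of n bits each, expressed in bytes.
    def table_size_bytes(n_bits):
        return (2 ** n_bits) * n_bits // 8

    for n_bits in (24, 32):
        size = table_size_bytes(n_bits)
        ratio = n_bits / 3  # compressing n bits down to a 3-bit hash
        print(f"{n_bits}-bit blocks: {size / 2**20:,.0f} MiB table, {ratio:.1f}:1")
    # 24-bit blocks: 48 MiB table, 8.0:1
    # 32-bit blocks: 16,384 MiB table, 10.7:1  (i.e. 16 GiB)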

And the final nail in the coffin was:  In order for this technique to be
viable, as mentioned, the original data must be structured.  For any set of
structured original data, all the information necessary for the reverse-hash
to pick the valid block out of the lookup table could instead have been used
to create a specialized compression algorithm that is as good as or better
than the reverse-hash.

So reverse-hash decompression is actually the worst-case algorithm for all
the data types it's capable of working on.

But yes, you're right: it's theoretically possible in specific cases, but
not in the general case.
