Dear Haskellers
Here's a challenge.
We are all familiar with the idea of an MD5 checksum, which provides a reliable
"fingerprint" for a file, usually 128 bits or so. If the file changes, the
fingerprint is (almost) certain to do so too. There are lots of techniques:
CRC, SHA, MD5, etc.
For various applications (including identifying common sub-expressions, and
version tracking in GHC), I'd like a Haskell library that supports simple
fingerprint operations.
Below I give an informal specification of such a library. I can think of any
number of implementations, but I'd like one that is (a) robust and (b)
efficient. By "robust" I mean that with, say, a 128-bit fingerprint the
chance of accidental collision is truly low; a non-robust implementation uses
the 128-bit space in a non-uniform way [a trivial example: one that gets stuck
on zero].
Any takers? I bet some of you have this already, in some form.
Simon
The module signature would be something like this:
data Fingerprint
instance Eq  Fingerprint
instance Ord Fingerprint

initialPrint :: Fingerprint

class Fingerprintable t where
  fingerPrint :: t -> Fingerprint -> Fingerprint

instance Fingerprintable Char
instance Fingerprintable Int
instance Fingerprintable Word32
-- etc
The fingerPrint operation here is expressed in the traditional way: an
accumulator (say 128 bits) into which one stuffs successive values (e.g. a byte
stream).
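For instance, with the instances above one could fingerprint a whole String by
folding its characters into the accumulator; the name fingerprintString below
is just mine, for illustration:

fingerprintString :: String -> Fingerprint
fingerprintString = foldr fingerPrint initialPrint
-- feeds each Char into the accumulator in turn, starting from initialPrint;
-- use foldl (flip fingerPrint) initialPrint if left-to-right order matters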
However, and this is very important, I also want a more compositional form:
instance Fingerprintable Fingerprint
so that I can do something like
fp :: Expr -> Fingerprint
fp (App e1 e2) = fp e1 `fingerPrint` fp e2
Yes, I could thread the accumulator, thus:
fp' :: Expr -> Fingerprint -> Fingerprint
fp' (App e1 e2) p = fp' e1 (fp' e2 p)
but I can't *always* do that; or even if I can, I may already *have* a
fingerprint for e1 in my hand, and it's painful to re-traverse e1. Think of
hash-consing.
An obvious implementation of "instance Fingerprintable Fingerprint" is to
divide one fingerprint into words, and feed it into the other via the Word32
instance. But it's not clear to me whether or not that is robust.
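Spelled out, that obvious implementation would look something like the code
below, where the four-Word32 representation is only an assumption for the sake
of the sketch:

import Data.Word (Word32)

-- Assumed internal representation: 128 bits as four strict Word32 fields.
data Fingerprint = FP !Word32 !Word32 !Word32 !Word32
  deriving (Eq, Ord)

instance Fingerprintable Fingerprint where
  -- Feed the four words of one fingerprint into the other, reusing the
  -- Word32 instance.
  fingerPrint (FP a b c d) acc =
    fingerPrint d (fingerPrint c (fingerPrint b (fingerPrint a acc)))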
(Incidentally, one could define the class thus:
class Fingerprintable t where
  fingerPrint :: t -> Fingerprint
plus
plusFP :: Fingerprint -> Fingerprint -> Fingerprint
But I think it'd be less efficient if that were the only way, because of all
those now-compulsory plusFP operations.)
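To see where the extra cost comes from, the fingerprintString example from
above would then read something like this (still just a sketch, and still
assuming initialPrint is available):

fingerprintString :: String -> Fingerprint
fingerprintString = foldr (plusFP . fingerPrint) initialPrint
-- every Char is first blown up into a full Fingerprint, and only then
-- combined, so there is one full-width plusFP per character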
I'd like one other thing: to be able to trade cost for accuracy. I'd like a
128-bit fingerprint with essentially zero chance of having two distinct values
that map to the same fingerprint; and a one-word fingerprint that was a lot
more efficient, but where collisions would be entirely possible (again
hash-consing is an example).
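For the cheap end of that spectrum, something like the one-word accumulator
below would be the sort of thing; the names and FNV-style constants are only
illustrative, and real FNV-1a works byte by byte rather than on whole words:

import Data.Bits (xor)
import Data.Word (Word32)

-- One-word fingerprint: cheap to compute and compare, but collisions are
-- entirely possible, so it is only fit for things like hash-consing.
newtype CheapPrint = CheapPrint Word32
  deriving (Eq, Ord)

cheapInitial :: CheapPrint
cheapInitial = CheapPrint 0x811c9dc5             -- FNV-1a 32-bit offset basis

-- Stuff one word into the accumulator with an xor-then-multiply step.
cheapWord32 :: Word32 -> CheapPrint -> CheapPrint
cheapWord32 w (CheapPrint acc) = CheapPrint ((acc `xor` w) * 0x01000193)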