Dear Haskellers,

Here's a challenge.
We are all familiar with the idea of an MD5 checksum, which provides a reliable "fingerprint" for a file, usually 128 bits or so. If the file changes, the fingerprint is (almost) certain to change too. There are lots of techniques: CRC, shar?, MD5, etc.

For various applications (including identifying common sub-expressions, and version tracking in GHC), I'd like a Haskell library that supports simple fingerprint operations. Below I give an informal specification of such a library. I can think of any number of implementations, but I'd like one that is (a) robust and (b) efficient. By "robust" I mean that with, say, a 128-bit fingerprint the chance of accidental collision is truly low; a non-robust implementation uses the 128-bit space in a non-uniform way (trivial example: it gets stuck on zero).

Any takers? I bet some of you have this already, in some form.

Simon

The module signature would be something like this:

    data Fingerprint
    instance Eq Fingerprint
    instance Ord Fingerprint

    initialPrint :: Fingerprint

    class Fingerprintable t where
      fingerPrint :: t -> Fingerprint -> Fingerprint

    instance Fingerprintable Char
    instance Fingerprintable Int
    instance Fingerprintable Word32
    -- etc

The fingerPrint operation here is expressed in the traditional way: an accumulator (say 128 bits) into which one stuffs successive values (e.g. a byte stream). However, very importantly, I also want a more compositional form:

    instance Fingerprintable Fingerprint

so that I can do something like

    fp :: Expr -> Fingerprint
    fp (App e1 e2) = fp e1 `fingerPrint` fp e2

Yes, I could thread the accumulator, thus:

    fp' :: Expr -> Fingerprint -> Fingerprint
    fp' (App e1 e2) p = fp' e1 (fp' e2 p)

but I can't *always* do that; or even if I can, I may already *have* a fingerprint for e1 in my hand, and it's painful to re-traverse e1. Think of hash-consing.

An obvious implementation of "instance Fingerprintable Fingerprint" is to divide one fingerprint into words, and feed it into the other via the Word32 instance.
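To make the intended interface concrete, here is a minimal runnable sketch. Everything beyond the signature above is my own assumption: I've picked 64-bit FNV-1a as the mixing function purely for illustration (a real library would want a 128-bit state and a stronger mixer), and the names mixByte, Expr, and fp are invented for the example. It shows the accumulator style, the compositional instance for Fingerprint itself, and the fp function over a toy Expr type.

```haskell
{-# LANGUAGE BangPatterns #-}
-- Toy sketch of the proposed interface; FNV-1a is an illustrative
-- choice of mixer, not a recommendation.
module FingerprintSketch where

import Data.Bits (shiftR, xor, (.&.))
import Data.Char (ord)
import Data.Word (Word32, Word64)

newtype Fingerprint = Fingerprint Word64
  deriving (Eq, Ord, Show)

initialPrint :: Fingerprint
initialPrint = Fingerprint 0xcbf29ce484222325   -- FNV-1a offset basis

-- Feed one byte into the accumulator (one FNV-1a step).
mixByte :: Word64 -> Word64 -> Word64
mixByte !acc b = (acc `xor` b) * 0x100000001b3   -- FNV-1a prime

class Fingerprintable t where
  fingerPrint :: t -> Fingerprint -> Fingerprint

instance Fingerprintable Char where
  fingerPrint c (Fingerprint acc) =
    Fingerprint (mixByte acc (fromIntegral (ord c)))

instance Fingerprintable Word32 where
  fingerPrint w (Fingerprint acc) =
    Fingerprint (foldl mixByte acc
      [ fromIntegral (w `shiftR` s) .&. 0xff | s <- [0, 8, 16, 24] ])

instance Fingerprintable Int where
  fingerPrint n fp =
    let w = fromIntegral n :: Word64
    in  fingerPrint (fromIntegral (w `shiftR` 32) :: Word32)
          (fingerPrint (fromIntegral w :: Word32) fp)

-- The "obvious implementation": split one fingerprint into Word32s
-- and feed them into the other via the Word32 instance.
instance Fingerprintable Fingerprint where
  fingerPrint (Fingerprint w) fp =
    fingerPrint (fromIntegral (w `shiftR` 32) :: Word32)
      (fingerPrint (fromIntegral w :: Word32) fp)

-- A toy expression type, to show the compositional style in action.
data Expr = Var Char | App Expr Expr

fp :: Expr -> Fingerprint
fp (Var c)     = fingerPrint c initialPrint
fp (App e1 e2) = fp e1 `fingerPrint` fp e2
```

Note that fp never re-traverses a sub-expression: if a fingerprint for e1 is already in hand (hash-consing), it is combined directly via the Fingerprint instance.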
But it's not clear to me whether or not that is robust.

(Incidentally, one could define the class thus:

    class Fingerprintable t where
      fingerPrint :: t -> Fingerprint

plus

    plusFP :: Fingerprint -> Fingerprint -> Fingerprint

But I think it'd be less efficient if that were the only way, because of all those now-compulsory plusFP operations.)

I'd like one other thing: to be able to trade cost for accuracy. I'd like a 128-bit fingerprint with essentially zero chance of two distinct values mapping to the same fingerprint; and a one-word fingerprint that was a lot more efficient, but where collisions would be entirely possible (again, hash-consing is an example).

_______________________________________________
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell