[Haskell] Fingerprints and hashing

Simon Peyton-Jones Thu, 11 Oct 2007 02:42:39 -0700

Dear Haskellers

Here's a challenge.


We are all familiar with the idea of an MD5 checksum, which provides a reliable 
"fingerprint" for a file, usually 128 bits or so.  If the file changes, the 
fingerprint is (almost) certain to do so too.  There are lots of techniques: 
CRC, shar?, MD5, etc.

For various applications (including identifying common sub-expressions, and 
version tracking in GHC), I'd like a Haskell library that supports simple 
fingerprint operations.

Below I give an informal specification of such a library.  I can think of any 
number of implementations, but I'd like one that is (a) robust and (b) 
efficient.  By "robust" I meant that with, say, a 128-bit fingerprint the 
chance of accidental collision is truly low; a non-robust implementation uses 
the 128-bit space in a non-uniform way [trivial example: gets stuck on zero].

Any takers?  I bet some of you have this already, in some form.

Simon



The module signature would be something like this:

  data Fingerprint
  instance Eq Fingerprint
  instance Ord Fingerprint

  initialPrint :: Fingerprint

  class Fingerprintable t where
    fingerPrint :: t -> Fingerprint -> FingerPrint

  instance Fingerprintable Char
  instance Fingerprintable Int
  instance Fingerprintable Word32
  -- etc

The fingerPrint operation here is expressed in the traditional way: an 
accumulator (say 128 bits) into which one stuffs successive values (e.g. a byte 
stream).

However, very important, I also want a more compositional form:

  instance Fingerprintable Fingerprint

so that I can do something like

  fp :: Expr -> Fingerprint
  fp (App e1 e2) = fp e1 `fingerPrint` fp e2

Yes, I could thread the accumulator, thus:

  fp' :: Expr -> Fingerprint -> Fingerprint
  fp' (App e1 e2) p = fp' e1 (fp' e2 p)

but I can't *always* do that; or even if I can, I may already *have* a 
fingerprint for e1 in my hand, and it's painful to re-traverse e1.  Think of 
hash-consing.

An obvious implementation of "instance Fingerprintable Fingerprint" is to 
divide one fingerprint into words, and feed it into the other via the Word32 
instance.  But it's not clear to me whether or not that is robust.


(Incidentally, one could define the class thus:
  class Fingerprintable t where
    fingerPrint :: t -> FingerPrint
plus
  plusFP :: Fingerprint -> Fingerprint -> Fingerprint
But I think it'd be less efficient if that was the only way, because of all 
those now-compulsory plusFP operations.)


I'd like one other thing: to be able to trade cost for accuracy.  I'd like a 
128-bit fingerprint with essentially zero chance of having two distinct values 
that map to the same fingerprint; and a one-word fingerprint that was a lot 
more efficient, but where collisions would be entirely possible (again 
hash-consing is an example).



_______________________________________________
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell

[Haskell] Fingerprints and hashing

Reply via email to