Simon Peyton Jones writes: > We are all familiar with the idea of an MD5 checksum, which provides a > reliable "fingerprint" for a file, usually 128 bits or so...
> For various applications (including identifying common sub-expressions, and > version tracking in GHC), I'd like a Haskell library that supports simple > fingerprint operations... > > Below I give an informal specification of such a library. I can think of > any number of implementations, but I'd like one that is (a) robust and (b) > efficient. By "robust" I meant that with, say, a 128-bit fingerprint the > chance of accidental collision is truly low; a non-robust implementation > uses the 128-bit space in a non-uniform way [trivial example: gets stuck on > zero]... > The module signature would be something like this: > > data Fingerprint > initialPrint :: Fingerprint > > class Fingerprintable t where > fingerPrint :: t -> Fingerprint -> FingerPrint > > instance Fingerprintable Char > instance Fingerprintable Int > instance Fingerprintable Word32 > -- etc > > The fingerPrint operation here is expressed in the traditional way: an > accumulator (say 128 bits) into which one stuffs successive values (e.g. a > byte stream). > > However, very important, I also want a more compositional form: > > instance Fingerprintable Fingerprint > > so that I can do something like > > fp :: Expr -> Fingerprint > fp (App e1 e2) = fp e1 `fingerPrint` fp e2 [For Salil anad Michael: apply the fingerPrint function to a fingerprint] > An obvious implementation of "instance Fingerprintable Fingerprint" is to > divide one fingerprint into words, and feed it into the other via the > Word32 instance. But it's not clear to me whether or not that is robust... > > I'd like one other thing: to be able to trade cost for accuracy. I'd like > a 128-bit fingerprint with essentially zero chance of having two distinct > values that map to the same fingerprint; and a one-word fingerprint that > was a lot more efficient, but where collisions would be entirely possible > (again hash-consing is an example). Simon, If I wanted this functionality, I'd call in the professionals (like Salil Vadhan, Michael Rabin, and Andrei Broder, whom I have cc'ed on this message). You sent this mail to the Haskell list, but I think the Haskell content of this problem is trivial compared to the algorithmic problems you pose. I remember, for example, a problem in the distant past of SRC Modula-3, where fingerprinting did not behave as expected because the source of the fingerprint module was itself fingerprinted. Here are some algorithmic references: http://tinyurl.com/yoeusg, http://tinyurl.com/2a2yfj - Modula-3 code for compositional 64-bit fingerprints, polynomial chosen by (literally) flipping a coin http://citeseer.ist.psu.edu/broder93some.html - Andrei's paper on applications of fingerprinting How does laziness play into this problem? If you don't mind building up a lot of thunks, you can probably just use several fingerprinting algorithms in parallel and then pull just the one you need... Norman _______________________________________________ Haskell mailing list [email protected] http://www.haskell.org/mailman/listinfo/haskell
