Simon Peyton-Jones wrote:

For various applications (including identifying common sub-expressions,
> and version tracking in GHC), I'd like a Haskell library that supports
> simple fingerprint operations.

Note that for CSE and hash-consing, there is a no-collision yet very efficient way of fingerprinting (with a drawback, of course :).

The starting point is the question "why does cryptographic hashing work"? I mean, uniquely serializing (gödel numbering) a tree like

  newtype Exp = Exp (ExpF Exp)
  data ExpF a = Zero | One | Add a a | Mult a a
              deriving (Functor, Foldable, Traversable)

takes an amount of bits proportional to its size (number of constructors). So, how can we expect a simple 128-bit number to have only few collisions for a collection of medium-sized trees? The answer is that the collection is much too small, no computer will ever be able to count (or construct said trees) from 1 to 2^128 in eons.

So, the idea is to use a "local gödel numbering" and uniquely number the and only the trees that are actually constructed (no collisions, but few in numbers). In other words, every new tree created gets the gödel number size collection + 1 and we can simply use

  type GödelNumber = Word32

assuming that no more than 4G of trees will ever be constructed. For fast hash-consing, the collection itself is a (generalized) trie

  type Collection  = ExpF GödelNumber ~~> Maybe GödelNumber

mapping each new top of a tree (constructor + gödel numbers for already known subtrees) to either Just its gödel number or Nothing when it's not in the collection yet.

With this, CSE becomes a catamorphism that allocates new gödel numbers if necessary

  type StateC a = Collection -> (Collection, a) deriving (Monad)

  cse :: Exp -> StateC GödelNumber
  cse = cata hashCons

  hashCons :: ExpF (StateC GödelNumber) -> StateC GödelNumber
  hashCons x0 = \c0 -> let (c,x) = sequence x0 c0 in
       case lookup x c of
         Maybe g -> (c,g)
         Nothing -> let g = size c + 1; c' = insert x g c in (c',g)
     where
     sequence :: ExpF (StateC GödelNumber) -> StateC (ExpF GödelNumber)
     sequence = Data.Traversable.sequence

CSE in the sense that all common subexpressions now have the same gödel number.

Of course, the drawback compared to "free-form" fingerprinting is that the fingerprinting and the collection now depend on each other. But for CSE, we have to carry the collection around anyway.


I don't know any references for that method since I came up with it myself and haven't searched around yet. Any pointers?

Regards,
apfelmus

_______________________________________________
Haskell mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell

Reply via email to