I think this would be nearly equivalent to the Lucene solution that I
mentioned: good for real-time, single-document queries.

I would be very surprised if it could outdo the MR version on the
all-pairs problem.

On Sat, Jul 18, 2009 at 1:30 AM, Miles Osborne <[email protected]> wrote:

> you could probably eliminate phase 2 if the output of phase 1 was stored in
> a Perfect Hashing table (say using Hypertable).  this works by storing a
> fingerprint for each shingle/count pair (a few bits) and organising the
> hash table such that you never get collisions (hence the Perfect Hashing).
>
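
For concreteness, here is a rough sketch of the fingerprint idea described in
the quoted text. It is only an illustration under assumptions: it does not use
Hypertable, the names (build_table, lookup, FINGERPRINT_BITS) are made up for
this example, and the brute-force seed search is practical only for tiny key
sets. A real perfect-hashing scheme would use a proper construction.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

FINGERPRINT_BITS = 8  # assumed width; "a few bits" in the quoted text


def _h(key: str, seed: int, tag: bytes) -> int:
    """Deterministic 64-bit hash of key, varied by seed and a purpose tag."""
    digest = hashlib.blake2b(
        key.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
        person=tag,
    ).digest()
    return int.from_bytes(digest, "little")


def _fingerprint(key: str, seed: int) -> int:
    return _h(key, seed, b"fp") & ((1 << FINGERPRINT_BITS) - 1)


def build_table(
    counts: Dict[str, int], load: float = 0.5, max_seeds: int = 10000
) -> Tuple[List[Optional[Tuple[int, int]]], int]:
    """Search for a seed that maps every shingle to its own slot.

    Each slot stores (fingerprint, count) instead of the shingle itself.
    """
    size = max(1, int(len(counts) / load))
    for seed in range(max_seeds):
        slots: Dict[int, str] = {}
        for key in counts:
            slot = _h(key, seed, b"slot") % size
            if slot in slots:
                break  # collision: try the next seed
            slots[slot] = key
        else:
            table: List[Optional[Tuple[int, int]]] = [None] * size
            for slot, key in slots.items():
                table[slot] = (_fingerprint(key, seed), counts[key])
            return table, seed
    raise RuntimeError("no collision-free seed found; key set too large")


def lookup(
    table: List[Optional[Tuple[int, int]]], seed: int, key: str
) -> Optional[int]:
    """Return the stored count, or None if the slot is empty or the
    fingerprint does not match. Unknown keys can still match by chance,
    with probability about 2**-FINGERPRINT_BITS per occupied slot."""
    entry = table[_h(key, seed, b"slot") % len(table)]
    if entry is None or entry[0] != _fingerprint(key, seed):
        return None
    return entry[1]


if __name__ == "__main__":
    shingle_counts = {"the quick brown": 3, "quick brown fox": 2, "brown fox jumps": 1}
    table, seed = build_table(shingle_counts)
    print(lookup(table, seed, "quick brown fox"))
```

The point of the sketch is that the table never stores the shingles, only a
few fingerprint bits plus the count, so the memory per entry stays small. The
price is an occasional false match for a shingle that was never inserted.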



-- 
Ted Dunning, CTO
DeepDyve
