MinHash clustering is important for duplicate detection.  You could also
do simhash clustering
<http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf>, which
might be simpler to implement.  I can imagine an implementation where
the map generates the simhash and emits multiple copies keyed on parts of the
hash.  Near duplicates would thus be grouped together in the reduce, where
in-memory duplicate detection could be done.
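As a rough sketch of what I mean (in Python rather than Hadoop Java, and with arbitrary choices of a 64-bit fingerprint, MD5 as the token hash, and four 16-bit bands as the map keys), the map side could look something like this:

```python
import hashlib

def simhash(tokens, bits=64):
    # Charikar-style simhash: each token hashes to `bits` bits; bit i of
    # the fingerprint is 1 when more token hashes set bit i than clear it.
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def map_emit(doc_id, text, bands=4, bits=64):
    # Emit one copy of the document per band of the fingerprint, keyed on
    # (band index, band value).  Near duplicates differ in few bits, so
    # they usually agree on at least one band and meet in the same reduce.
    fp = simhash(text.split(), bits)
    width = bits // bands
    mask = (1 << width) - 1
    for b in range(bands):
        yield (b, (fp >> (b * width)) & mask), (doc_id, fp)

def hamming(a, b):
    # Reduce side: confirm candidates by Hamming distance on fingerprints.
    return bin(a ^ b).count("1")
```

The reduce would then compare the (hopefully small) group sharing a band key pairwise with `hamming` and keep pairs under some threshold.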

Sean's intuitions about project size are generally better than mine, but if
you are strict about not increasing scope, I think you could succeed with
this project.

On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan <prodan.crist...@gmail.com> wrote:

> IDEA 1 - MinHash clustering
> ---------------------------
> The first idea came after taking a look at Google's paper on collaborative
> filtering for their news system [2]. In that paper, I looked at MinHash
> clustering.
> My first question is: is MinHash clustering considered cool? If yes, then
> I would like to take a stab at implementing it.
> The paper also describes the implementation in a MapReduce style. Since
> this is only a suggestion, I will not elaborate very much on the solution
> now. I would like to ask you whether this might be considered a good choice
> (i.e. important for the framework to have something like this) and if this
> is a big enough project.
>
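For reference, the core of MinHash itself is small. A minimal sketch (again Python, with an arbitrary choice of 8 random linear hash functions over a large prime; the Google News paper's scheme concatenates several such minima into a cluster key):

```python
import random

def minhash_signature(tokens, num_hashes=8, seed=42, prime=2**61 - 1):
    # Min-wise hashing: for each random linear hash function, keep the
    # minimum hash value over the token set.  Two sets agree in a given
    # signature slot with probability equal to their Jaccard similarity.
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return tuple(min((a * hash(t) + b) % prime for t in tokens)
                 for a, b in coeffs)
```

In the MapReduce formulation, the map emits each item keyed on (a prefix of) its signature and the reduce treats each key group as a candidate cluster.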
