Near Duplicate Document Detection at Solr

Furkan KAMACI Sun, 22 Sep 2013 12:03:21 -0700

I want to detect near-duplicate documents (specifically, web documents). I
know that there is an algorithm called Winnowing, and that Google uses
another technique. However, I also know that Solr has a component called
MoreLikeThis. Google's page explains that *mirroring and plagiarism* are
easy to detect, but that near-duplicate detection is a much harder problem.
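
To make the Winnowing idea mentioned above concrete, here is a rough sketch
of how I understand it: hash every k-character substring, slide a window
over the hashes, and keep the minimum of each window as a fingerprint.
Everything specific here is an assumption for illustration only: the
function names, k=5, window=4, the use of CRC32 as the hash, the
lowercase/strip-whitespace normalization, and the Jaccard comparison at the
end. It is a simplified variant (it keeps hash values only, not positions)
and has nothing to do with how Solr or MoreLikeThis works.

```python
import zlib


def kgram_hashes(text, k=5):
    """Hash every k-character substring of the normalized text."""
    # Assumed normalization: lowercase and remove all whitespace.
    normalized = "".join(text.lower().split())
    return [zlib.crc32(normalized[i:i + k].encode("utf-8"))
            for i in range(len(normalized) - k + 1)]


def winnow(hashes, window=4):
    """Return the set of minimum hash values, one per sliding window."""
    if not hashes:
        return set()
    if len(hashes) < window:
        # Too short for a full window: fall back to the overall minimum.
        return {min(hashes)}
    return {min(hashes[start:start + window])
            for start in range(len(hashes) - window + 1)}


def resemblance(a, b, k=5, window=4):
    """Jaccard overlap of the two fingerprint sets, between 0.0 and 1.0."""
    fa = winnow(kgram_hashes(a, k), window)
    fb = winnow(kgram_hashes(b, k), window)
    union = fa | fb
    if not union:
        return 0.0  # nothing to compare
    return len(fa & fb) / len(union)
```

Two documents would then be treated as near duplicates when resemblance()
exceeds some threshold, which would have to be tuned on real data.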


So I would like to ask: what is the underlying algorithm that the Solr
MoreLikeThis component uses, and can I use it for this kind of purpose?

Otherwise, I will implement an algorithm for near-duplicate document
detection within a few days, and I will be proud to contribute it and
integrate it into Solr.

Thanks,
Furkan KAMACI
