I want to detect near-duplicate web documents. I know of an algorithm
called Winnowing, and of another technique used by Google. I also know
that Solr has a component called MoreLikeThis. Google's page explains
that *mirroring and plagiarism* are easy to detect, but that
near-duplicate detection is a much harder problem.

So my question is: what underlying algorithm does the Solr MoreLikeThis
component use, and can I use it for this purpose?

Otherwise, I will implement a near-duplicate document detection
algorithm within a few days, and I would be proud to contribute it and
integrate it into Solr.
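
To make the Winnowing idea concrete, here is a minimal Python sketch of
the kind of fingerprinting I have in mind. It is not Solr code and it
says nothing about what MoreLikeThis does internally. The function names
(kgram_hashes, winnow, jaccard), the parameters k and w, the MD5-based
hash, and comparing fingerprint sets by Jaccard similarity are all
assumptions made for illustration.

```python
import hashlib
import re


def kgram_hashes(text, k=5):
    """Hash every k-character substring of the normalized text."""
    # Normalization (an assumption): lowercase, keep only letters and digits.
    s = re.sub(r"[^a-z0-9]", "", text.lower())
    return [
        int(hashlib.md5(s[i:i + k].encode("utf-8")).hexdigest()[:8], 16)
        for i in range(len(s) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Return a set of fingerprints: the minimum hash of each window of w hashes."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        # Too short for a full window: fall back to a single fingerprint.
        return {min(hashes)}
    fingerprints = set()
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, pick the rightmost occurrence of the minimum.
        pos = start + (w - 1 - window[::-1].index(smallest))
        if pos != last_pos:
            fingerprints.add(smallest)
            last_pos = pos
    return fingerprints


def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two documents would then be treated as near duplicates when
jaccard(winnow(doc_a), winnow(doc_b)) exceeds some threshold; the
threshold, like k and w, would have to be tuned on real data.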

Thanks,
Furkan KAMACI
