I want to detect near-duplicate documents (specifically web documents). I know there is an algorithm called Winnowing, and there is another technique used by Google. I also know that Solr has a component called MoreLikeThis. Google's page explains that *mirroring and plagiarism* are easy to detect, but that near-duplicate detection involves much more.
So my question is: what is the underlying algorithm that Solr's MoreLikeThis component uses, and can I use it for this purpose? If not, I will implement a near-duplicate document detection algorithm within a few days, and I would be glad to contribute it to Solr. Thanks, Furkan KAMACI