On Sep 23, 2009, at 2:55 PM, Jason Rutherglen wrote:

I don't think this handles near duplicates, which would require some of
the methods mentioned recently on the Mahout list.

It's pluggable and I believe the TextProfileSignature is a fuzzy implementation in Solr that was brought over from Nutch.
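
To illustrate what "fuzzy" means for a signature, here is a minimal Python sketch of a profile-style hash. It is not Solr's or Nutch's code: the function name `fuzzy_signature` and the parameters `quant_rate` and `min_token_len` are hypothetical, and the quantisation rule is an assumption chosen for illustration. The idea is that rare tokens are dropped and frequencies are rounded down before hashing, so small edits tend to leave the digest unchanged.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, quant_rate=0.01, min_token_len=2):
    """Return an MD5 hex digest of a quantised token-frequency profile.

    Illustrative sketch only; names and defaults are assumptions.
    """
    # Lower-case and keep only alphanumeric runs, so punctuation, case,
    # and whitespace differences do not affect the result.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    if not tokens:
        return hashlib.md5(b"").hexdigest()

    counts = Counter(tokens)
    max_freq = max(counts.values())

    # Quantisation step (assumed rule): a fraction of the top frequency,
    # but at least 2 whenever any token repeats, so one-off words drop out.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant and discard
    # tokens whose frequency falls below one step.
    profile = [(tok, (n // quant) * quant)
               for tok, n in counts.items() if n >= quant]

    # Sort by descending frequency, then token, for a stable ordering.
    profile.sort(key=lambda p: (-p[1], p[0]))

    joined = "\n".join(f"{tok} {freq}" for tok, freq in profile)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


a = "solr solr solr dedup dedup dedup news news story"
b = "Solr solr solr dedup dedup dedup news news article"
c = "lucene lucene mahout mahout"
same = fuzzy_signature(a) == fuzzy_signature(b)
different = fuzzy_signature(a) != fuzzy_signature(c)
```

In this sketch `a` and `b` differ only in case and in a single one-off word, so they share a signature, while `c` has a different profile and does not. An exact hash of the raw text would treat all three as distinct.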

Agree on the Mahout discussion, too, though: 
http://www.lucidimagination.com/search/document/9d7ad3a882e2a944/finding_the_similarity_of_documents_using_mahout_for_deduplication#b0321c0f25f835a0


On Wed, Sep 23, 2009 at 2:59 AM, Shalin Shekhar Mangar
<shalinman...@gmail.com> wrote:
On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut <hbase.user.ni...@gmail.com> wrote:

Hi,
When we crawl news content, we face a problem of the same content being repeated in many documents. We want to add a near-duplicate document filter to detect such documents. Is there a way to do that in Solr?


Look at http://wiki.apache.org/solr/Deduplication

--
Regards,
Shalin Shekhar Mangar.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
