On Sep 23, 2009, at 2:55 PM, Jason Rutherglen wrote:
I don't think this handles near duplicates, which would require some of
the methods mentioned recently on the Mahout list.
It's pluggable and I believe the TextProfileSignature is a fuzzy
implementation in Solr that was brought over from Nutch.
Agree on the Mahout discussion, too, though:
http://www.lucidimagination.com/search/document/9d7ad3a882e2a944/finding_the_similarity_of_documents_using_mahout_for_deduplication#b0321c0f25f835a0
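To make "fuzzy" concrete, here is a rough Python sketch of the general idea behind a text-profile style signature: hash a quantized token-frequency profile rather than the raw text, so small differences in rare words do not change the result. This is not Solr's or Nutch's actual code. The function name `fuzzy_signature`, its parameters, and their defaults are assumptions made for illustration only.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, quant_rate=0.01, min_token_len=2):
    """Hypothetical helper: MD5 of a quantized token-frequency profile."""
    # Lowercase the text and treat runs of letters/digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Ignore very short tokens (length <= min_token_len).
    counts = Counter(t for t in tokens if len(t) > min_token_len)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    max_freq = max(counts.values())
    # Quantization step derived from the most frequent token.
    quant = max(1, round(max_freq * quant_rate))
    if max_freq > 1:
        # Assumption: force a step of at least 2 so one-off tokens drop out.
        quant = max(quant, 2)

    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant  # round down to a multiple of quant
        if quantized >= quant:               # discard tokens that fall below the step
            profile.append((token, quantized))

    # Order by decreasing frequency, then by token for a stable result.
    profile.sort(key=lambda item: (-item[1], item[0]))
    payload = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Two documents that differ only in a word that occurs once would hash to the same value under this sketch, while an exact hash of the full text would treat them as different.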
On Wed, Sep 23, 2009 at 2:59 AM, Shalin Shekhar Mangar
<shalinman...@gmail.com> wrote:
On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut <hbase.user.ni...@gmail.com> wrote:
Hi,
When we have news content crawled, we face the problem of the same content
being repeated in many documents. We want to add a near-duplicate document
filter to detect such documents. Is there a way to do that in Solr?
Look at http://wiki.apache.org/solr/Deduplication
--
Regards,
Shalin Shekhar Mangar.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search