Re: Finding near duplicates while searching Documents
On Sep 23, 2009, at 2:55 PM, Jason Rutherglen wrote:
I don't think this handles near duplicates, which would require some of the methods mentioned recently on the Mahout list.
It's pluggable, and I believe the TextProfileSignature is a fuzzy implementation in Solr that was brought over from Nutch. Agree on the Mahout discussion, too, though: http://www.lucidimagination.com/search/document/9d7ad3a882e2a944/finding_the_similarity_of_documents_using_mahout_for_deduplication#b0321c0f25f835a0
On Wed, Sep 23, 2009 at 2:59 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut hbase.user.ni...@gmail.com wrote:
Hi, When we crawl news content, we face the problem of the same content being repeated in many documents. We want to add a near-duplicate document filter to detect such documents. Is there a way to do that in SOLR?
Look at http://wiki.apache.org/solr/Deduplication
-- Regards, Shalin Shekhar Mangar.
-- Grant Ingersoll http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Finding near duplicates while searching Documents
Hi, When we crawl news content, we face the problem of the same content being repeated in many documents. We want to add a near-duplicate document filter to detect such documents. Is there a way to do that in SOLR? Regards, Ninad Raut.
Re: Finding near duplicates while searching Documents
On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut hbase.user.ni...@gmail.com wrote:
Hi, When we crawl news content, we face the problem of the same content being repeated in many documents. We want to add a near-duplicate document filter to detect such documents. Is there a way to do that in SOLR?
Look at http://wiki.apache.org/solr/Deduplication
-- Regards, Shalin Shekhar Mangar.
Re: Finding near duplicates while searching Documents
Is this feature included in SOLR 1.4?
On Wed, Sep 23, 2009 at 3:29 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut hbase.user.ni...@gmail.com wrote:
Hi, When we crawl news content, we face the problem of the same content being repeated in many documents. We want to add a near-duplicate document filter to detect such documents. Is there a way to do that in SOLR?
Look at http://wiki.apache.org/solr/Deduplication
-- Regards, Shalin Shekhar Mangar.
Re: Finding near duplicates while searching Documents
On Wed, Sep 23, 2009 at 3:50 PM, Ninad Raut hbase.user.ni...@gmail.com wrote:
Is this feature included in SOLR 1.4?
Yep.
-- Regards, Shalin Shekhar Mangar.
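To illustrate what a "fuzzy" signature in the spirit of the TextProfileSignature mentioned in this thread might look like, here is a minimal Python sketch. It is not Solr's or Nutch's code: the function name fuzzy_signature, its parameters (min_token_len, quant_rate), and the exact quantization rule are assumptions made for illustration. The idea is to hash a coarse term-frequency profile rather than the raw text, so that small edits to rare words do not change the hash.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, min_token_len=2, quant_rate=0.01):
    """Return an MD5 hex digest of a quantized term-frequency profile.

    Hypothetical helper for illustration; not the Solr implementation.
    """
    # Lowercase, keep alphanumeric runs, and drop very short tokens.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step derived from the most frequent token (assumed rule).
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant; drop rare tokens.
    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:
            profile.append((token, quantized))

    # Sort by descending frequency, then token, so the hash is deterministic.
    profile.sort(key=lambda item: (-item[1], item[0]))
    blob = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()
```

Under these assumptions, two crawled copies of a story that differ only in a word that occurs once (a byline or agency name, say) get the same signature, while an unrelated story gets a different one. Very short texts in which no word repeats keep every token in the profile, so for them the signature behaves like an exact hash of the word set.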