Re: Finding near duplicates which searching Documents

2009-09-24 Thread Grant Ingersoll


On Sep 23, 2009, at 2:55 PM, Jason Rutherglen wrote:


I don't think this handles near duplicates, which would require some of
the methods mentioned recently on the Mahout list.


The deduplication support is pluggable, and I believe TextProfileSignature
is a fuzzy implementation in Solr that was brought over from Nutch.


Agree on the Mahout discussion, too, though: 
http://www.lucidimagination.com/search/document/9d7ad3a882e2a944/finding_the_similarity_of_documents_using_mahout_for_deduplication#b0321c0f25f835a0
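
For readers unfamiliar with the idea, here is a minimal sketch of what a
"fuzzy" signature can look like. It is an assumption-laden illustration in
Python, not the code of TextProfileSignature or of anything in Solr, Nutch,
or Mahout: the function name fuzzy_signature and the parameters quant_rate
and min_token_len are hypothetical. The idea shown is to hash a coarse
profile of token frequencies instead of the raw text, so that small edits to
rare words tend to leave the signature unchanged.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, quant_rate=0.01, min_token_len=2):
    """Hash a quantized token-frequency profile of `text` (illustrative only)."""
    # Lowercase and keep alphanumeric runs of at least min_token_len characters.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Derive a quantization step from the most frequent token.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant and drop tokens
    # whose rounded frequency falls to zero (the rare ones).
    profile = {}
    for token, freq in counts.items():
        rounded = (freq // quant) * quant
        if rounded >= quant:
            profile[token] = rounded

    # Sort deterministically: most frequent first, ties broken by token.
    ordered = sorted(profile.items(), key=lambda item: (-item[1], item[0]))
    digest = hashlib.md5()
    for token, freq in ordered:
        digest.update(f"{token} {freq}\n".encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    a = "solr solr solr lucene lucene index index breaking news today"
    b = "solr solr solr lucene lucene index index breaking news yesterday update"
    print(fuzzy_signature(a) == fuzzy_signature(b))
```

Two crawled copies of a story that differ only in a few words that each
occur once would get the same signature under this sketch, while an exact
hash of the full text would differ. How aggressive the quantization should
be is a tuning question that depends on the content.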



On Wed, Sep 23, 2009 at 2:59 AM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut hbase.user.ni...@gmail.com 
wrote:



Hi,
When we have news content crawled we face a problem of the same content being
repeated in many documents.  We want to add a near duplicate document filter
to detect such documents. Is there a way to do that in SOLR?



Look at http://wiki.apache.org/solr/Deduplication

--
Regards,
Shalin Shekhar Mangar.



--
Grant Ingersoll
http://www.lucidimagination.com/



Finding near duplicates which searching Documents

2009-09-23 Thread Ninad Raut
Hi,
When we have news content crawled we face a problem of the same content being
repeated in many documents.  We want to add a near duplicate document filter
to detect such documents. Is there a way to do that in SOLR?
Regards,
Ninad Raut.


Re: Finding near duplicates which searching Documents

2009-09-23 Thread Shalin Shekhar Mangar
On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut hbase.user.ni...@gmail.com wrote:

 Hi,
 When we have news content crawled we face a problem of the same content being
 repeated in many documents.  We want to add a near duplicate document filter
 to detect such documents. Is there a way to do that in SOLR?


Look at http://wiki.apache.org/solr/Deduplication

-- 
Regards,
Shalin Shekhar Mangar.


Re: Finding near duplicates which searching Documents

2009-09-23 Thread Ninad Raut
Is this feature included in Solr 1.4?

On Wed, Sep 23, 2009 at 3:29 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut hbase.user.ni...@gmail.com
 wrote:

  Hi,
  When we have news content crawled we face a problem of the same content being
  repeated in many documents.  We want to add a near duplicate document filter
  to detect such documents. Is there a way to do that in SOLR?
 

 Look at http://wiki.apache.org/solr/Deduplication

 --
 Regards,
 Shalin Shekhar Mangar.



Re: Finding near duplicates which searching Documents

2009-09-23 Thread Shalin Shekhar Mangar
On Wed, Sep 23, 2009 at 3:50 PM, Ninad Raut hbase.user.ni...@gmail.com wrote:

 Is this feature included in Solr 1.4?


Yep.

-- 
Regards,
Shalin Shekhar Mangar.