You can use Nutch's TextProfileSignature to create a fuzzy rather than exact
signature for pages. Pages whose text differs only slightly then get the same
signature, so dedup can remove some near duplicates as well.
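
To give a rough idea of why such a signature tolerates small differences, here
is a minimal Python sketch of a quantized term-frequency profile. It is only an
approximation written for illustration, not Nutch's actual code: the function
name, the ASCII-only tokenizing and the exact rounding are my assumptions. As
far as I know, in Nutch itself you select the class by setting
db.signature.class to org.apache.nutch.crawl.TextProfileSignature in
nutch-site.xml.

```python
import hashlib
import re
from collections import Counter


def text_profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Hypothetical helper: hash a quantized term-frequency profile of text.

    Parameter names and defaults are assumptions chosen for this sketch.
    """
    # Lowercase, split on anything that is not an ASCII letter or digit,
    # and drop tokens shorter than min_token_len.
    tokens = [
        t for t in re.split(r"[^a-z0-9]+", text.lower())
        if len(t) >= min_token_len
    ]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step derived from the most frequent token.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant; tokens that fall
    # below one quant (the rare ones) are dropped from the profile.
    profile = []
    for token, count in counts.items():
        rounded = (count // quant) * quant
        if rounded >= quant:
            profile.append((token, rounded))

    # Order by decreasing frequency (ties broken alphabetically) and hash.
    profile.sort(key=lambda p: (-p[1], p[0]))
    joined = "\n".join(f"{tok} {cnt}" for tok, cnt in profile)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


page_a = "nutch crawls pages and nutch indexes pages for search updated monday"
page_b = "nutch crawls pages and nutch indexes pages for search updated tuesday"
page_c = "solr stores documents and solr answers queries"

same = text_profile_signature(page_a) == text_profile_signature(page_b)
different = text_profile_signature(page_a) != text_profile_signature(page_c)
```

Here page_a and page_b differ only in a word that occurs once, so it drops out
of the profile and both pages hash to the same value, while page_c gets a
different one. The flip side is that pages sharing their frequent terms can
collide even when they are not really duplicates, so it is worth checking on a
sample of your crawl first.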
 
-----Original message-----
> From:parnab kumar <parnab.2...@gmail.com>
> Sent: Sat 23-Jun-2012 10:42
> To: user@nutch.apache.org
> Subject: Near Duplicate Detection in nutch /Solr
> 
> Hi,
> 
> I have crawled and indexed around 2.5 million web pages. However,
> almost 30% of the pages are near duplicates. Is there any functionality
> in Solr or Nutch to remove those near duplicates from the index? The Nutch
> dedup command only handles exact duplicates, I guess. Removing only exact
> duplicates won't serve my purpose.
>      Please help / advise me on how to address the problem.
> 
> Thanks,
> Parnab
> 
