The deduplication should be replaced and done in a backend-neutral way, i.e. by finding duplicates using only the information in the crawldb / webtable and then sending the corresponding deletions to the backend. The current implementation is SOLR-specific and inefficient. This has been discussed on several occasions, but no one has found the time to work on it so far. It would then be possible to do it per batch ID, I suppose.
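To illustrate the idea, here is a minimal, backend-neutral sketch in Java. The `Page` record and `findDeletions` method are hypothetical stand-ins (not actual Nutch or Gora APIs): duplicates are grouped by content signature using only fields that would be available in the crawldb / webtable, the "best" page per group is kept, and the rest are collected as keys to delete from whatever indexing backend is in use.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: backend-neutral deduplication driven purely by
// crawldb / webtable fields (url, signature, score, fetch time).
public class DedupSketch {
    // Stand-in for a crawldb / webtable row; not a real Nutch class.
    public record Page(String url, String signature, float score, long fetchTime) {}

    // Returns the URLs that should be deleted from the indexing backend.
    public static List<String> findDeletions(List<Page> pages) {
        Map<String, List<Page>> bySignature = pages.stream()
                .collect(Collectors.groupingBy(Page::signature));
        List<String> deletions = new ArrayList<>();
        for (List<Page> group : bySignature.values()) {
            // Keep the highest-scoring page; break ties by latest fetch time.
            Page keep = group.stream()
                    .max(Comparator.comparingDouble(Page::score)
                            .thenComparingLong(Page::fetchTime))
                    .get();
            for (Page p : group) {
                if (p != keep) deletions.add(p.url());
            }
        }
        return deletions;
    }
}
```

Because the decision is made entirely from crawldb data, the same deletion list could be sent to Solr, Elasticsearch, or any other backend, and restricting the input to rows with a given batch ID would give the per-batch behaviour discussed below.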
Alternatively, I think it is possible to do the deduplication on the SOLR side. I am not sure whether you can implement your own logic there, but if so, you could take the batch id into account.

Julien

On 18 July 2013 15:43, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Tony,
>
> On Thursday, July 18, 2013, Tony Mullins <[email protected]> wrote:
> > Currently in Nutch 2.x the SolrDeDup job runs on the entire index.
> > Is it possible to configure it to run against the current batch id?
>
> It will be possible. There are various issues open (and patches) for 2.3
> which deal with improving the solr* jobs:
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20NUTCH%20AND%20fixVersion%20%3D%20%222.3%22%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
>
> Of particular relevance will be NUTCH-1556, which aims to develop updatedb
> to do the exact same thing. Maybe you can take some inspiration from this?
>
> > We are trying to maintain historical data in Solr, crawled by Nutch, on
> > the basis of the date on which it was crawled.
> >
> > So in this scenario, when I run the Nutch crawl script it removes all
> > duplicate docs across all dates (the entire index), and if I remove the
> > SolrDeDup command from the crawl script and run it with numberOfRounds >= 2,
> > then I get duplicate docs for each (generate -> fetch -> parse ->
> > dbupdate -> solrindex) cycle.
> >
> > Thanks,
> > Tony.
>
> --
> *Lewis*

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
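Tony's per-batch scenario above can also be sketched in a few lines. This is a hypothetical illustration, not Nutch code: instead of re-deduplicating the entire index every round, only the URLs in the current batch are checked against the signatures already present in the index, so historical documents from earlier crawl dates are never touched.

```java
import java.util.*;

// Hypothetical sketch: dedup scoped to the current batch.
// batchUrlToSig maps each URL fetched in this batch to its content
// signature; indexedSignatures represents signatures already in Solr
// (how they are obtained is left open, e.g. a facet or terms query).
public class BatchDedupSketch {
    // Returns URLs from the current batch that duplicate existing docs.
    public static List<String> batchDeletions(Map<String, String> batchUrlToSig,
                                              Set<String> indexedSignatures) {
        List<String> deletions = new ArrayList<>();
        for (Map.Entry<String, String> e : batchUrlToSig.entrySet()) {
            if (indexedSignatures.contains(e.getValue())) {
                deletions.add(e.getKey());
            }
        }
        return deletions;
    }
}
```

With this shape, the cost of each round is proportional to the batch size rather than the index size, which is the efficiency gain the per-batch-ID proposal is after.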

