The deduplication should be replaced and done in a backend-neutral way, i.e.
by finding duplicates using the info in the crawldb / webtable only and
communicating with the backend to send the deletions. The current
implementation is SOLR-specific and inefficient. This has been
discussed on several occasions but no one has found the time to work on it
so far. It would then be possible to do it per batch ID, I suppose.
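
A rough sketch of what the backend-neutral approach could look like, written
in Python for brevity rather than as actual Nutch code. The Page record, its
field names and the delete callback are all made up for illustration, and the
"keep the most recently fetched page" rule is just one possible policy. The
point is only that duplicates are found from the stored signatures, optionally
restricted to one batch ID, and the backend merely receives a list of
deletions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Page:
    # Hypothetical stand-in for a crawldb / webtable row (names are assumptions).
    url: str
    signature: str   # content digest computed when the page was parsed
    batch_id: str
    fetch_time: int  # epoch seconds


def find_duplicates(pages: Iterable[Page],
                    batch_id: Optional[str] = None) -> List[str]:
    """Return the URLs to delete, looking only at the stored page records.

    Pages sharing a signature are duplicates; the most recently fetched one
    is kept (ties go to the lexicographically larger URL). If batch_id is
    given, only pages from that batch are compared.
    """
    groups = defaultdict(list)
    for page in pages:
        if batch_id is not None and page.batch_id != batch_id:
            continue
        groups[page.signature].append(page)

    to_delete = []
    for group in groups.values():
        keeper = max(group, key=lambda p: (p.fetch_time, p.url))
        to_delete.extend(p.url for p in group if p is not keeper)
    return sorted(to_delete)


def deduplicate(pages: Iterable[Page],
                delete: Callable[[List[str]], None],
                batch_id: Optional[str] = None) -> List[str]:
    """Find duplicates, then hand the deletions to a backend-specific callback."""
    urls = find_duplicates(pages, batch_id)
    if urls:
        delete(urls)
    return urls
```

Here delete would be whatever the indexing backend provides for removing
documents by ID, so the duplicate detection itself never has to query the index.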

Alternatively, I think it is possible to do the deduplication on the SOLR
side. I am not sure you can implement your own logic there, but if so you
could take the batch ID into account.

Julien



On 18 July 2013 15:43, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Tony,
>
> On Thursday, July 18, 2013, Tony Mullins <[email protected]> wrote:
> > Currently in Nutch2.x SolrDeDup job runs on entire index.
> > Is it possible to configure it to run against the current batch Id ?
>
> It will be possible. There are various issues open (and patches) for 2.3
> which deal with improving solr* jobs
>
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20NUTCH%20AND%20fixVersion%20%3D%20%222.3%22%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
>
> Of particular relevance will be NUTCH-1556 which aims to develop updatedb
> to do the exact same. Maybe you can take some inspiration from this?
>
> > We are trying to maintain historical data in Solr, crawled by nutch on the
> > basis of the date on which it was crawled.
> >
> > So in this scenario, when I run the nutch crawl script it removes all
> > duplicate docs against all dates (in the entire index), and if I remove the
> > SolrDeDup command from the crawl script and run it with numberOfRounds >= 2
> > then I get duplicate docs for each (generate -> fetch -> parse ->
> > dbupdate -> solrindex) cycle.
> >
> > Thanks,
> > Tony.
> >
>
> --
> *Lewis*
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
