There's a patch for Nutch 1.3 that does the trick:
https://issues.apache.org/jira/browse/NUTCH-963
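With that patch applied, Nutch gains a cleaning job that scans the crawldb for pages marked gone and deletes the matching documents from Solr. A rough invocation might look like the following (the exact command name and arguments depend on the patched build, so check the usage output of your `bin/nutch`; the Solr URL here is just a placeholder):

```shell
# After the usual crawl/index cycle, run the cleaning job:
# it walks the crawldb, finds entries whose status is "gone"
# (e.g. re-fetched and returned 404), and issues deletes
# against the Solr index for those URLs.
$NUTCH_HOME/bin/nutch solrclean crawl/crawldb http://localhost:8983/solr
```

Note that a page is only marked gone in the crawldb after it has been re-fetched, so with db.fetch.interval.default at 86400 seconds a removed URL will take at least a day to be noticed before the cleaning job can delete it.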

On Thursday 10 March 2011 15:48:13 Nemani, Raj wrote:
> All,
> 
> 
> 
> I use Nutch to crawl a couple of internal websites and index the crawl
> results into Solr.  Periodically, URLs get removed from these websites,
> and I am noticing that the documents in the index corresponding to
> these deleted URLs do not get cleaned up.
> 
> 
> 
> My db.fetch.interval.default is set to 86400 seconds (24 hrs)
> 
> The following is the command I use to index crawled documents to Solr
> 
> $NUTCH_HOME/bin/nutch solrindex $solr_endpoint crawl/crawldb
> crawl/linkdb crawl/segments/*
> 
> 
> 
> Can you please tell me what I am doing wrong? Does Nutch/Solr indexing
> not see that a URL has been deleted and needs to be removed from the
> Solr index?
> 
> 
> 
> Thanks so much in advance
> 
> Raj

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350