There's a patch for Nutch 1.3 that does the trick: https://issues.apache.org/jira/browse/NUTCH-963
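If the patch adds a delete-gone option to the solrindex step (verify the exact option name against the patched usage output; the flag below is an assumption, not confirmed from the issue), the invocation might look like:

```shell
# Sketch only, assuming the NUTCH-963 patch has been applied to your build.
# The -deleteGone flag name is an assumption -- check the patched
# "bin/nutch solrindex" usage message for the real option.
$NUTCH_HOME/bin/nutch solrindex $solr_endpoint crawl/crawldb \
    crawl/linkdb crawl/segments/* -deleteGone
```

Documents whose URLs come back as gone during the crawl should then be removed from the Solr index at indexing time.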
On Thursday 10 March 2011 15:48:13 Nemani, Raj wrote:
> All,
>
> I use Nutch to crawl a couple of internal websites and index the crawl
> results into Solr. Periodically URLs get removed from these websites,
> and I am noticing that the documents in the index corresponding to
> these deleted URLs do not get cleaned up.
>
> My db.fetch.interval.default is set to 86400 seconds (24 hrs).
>
> The following is the command I use to index crawled documents to Solr:
>
> $NUTCH_HOME/bin/nutch solrindex $solr_endpoint crawl/crawldb crawl/linkdb crawl/segments/*
>
> Can you please tell me what I am doing wrong? Is Nutch/Solr indexing
> not seeing the fact that there is a deleted URL that needs to be
> deleted from the Solr index?
>
> Thanks so much in advance,
> Raj

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

