All,

I use Nutch to crawl a couple of internal websites and index the crawl
results into Solr. Periodically, URLs are removed from these websites,
and I am noticing that the documents in the index corresponding to
these deleted URLs do not get cleaned up.


My db.fetch.interval.default is set to 86400 seconds (24 hours).

The following is the command I use to index crawled documents into Solr:

$NUTCH_HOME/bin/nutch solrindex $solr_endpoint crawl/crawldb \
    crawl/linkdb crawl/segments/*
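For context, that indexing command runs as the last step of my crawl script, which is roughly the standard Nutch 1.x cycle below (the seed directory name and segment-selection line here are illustrative, not my exact setup):

```shell
# Sketch of a typical Nutch 1.x crawl cycle; paths/names are illustrative
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls              # seed the crawldb
$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments  # pick URLs due for fetch
SEGMENT=$(ls -d crawl/segments/* | tail -1)                  # newest segment
$NUTCH_HOME/bin/nutch fetch $SEGMENT
$NUTCH_HOME/bin/nutch parse $SEGMENT
$NUTCH_HOME/bin/nutch updatedb crawl/crawldb $SEGMENT        # fold fetch results back in
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$NUTCH_HOME/bin/nutch solrindex $solr_endpoint crawl/crawldb \
    crawl/linkdb crawl/segments/*
```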


Can you please tell me what I am doing wrong? Does the Nutch/Solr
indexing step not detect that a deleted URL needs to be removed from
the Solr index?


Thanks so much in advance,

Raj

