Hi Joe,

On Wed, Nov 21, 2012 at 9:25 PM, Joe Zhang <[email protected]> wrote:
> Are you saying that as long as I crawl some page once, nutch will go and
> refetch the page in 30 days by default, without me running the command
> again?

No, that is not possible (unless you have an automated job running). Unless you invoke the command(s), Nutch will do nothing. What I was explaining is that Nutch by default (assuming you are using the default fetch schedule) sets the next fetch time 30 days ahead, although I think this is configurable within nutch-site.xml.

> So refetching will always happen, even if the same URL has already existed
> in Solr index (and used as an ID field)?

Nutch and Solr are different systems, and Solr doesn't know very much about Nutch at all. I think you would really benefit from learning a bit about the crawldb and its constituent parts, and consequently what role these parts play in maintaining a crawl cycle. A URL becomes eligible for refetching once its next fetch time is in the past, regardless of what is in your Solr index; the refetch itself still only happens when you run the crawl commands again.

Lewis
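
P.S. To make the scheduling rule above concrete, here is a minimal sketch in Python. It is not Nutch code: the names `next_fetch_time` and `is_due` are hypothetical, and the 30-day interval is simply the default mentioned above (as far as I know, the corresponding property is `db.fetch.interval.default`, given in seconds). The point is only that a URL becomes due when its next fetch time is in the past, and nothing about the Solr index enters into the decision.

```python
from datetime import datetime, timedelta

# Assumption: mirrors the 30-day default fetch interval discussed above.
DEFAULT_FETCH_INTERVAL = timedelta(days=30)


def next_fetch_time(fetched_at, interval=DEFAULT_FETCH_INTERVAL):
    """Return the time after which a URL becomes eligible for refetching."""
    return fetched_at + interval


def is_due(next_fetch, now):
    """A URL is due once its next fetch time is in the past (or is now).

    Note that no index state is consulted: only the stored next fetch
    time and the current time matter.
    """
    return next_fetch <= now


# Hypothetical example: a page fetched on the date of this thread.
fetched_at = datetime(2012, 11, 21, 21, 25)
due_after = next_fetch_time(fetched_at)

# Ten days later the URL is not yet due; thirty-one days later it is.
print(is_due(due_after, fetched_at + timedelta(days=10)))
print(is_due(due_after, fetched_at + timedelta(days=31)))
```

Even when `is_due` returns true, nothing is fetched until a crawl cycle is actually run; the check only decides which URLs are selected at that point.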

