Re: timestamp in nutch/solr

Lewis John Mcgibbney Wed, 21 Nov 2012 13:35:55 -0800

Hi Joe,

On Wed, Nov 21, 2012 at 9:25 PM, Joe Zhang <[email protected]> wrote:


> Are you saying that as long as I crawl some page once, nutch will go and
> refetch the page in 30 days by default, without me running the command
> again?

No, this is impossible (unless you have an automated job running).
Unless you invoke the command(s), Nutch will do nothing.
What I was explaining is that Nutch by default (assuming you are using
the default fetching schedule) sets the next fetch time to 30 days after
the fetch, although I think this is configurable within nutch-site.xml.
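
For illustration, an override in nutch-site.xml might look roughly like
the fragment below. This is a sketch, not something taken from your
setup: it assumes the property is db.fetch.interval.default (the name
used in Nutch 1.x's nutch-default.xml, with the value in seconds), and
the 7-day value is only an example, so please check the
nutch-default.xml shipped with your version.

```xml
<!-- nutch-site.xml: assumed override of the default re-fetch interval. -->
<!-- Property name as found in Nutch 1.x nutch-default.xml; verify it for your version. -->
<configuration>
  <property>
    <name>db.fetch.interval.default</name>
    <!-- Seconds between re-fetches: 604800 = 7 days (the stock default, 2592000, is 30 days). -->
    <value>604800</value>
  </property>
</configuration>
```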

> So refetching will always happen, even if the same URL has already existed
> in Solr index (and used as an ID field)?

Nutch and Solr are different systems, and Solr doesn't know very much
about Nutch at all. I think you would really benefit from trying
to learn a bit about the crawldb and its constituent parts, and
consequently what role these parts/features play in maintaining a
crawl cycle.
Refetching of a URL will occur the next time you run the crawl
commands after its next fetch time has passed, regardless of what is
in your Solr index.
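
To make that concrete, here is a toy model of the decision in Python.
It is not Nutch code: CrawlEntry, mark_fetched and due_for_fetch are
hypothetical names made up for this sketch, and the 30-day interval
simply mirrors the default mentioned above. The point is that the
selection looks only at the next fetch time stored with each URL and
never consults Solr.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed default interval, mirroring the 30 days discussed above.
DEFAULT_INTERVAL = timedelta(days=30)


@dataclass
class CrawlEntry:
    """Hypothetical stand-in for one crawldb record."""
    url: str
    next_fetch: datetime


def mark_fetched(entry, fetched_at, interval=DEFAULT_INTERVAL):
    """After a fetch, schedule the next one `interval` later."""
    entry.next_fetch = fetched_at + interval


def due_for_fetch(crawldb, now):
    """Return URLs whose next fetch time has passed.

    Only the crawldb entries are inspected; the Solr index plays no part.
    """
    return [entry.url for entry in crawldb if entry.next_fetch <= now]
```

Nothing in this model runs by itself: due_for_fetch only produces a
list when something calls it, which corresponds to you (or a scheduled
job) invoking the crawl commands.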

Lewis
