I should also mention that it is common to execute a series of jobs as a 
complete crawl cycle: generate, fetch, parse, update the CrawlDB, update the 
LinkDB, and index (and optionally deduplicate and/or clean the index once 
that's committed).
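As a sketch, that cycle maps onto a sequence of Nutch 1.x command-line jobs. This is a dry-run: the `crawl/` paths, the segment directory name, and the Solr URL are all example values, and `NUTCH` defaults to just echoing each command rather than running it (set `NUTCH=bin/nutch` to execute the jobs for real).

```shell
# Dry-run sketch of one crawl cycle (Nutch 1.x job names; paths are examples).
# NUTCH defaults to echoing the commands; set NUTCH=bin/nutch to run them.
NUTCH="${NUTCH:-echo bin/nutch}"

$NUTCH generate crawl/crawldb crawl/segments          # select URLs due for fetching
segment=crawl/segments/20100101000000                 # in practice: the newest dir under crawl/segments
$NUTCH fetch "$segment"                               # fetch the selected URLs
$NUTCH parse "$segment"                               # parse the fetched content
$NUTCH updatedb crawl/crawldb "$segment"              # fold fetch results back into the CrawlDB
$NUTCH invertlinks crawl/linkdb -dir crawl/segments   # update the LinkDB
$NUTCH solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb "$segment"
# optionally: deduplicate and/or clean the index once the commit is done
```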

This means that after every completed cycle, the timestamps in the CrawlDB 
all show the next fetch time. That time starts from a default fetch interval 
and is usually adjusted by a scheduling algorithm that lengthens or shortens 
the interval depending on whether the page was modified since the previous 
fetch.
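A rough sketch of both points, using stand-in types rather than the real org.apache.nutch.crawl.CrawlDatum API: in Nutch itself the two meanings of getFetchTime() can be told apart by the datum's status family (DB_* statuses are written by the CrawlDB update, FETCH_* statuses by the fetcher), and the interval adjustment below is only in the spirit of AdaptiveFetchSchedule (the rates and bounds here are assumptions, not the shipped defaults).

```java
// Standalone sketch: models how the fetch-time field is interpreted and
// how an adaptive schedule nudges the fetch interval. Not the Nutch API.
public class FetchSchedule {
    // Mirrors Nutch's split between DB_* statuses (set when the CrawlDB is
    // updated) and FETCH_* statuses (set by the fetcher). Names are stand-ins.
    enum Status { DB_UNFETCHED, DB_FETCHED, FETCH_SUCCESS }

    static String fetchTimeMeaning(Status s) {
        // After updatedb the datum carries a DB_* status and the fetch-time
        // field holds the *next* scheduled fetch; straight after the fetch
        // job it carries a FETCH_* status and the field holds the *last*
        // fetch time.
        return s.name().startsWith("DB_") ? "next scheduled fetch"
                                          : "last fetch time";
    }

    // Adaptive-style interval update: shrink the interval when the page
    // changed, grow it when it did not, clamped to [min, max].
    // The 0.2 / 0.4 rates and the bounds are illustrative assumptions.
    static long nextIntervalSec(long intervalSec, boolean modified) {
        double rate = modified ? -0.2 : 0.4;
        long next = Math.round(intervalSec * (1.0 + rate));
        long min = 60L;                  // never refetch more often than 60 s
        long max = 365L * 24 * 3600;     // never wait longer than a year
        return Math.max(min, Math.min(max, next));
    }

    public static void main(String[] args) {
        System.out.println(fetchTimeMeaning(Status.DB_FETCHED));
        System.out.println(nextIntervalSec(30L * 24 * 3600, false));
    }
}
```

With this model, an unmodified page's interval grows each cycle until it hits the ceiling, so rarely-changing pages are revisited less and less often.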



> After the fetch job, it will contain the time it was fetched. If you run
> the updatedb job, it will contain the time it is due to be fetched.
> 
> It makes sense because newly discovered URLs are added to the CrawlDB and
> carry the time after which they are due to be fetched. Subsequent fetch and
> update jobs complete the cycle.
> 
> > From the javadocs for CrawlDatum.getFetchTime() (Nutch 1.1):
> > 
> > "Returns either the time of the last fetch, or the next fetch time,
> > depending on whether Fetcher or CrawlDbReducer set the time."
> > 
> > So is there any way to determine which of these two conditions is true,
> > using just the information in CrawlDatum? My goal is to estimate the next
> > time that the item will be fetched.
> > 
> > -MB
