I should also mention that it is common to execute a series of jobs as a complete crawl cycle: generate, fetch, parse, update the CrawlDB, update the LinkDB, and index (and optionally deduplicate and/or clean the index once it is committed).
This means that after every crawl, the timestamps all show the next fetch time. That time starts from a default interval and is usually adjusted by an algorithm that increases or decreases it depending on whether the page was modified since the previous fetch.

> After the fetch job, it will contain the time it was fetched. If you run
> the updatedb job, it will contain the time it is due to be fetched.
>
> It makes sense because newly discovered URLs are added to the CrawlDB and
> carry the time after which they are due to be fetched. Subsequent fetch and
> update jobs complete the cycle.
>
> > From the javadocs for CrawlDatum.getFetchTime() (Nutch 1.1):
> >
> > "Returns either the time of the last fetch, or the next fetch time,
> > depending on whether Fetcher or CrawlDbReducer set the time."
> >
> > So is there any way to determine which of these two conditions is true,
> > using just the information in CrawlDatum? My goal is to estimate the next
> > time that the item will be fetched.
> >
> > -MB
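To make the distinction concrete, here is a minimal sketch (not the real org.apache.nutch.crawl.CrawlDatum class) of the two pieces discussed above: deciding whether a datum's fetch time is the last or the next fetch based on where the datum came from, and an adaptive interval adjustment in the spirit of Nutch's AdaptiveFetchSchedule. The status byte values, rates, and bounds are assumptions for illustration, not values taken from the Nutch source.

```java
// Hypothetical model of CrawlDatum fetch-time semantics; all constants
// below are assumed for illustration and are not copied from Nutch.
public class FetchTimeSketch {

    // Modeled status values: db_* statuses live in the CrawlDB (written by
    // updatedb), fetch_* statuses live in a segment (written by Fetcher).
    static final byte STATUS_DB_UNFETCHED  = 1;
    static final byte STATUS_DB_FETCHED    = 2;
    static final byte STATUS_FETCH_SUCCESS = 33;

    // In the CrawlDB (db_* status) fetchTime is the *next* fetch time;
    // in a fetched segment (fetch_* status) it is the *last* fetch time.
    static boolean fetchTimeIsNextFetch(byte status) {
        return status == STATUS_DB_UNFETCHED || status == STATUS_DB_FETCHED;
    }

    // Adaptive schedule sketch: shrink the interval when the page changed,
    // grow it when it did not, clamped to assumed min/max bounds.
    static long adjustIntervalSeconds(long intervalSec, boolean modified) {
        double incRate = 0.4, decRate = 0.2;       // assumed rates
        long minSec = 60, maxSec = 365L * 24 * 3600;
        long next = modified
                ? (long) (intervalSec * (1.0 - decRate))
                : (long) (intervalSec * (1.0 + incRate));
        return Math.max(minSec, Math.min(maxSec, next));
    }

    public static void main(String[] args) {
        // A datum read from the CrawlDB carries the next fetch time.
        System.out.println(fetchTimeIsNextFetch(STATUS_DB_FETCHED));   // true
        // A datum read from a segment carries the last fetch time.
        System.out.println(fetchTimeIsNextFetch(STATUS_FETCH_SUCCESS)); // false
        // Starting from a one-day interval: grows when unmodified,
        // shrinks when modified.
        System.out.println(adjustIntervalSeconds(86400, false));
        System.out.println(adjustIntervalSeconds(86400, true));
    }
}
```

So to estimate the next fetch time from CrawlDB data alone, read the datum from the CrawlDB (e.g. via `bin/nutch readdb`) rather than from a segment; there the fetch time already is the next scheduled fetch.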

