On Wed, Nov 21, 2012 at 4:42 AM, Lewis John Mcgibbney <[email protected]> wrote:
> Hi Joe,
>
> On Wed, Nov 21, 2012 at 12:31 AM, Joe Zhang <[email protected]> wrote:
>
> > I'm assuming the 'tmstamp' field in a default nutch/solr setup records
> > the crawling time of a document. This gives rise to the following
> > question:
>
> In BasicIndexingFilter#filter() (in trunk) [0] you will see the following
> code.
>
>     // add timestamp when fetched, for deduplication
>     doc.add("tstamp", new Date(datum.getFetchTime()));
>
> so we regard tstamp as the time the page was fetched (just so we are
> using correct terminology)

Thanks for correcting the terminology :) Fetching time it is!

> > 1. If a URL (assuming no content change) is recrawled by a
> > subsequent crawling job and resent to the solr index, is the tmstamp
> > value updated to the later time?
>
> Well assuming that the page is refetched at the default fetch interval
> (30 days) and no content change is detected, then yes, the new tstamp
> will take place of the old meaning that the scoring and fetch schedule
> can be allocated accordingly.

Your mention of refetching is a bit confusing. I run nutch from the
command line:

% nutch crawl..... solr........

Are you saying that as long as I crawl some page once, nutch will go and
refetch the page in 30 days by default, without me running the command
again?

> > 2. If the same URL is crawled by 2 subsequent crawling jobs,
>
> This is quite confusing terminology, do you really mean subsequent?
>
> > and the
> > content of the page is updated between the 2 jobs, will nutch (in job 2)
> > detect the content change and update the solr index?
>
> If you mean that the URL is fetched three times in all, and that no
> fetches overlap then the answer should be yes. If the page has changed
> then this change will represent the last time the page was fetched.

So refetching will always happen, even if the same URL already exists in
the Solr index (where it is used as the ID field)?

> Lewis
>
> [0]
> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java
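
To make the "default fetch interval (30 days)" point concrete, here is a minimal sketch of the due-date arithmetic that the reply seems to describe. It is not Nutch code: the class name, the helper isDueForRefetch, and the constant are hypothetical, and the 30-day value is simply taken from the thread. It only models when a page would count as due. Whether anything checks this without the crawl command being run again is exactly the open question above.

```java
import java.time.Duration;
import java.time.Instant;

public class RefetchSketch {
    // Hypothetical constant: the "30 days" default interval mentioned in the thread.
    static final Duration DEFAULT_INTERVAL = Duration.ofDays(30);

    // Hypothetical helper (not a Nutch API): a page counts as due once
    // lastFetch + interval is at or before the current time.
    static boolean isDueForRefetch(Instant lastFetch, Duration interval, Instant now) {
        return !now.isBefore(lastFetch.plus(interval));
    }

    public static void main(String[] args) {
        Instant lastFetch = Instant.parse("2012-11-21T00:31:00Z");

        // One day after the first fetch: not yet due.
        System.out.println(isDueForRefetch(
                lastFetch, DEFAULT_INTERVAL, Instant.parse("2012-11-22T00:00:00Z")));

        // Exactly 30 days after the first fetch: due.
        System.out.println(isDueForRefetch(
                lastFetch, DEFAULT_INTERVAL, Instant.parse("2012-12-21T00:31:00Z")));
    }
}
```

Under this model, a second crawl job started the next day would skip the URL, and one started 30 or more days later would fetch it again and produce a new tstamp.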

