Hi Joe,
On Wed, Nov 21, 2012 at 12:31 AM, Joe Zhang <[email protected]> wrote:
> I'm assuming the 'tmstamp' field in a default nutch/solr setup records
> the crawling time of a document. This gives rise to the following question:
In BasicIndexingFilter#filter() (in trunk) [0] you will see the following code.
// add timestamp when fetched, for deduplication
doc.add("tstamp", new Date(datum.getFetchTime()));
So the field is named tstamp, and we regard it as the time the page was
fetched (just so we are using the correct terminology).
>
> 1. If a URL (assuming no content change) is recrawled by a
> subsequent crawling job and resent to the solr index, is the tmstamp value
> updated to the later time?
Well, assuming that the page is refetched at the default fetch interval
(30 days) and no content change is detected, then yes, the new tstamp
will take the place of the old one, meaning that the scoring and fetch
schedule can be allocated accordingly.
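
To make the overwrite behaviour concrete, here is a minimal sketch in plain
Java. It does not use any Nutch or Solr classes: the map stands in for an
index keyed on the URL, and the names TstampSketch, Doc and send are made up
for illustration. It assumes the index keeps one document per URL, so that
re-sending a document replaces the earlier one.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Toy model only: a map keyed on URL stands in for the Solr index.
final class TstampSketch {

    // Hypothetical stand-in for an indexed document.
    static final class Doc {
        final String content;
        final Date tstamp; // time the page was fetched

        Doc(String content, long fetchTimeMillis) {
            this.content = content;
            this.tstamp = new Date(fetchTimeMillis);
        }
    }

    private final Map<String, Doc> docsByUrl = new HashMap<>();

    // Sending a document for a URL that is already present replaces the
    // previous entry, so the stored tstamp becomes the later fetch time.
    void send(String url, String content, long fetchTimeMillis) {
        docsByUrl.put(url, new Doc(content, fetchTimeMillis));
    }

    Doc get(String url) {
        return docsByUrl.get(url);
    }

    int size() {
        return docsByUrl.size();
    }
}
```

Sending the same URL twice with unchanged content, 30 days apart, leaves a
single entry whose tstamp is the second fetch time.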
> 2. If the same URL is crawled by 2 subsequent crawling jobs,
This terminology is quite confusing. Do you really mean subsequent?
> and the
> content of the page is updated between the 2 jobs, will nutch (in job 2)
> detect the content change and update the solr index?
If you mean that the URL is fetched three times in all, and that no
fetches overlap, then the answer should be yes. If the page has changed,
then the index will reflect the page as it was the last time it was fetched.
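
One way to picture the content check is as a digest comparison between two
fetches. The sketch below is an assumption for illustration only, not the
actual Nutch code path: it hashes the page text with MD5 from the JDK and
treats a differing digest as a content change. The class and method names
are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative only: compares digests of two fetches of the same URL.
final class ChangeDetectSketch {

    // MD5 digest of the page text, encoded as UTF-8.
    static byte[] digest(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return md.digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // True when the two fetches produced different content.
    static boolean hasChanged(String previousContent, String currentContent) {
        return !Arrays.equals(digest(previousContent), digest(currentContent));
    }
}
```

With a check like this, an unchanged page gives the same digest on both
fetches, while any edit to the page gives a different one.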
Lewis
[0]
http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java