On Wed, Nov 21, 2012 at 4:42 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Joe,
>
> On Wed, Nov 21, 2012 at 12:31 AM, Joe Zhang <[email protected]> wrote:
>
> > I'm assuming the "tstamp" field in a default nutch/solr setup records
> > the crawling time of a document. This gives rise to the following
> > question:
>
> In BasicIndexingFilter#filter() (in trunk) [0] you will see the following
> code.
>
> // add timestamp when fetched, for deduplication
> doc.add("tstamp", new Date(datum.getFetchTime()));
>
> so we regard tstamp as the time the page was fetched (just so we are
> using correct terminology)
>

Thanks for correcting the terminology :) Fetch time it is!

> >
> > 1. If a URL (assuming no content change) is recrawled by a
> > subsequent crawling job and resent to the solr index, is the tstamp
> > value updated to the later time?
>
> Well, assuming that the page is refetched at the default fetch interval
> (30 days) and no content change is detected, then yes, the new tstamp
> will take the place of the old one, meaning that the scoring and fetch
> schedule can be allocated accordingly.
>

Your mention of refetching is a bit confusing.

I run nutch from the command line:

% nutch crawl..... solr........

Are you saying that as long as I crawl some page once, nutch will go and
refetch the page in 30 days by default, without me running the command
again?
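
To make the question concrete, here is a minimal sketch of how I picture the
fetch interval. It is only my assumption, not Nutch's actual code:
RefetchSketch, nextDue and isDue are hypothetical names I made up, and the
30 days is the default interval you mentioned. In this picture nothing
happens on its own; a page is refetched only if I start another crawl job
at or after its due time. Is that right?

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration only; none of these names come from Nutch.
final class RefetchSketch {
    // Assumed default fetch interval of 30 days, as mentioned in this thread.
    static final Duration DEFAULT_INTERVAL = Duration.ofDays(30);

    // Earliest time at which a page last fetched at lastFetch is due again.
    static Instant nextDue(Instant lastFetch, Duration interval) {
        return lastFetch.plus(interval);
    }

    // True if a crawl job starting at jobStart would find the page due,
    // i.e. jobStart is at or after the due time.
    static boolean isDue(Instant lastFetch, Duration interval, Instant jobStart) {
        return !jobStart.isBefore(nextDue(lastFetch, interval));
    }
}
```

So under this assumption, a job I start 29 days after the first fetch would
skip the page, and a job I start on day 30 or later would refetch it.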


>
> > 2. If the same URL is crawled by 2 subsequent crawling jobs,
>
> This is quite confusing terminology, do you really mean subsequent?
>
> > and the
> > content of the page is updated between the 2 jobs, will nutch (in job 2)
> > detect the content change and update the solr index?
>
> If you mean that the URL is fetched three times in all, and that no
> fetches overlap then the answer should be yes. If the page has changed
> then this change will represent the last time the page was fetched.
>
>
So refetching will always happen, even if the same URL already exists in
the Solr index (and is used as the ID field)?


> Lewis
>
> [0]
> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java
>
