Hi,

Thanks for taking the time to write this awesome explanation (and my
apologies for the repetition).

It's now clear that the modified time is set by AdaptiveFetchSchedule and
not by DefaultFetchSchedule. I am glad to find the modified time in
`content.getMetadata().get(Response.LAST_MODIFIED)`!
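
For anyone reading this in the archives: the value in the content metadata
is, as far as I can tell, the raw Last-Modified header string, so it still
has to be parsed before it can be compared with the numeric times in the
CrawlDb. Below is a minimal sketch using only the JDK. The class and method
names are made up for illustration and are not part of Nutch, and the header
value in main() is a stand-in for what the metadata lookup would return.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not a Nutch class.
class LastModifiedHeader {

    /**
     * Parses an HTTP Last-Modified value in RFC 1123 format into epoch
     * milliseconds. Returns 0 when the value is missing or cannot be
     * parsed, which is the same "unset" value seen in the CrawlDb dump.
     */
    static long toEpochMillis(String headerValue) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return 0L;
        }
        try {
            return ZonedDateTime
                .parse(headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant()
                .toEpochMilli();
        } catch (DateTimeParseException e) {
            // Servers do send malformed dates; treat them as "unknown".
            return 0L;
        }
    }

    public static void main(String[] args) {
        // Assumed example value, standing in for
        // content.getMetadata().get(Response.LAST_MODIFIED).
        String header = "Sun, 08 Nov 2015 23:39:00 GMT";
        System.out.println(toEpochMillis(header));
        System.out.println(toEpochMillis(null));
    }
}
```

Falling back to 0 is only one possible choice; a caller that needs to tell
"no header" apart from "bad header" would want a different return type.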

As a follow-up to your suggested improvements, an issue has been
created in Jira (https://issues.apache.org/jira/browse/NUTCH-2164).
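
Also for the archives: the 'Wed Dec 31 16:00:00 PST 1969' from the original
report is simply what a modified time of 0 looks like once it is printed in
Pacific time. The sketch below reproduces that, assuming the dump renders
the stored milliseconds the way java.util.Date.toString() does. The class
and method names are hypothetical and only serve the illustration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical demo class, not part of Nutch.
class EpochZeroDemo {

    // Same field layout that java.util.Date.toString() produces.
    private static final DateTimeFormatter DATE_TO_STRING_LAYOUT =
        DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US);

    /** Renders epoch milliseconds as wall-clock time in the given zone. */
    static String render(long epochMillis, String zoneId) {
        ZonedDateTime t =
            Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zoneId));
        return DATE_TO_STRING_LAYOUT.format(t);
    }

    public static void main(String[] args) {
        // An unset modified time (0) seen from a machine in Pacific time.
        System.out.println(render(0L, "America/Los_Angeles"));
        // The same instant in UTC, for comparison.
        System.out.println(render(0L, "UTC"));
    }
}
```

So a 1969 date in the dump is a hint that the field was never set, not that
the server sent a bogus date.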

--
Regards,
TG

On Sun, Nov 8, 2015 at 11:39 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Hi,
>
> that might look strange but it's not a bug.
> It could be improved, see below, simply because
> it's not obvious - I also stumbled over this
> point some time ago. It also pops up from time
> to time on the mailing lists, see references below.
>
> - when indexing the modified time (sent by the server)
>   the time from the Content class
>     content.getMetadata().get(Response.LAST_MODIFIED)
>   is used by the index-more plugin
>
> - the "modified time" stored in the CrawlDb is not the
>   modified time sent by the server but the time of the
>   last "real" fetch, excluding fetches which returned
>   an unmodified document, either by if-modified-since
>   HTTP requests or by a signature comparison.
>   See also NUTCH-933.
>
> - it is set by setFetchSchedule(...) but only by
>   AdaptiveFetchSchedule, not by DefaultFetchSchedule.
>   The latter does not use it, while the former "adapts"
>   the re-fetch interval depending on the change frequency.
>
> - the lastModified field in ProtocolStatus shown by toString()
>     _pst_: success(1), lastModified=0
>   was obviously never used. It's probably just a relic.
>   If you remove it, CrawlDbs become incompatible. But it
>   could be filled with the modified time returned by the
>   server (or, e.g. the file system for protocol-file).
>
> As said, these could be improvements:
> 1 also set modified time by DefaultFetchSchedule
> 2 set ProtocolStatus.lastModified if modified time is available
>
> Please, feel free to open Jira issues for these.
>
> Thanks,
> Sebastian
>
> References:
> https://issues.apache.org/jira/browse/NUTCH-933
>
> http://lucene.472066.n3.nabble.com/setting-modifiedTime-in-DefaultFetchSchedule-td4020457.html
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15056.html
>
> On 11/06/2015 01:18 AM, Thamme Gowda N. wrote:
> > Hello,
> >
> > I found a strange issue with 'Modified time' in nutch crawldb.
> >
> > I dumped the crawldb using the command
> >     nutch readdb xx -dump yy
> >
> > And inspected the 'Modified time' in the dumped content.
> >
> > Surprisingly, the 'Modified time' is invalid. All the pages have
> > 'Modified time: Wed Dec 31 16:00:00 PST 1969' (that is, epoch 0 shifted
> > by -8 hours to PST). It is worth noting that 'lastModified=0' in Metadata.
> >
> > But I see the actual value in the response header.
> >
> > I am using Nutch 1.11. Can you verify whether this functionality is
> > broken?
> >
> > --
> > Regards,
> > Thamme Gowda N
>
>


-- 
-
ThammeGowda N
