In trunk, the modified time is based on whether or not the signature has
changed. It makes little sense to rely on HTTP headers, because almost no CMS
implements them correctly, and incorrect (or deliberately manipulated) headers
interfere with an adaptive schedule.

https://issues.apache.org/jira/browse/NUTCH-1341
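
To illustrate the idea of a signature-based modified time, here is a minimal Python sketch. It is not Nutch's code: the function names are hypothetical, and setting the modified time to the fetch time on the first fetch is an assumption made for the example.

```python
import hashlib


def signature(content: bytes) -> str:
    """MD5 digest of the raw content (a stand-in for an MD5 signature class)."""
    return hashlib.md5(content).hexdigest()


def update_modified_time(prev_signature, prev_modified_time, content, fetch_time):
    """Return (new_signature, modified_time) after a fetch.

    Hypothetical helper: the modified time only moves forward when the
    signature differs from the previous one. HTTP headers are never consulted.
    """
    new_sig = signature(content)
    if prev_signature is None or new_sig != prev_signature:
        # Assumption: a first fetch counts as a modification.
        return new_sig, fetch_time
    return new_sig, prev_modified_time


# Three fetches of the same URL; times are arbitrary integers.
sig1, mod1 = update_modified_time(None, None, b"<html>v1</html>", 1000)
sig2, mod2 = update_modified_time(sig1, mod1, b"<html>v1</html>", 2000)
sig3, mod3 = update_modified_time(sig2, mod2, b"<html>v2</html>", 3000)
print(mod1, mod2, mod3)  # 1000 1000 3000
```

The second fetch returns identical content, so the modified time stays at 1000 regardless of what the server's headers might have claimed.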
 
 
-----Original message-----
> From:j.sulli...@thomsonreuters.com <j.sulli...@thomsonreuters.com>
> Sent: Tue 13-Nov-2012 11:13
> To: user@nutch.apache.org
> Subject: RE: How to find ids of pages that have been newly crawled or 
> modified after a given date with Nutch 2.1
> 
> I think the modifiedTime comes from the HTTP headers if available; if not, it 
> is left empty.  In other words, it is the time the content was last modified 
> according to the source, and it is left blank when the source does not say.  
> Depending on what Jacob is trying to achieve the one line patch at 
> https://issues.apache.org/jira/browse/NUTCH-1475 might be what he needs (or 
> might not be).
> 
> James
> 
> -----Original Message-----
> From: Ferdy Galema [mailto:ferdy.gal...@kalooga.com] 
> Sent: Tuesday, November 13, 2012 6:31 PM
> To: user@nutch.apache.org
> Subject: Re: How to find ids of pages that have been newly crawled or 
> modified after a given date with Nutch 2.1
> 
> Hi,
> 
> There might be something wrong with the field modifiedTime. I'm not sure how 
> well you can rely on this field (with the default or the adaptive scheduler).
> 
> If you want to get to the bottom of this, I suggest debugging or running 
> small crawls to test the behaviour. In case something doesn't work as 
> expected, please repost here or open a Jira.
> 
> Ferdy.
> 
> On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk <jacob.s...@gmail.com> wrote:
> 
> > Hi,
> >
> > If this question has already been answered please forgive me and point 
> > me to the appropriate thread.
> >
> > I'd like to be able to find the ids of all new pages crawled by nutch 
> > or pages modified since a fixed point in the past.
> >
> > I'm using Nutch 2.1 with MySQL as the back-end and it seems like the 
> > appropriate back-end query should be something like:
> >
> >  select id from webpage where (prevFetchTime is null and fetchTime >= "X")
> >  or (modifiedTime >= "X")
> >
> > where "X" is some point in the past.
> >
> > What I've found is that modifiedTime is always null.  I am using the
> > adaptive scheduler and the default md5 signature class.  I've tried both
> > re-injecting the seed URLs and not re-injecting them; it makes no
> > difference.  modifiedTime remains null.
> >
> > I am most grateful for any help or advice.  If my nutch-site.xml file 
> > would help I can forward it along.
> >
> > Thanks,
> > jacob
> >
> 
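
The query from the original question can be tried out with a small Python sketch. It runs against an in-memory SQLite table rather than Nutch's MySQL back-end, so the four-column table, the sample rows, and the integer timestamps are all assumptions made for the example. Note that SQL tests for null with IS NULL; a comparison written as "= NULL" matches no rows.

```python
import sqlite3

# Toy stand-in for the webpage table; the real Nutch schema has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webpage ("
    "id TEXT PRIMARY KEY, prevFetchTime INTEGER, "
    "fetchTime INTEGER, modifiedTime INTEGER)"
)
rows = [
    ("new-page", None, 2000, None),      # never fetched before, fetched after X
    ("new-but-early", None, 900, None),  # never fetched before, fetched before X
    ("old-modified", 500, 2000, 1800),   # modified after X
    ("old-unchanged", 500, 2000, 400),   # modified before X
]
conn.executemany("INSERT INTO webpage VALUES (?, ?, ?, ?)", rows)


def changed_since(conn, since):
    """Ids of pages that are new or modified at or after `since`."""
    cur = conn.execute(
        "SELECT id FROM webpage "
        "WHERE (prevFetchTime IS NULL AND fetchTime >= ?) "
        "OR modifiedTime >= ? "
        "ORDER BY id",
        (since, since),
    )
    return [row[0] for row in cur]


result = changed_since(conn, 1000)
print(result)  # ['new-page', 'old-modified']
```

Rows whose modifiedTime is null never satisfy the second condition, so if that column is always null, only the first branch of the query can return anything.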
