Hi folks,

Thanks for all of your suggestions. Here are two tentative fixes suggested by my colleagues at work:

Fix 1: Within Nutch itself, in org.apache.nutch.crawl.DbUpdateReducer, change line 129 to:

    long modifiedTime = (modified == FetchSchedule.STATUS_MODIFIED)
        ? System.currentTimeMillis()
        : page.getModifiedTime();

Fix (or really workaround) 2: Alter the webpage table in MySQL to contain a column

    update_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP

and define a trigger as follows:

    DELIMITER //
    CREATE TRIGGER updtrigger BEFORE UPDATE ON webpage
    FOR EACH ROW
    BEGIN
        -- Stamp the row only when the content signature changes.
        -- Note: <> is not NULL-safe, so a change from a NULL signature
        -- to a real one leaves update_ts untouched.
        IF NEW.signature <> OLD.signature THEN
            SET NEW.update_ts = NOW();
        END IF;
    END //
    DELIMITER ;

I think the first is the lesser of two evils, and it seems to work, but I don't know enough about Nutch to determine whether this is an abuse of the semantics of the modifiedTime field. I'd love your $0.02.

Thanks,
jacob

On Tue, Nov 13, 2012 at 5:24 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> In trunk the modified time is based on whether or not the signature has
> changed. It makes little sense relying on HTTP headers because almost no
> CMS implements it correctly and it messes (or allows to be messed with on
> purpose) with an adaptive schedule.
>
> https://issues.apache.org/jira/browse/NUTCH-1341
>
>
> -----Original message-----
> > From: j.sulli...@thomsonreuters.com <j.sulli...@thomsonreuters.com>
> > Sent: Tue 13-Nov-2012 11:13
> > To: user@nutch.apache.org
> > Subject: RE: How to find ids of pages that have been newly crawled or
> > modified after a given date with Nutch 2.1
> >
> > I think the modifiedTime comes from the http headers if available, if
> > not it is left empty. In other words it is the time the content was last
> > modified according to the source if available and if not available it is
> > left blank. Depending on what Jacob is trying to achieve the one line
> > patch at https://issues.apache.org/jira/browse/NUTCH-1475 might be what
> > he needs (or might not be).
> >
> > James
> >
> > -----Original Message-----
> > From: Ferdy Galema [mailto:ferdy.gal...@kalooga.com]
> > Sent: Tuesday, November 13, 2012 6:31 PM
> > To: user@nutch.apache.org
> > Subject: Re: How to find ids of pages that have been newly crawled or
> > modified after a given date with Nutch 2.1
> >
> > Hi,
> >
> > There might be something wrong with the field modifiedTime. I'm not sure
> > how well you can rely on this field (with the default or the adaptive
> > scheduler).
> >
> > If you want to get to the bottom of this, I suggest debugging or running
> > small crawls to test the behaviour. In case something doesn't work as
> > expected, please repost here or open a Jira.
> >
> > Ferdy.
> >
> > On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk <jacob.s...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > If this question has already been answered please forgive me and point
> > > me to the appropriate thread.
> > >
> > > I'd like to be able to find the ids of all new pages crawled by Nutch
> > > or pages modified since a fixed point in the past.
> > >
> > > I'm using Nutch 2.1 with MySQL as the back-end and it seems like the
> > > appropriate back-end query should be something like:
> > >
> > > select id from webpage where (prevFetchTime is null and fetchTime >= "X")
> > > or (modifiedTime >= "X")
> > >
> > > where "X" is some point in the past.
> > >
> > > What I've found is that modifiedTime is always null. I am using the
> > > adaptive scheduler and the default md5 signature class. I've tried both
> > > re-injecting seed URLs and not re-injecting them; it seems to make no
> > > difference. modifiedTime remains null.
> > >
> > > I am most grateful for any help or advice. If my nutch-site.xml file
> > > would help I can forward it along.
> > >
> > > Thanks,
> > > jacob
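
P.S. To make the intent of Fix 1 and of the query from my original message concrete, here is a minimal standalone Java sketch. It does not use any Nutch classes: PageRow, ModifiedTimeSketch, resolveModifiedTime and newOrModifiedSince are hypothetical names invented for illustration, the STATUS_MODIFIED value is only assumed to stand in for FetchSchedule.STATUS_MODIFIED, and I am assuming a never-before-fetched page has a null prevFetchTime.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one row of the webpage table; not a Nutch class.
class PageRow {
    final String id;
    final Long prevFetchTime; // assumed null when the page was never fetched before
    final long fetchTime;
    final Long modifiedTime;  // null when it was never set

    PageRow(String id, Long prevFetchTime, long fetchTime, Long modifiedTime) {
        this.id = id;
        this.prevFetchTime = prevFetchTime;
        this.fetchTime = fetchTime;
        this.modifiedTime = modifiedTime;
    }
}

class ModifiedTimeSketch {
    // Illustrative value only; assumed to play the role of FetchSchedule.STATUS_MODIFIED.
    static final int STATUS_MODIFIED = 1;

    // The rule from Fix 1: stamp "now" when the page was detected as modified,
    // otherwise keep whatever modified time was already stored.
    static long resolveModifiedTime(int modified, long storedModifiedTime, long now) {
        return (modified == STATUS_MODIFIED) ? now : storedModifiedTime;
    }

    // In-memory version of the query: ids of pages that are new
    // (no previous fetch, fetched at or after `since`) or modified at or after `since`.
    static List<String> newOrModifiedSince(List<PageRow> rows, long since) {
        List<String> ids = new ArrayList<>();
        for (PageRow row : rows) {
            boolean isNew = row.prevFetchTime == null && row.fetchTime >= since;
            boolean isModified = row.modifiedTime != null && row.modifiedTime >= since;
            if (isNew || isModified) {
                ids.add(row.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<PageRow> rows = Arrays.asList(
            new PageRow("a", null, 200L, null),  // newly crawled
            new PageRow("b", 50L, 200L, null),   // re-fetched, modifiedTime never set
            new PageRow("c", 50L, 200L, 300L));  // re-fetched and modified
        System.out.println(newOrModifiedSince(rows, 100L)); // prints [a, c]
    }
}
```

Row "b" is the case I keep hitting: without Fix 1 its modifiedTime stays null, so a re-fetched page whose content changed never matches the second half of the query.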