RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1
Markus,

I was mistakenly thinking of a doc field with a similar name. Thanks for pointing that out.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, November 13, 2012 7:24 PM
To: user@nutch.apache.org
Subject: RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

In trunk the modified time is based on whether or not the signature has changed. It makes little sense to rely on HTTP headers because almost no CMS implements them correctly, and doing so interferes with an adaptive schedule (or allows it to be manipulated on purpose).

https://issues.apache.org/jira/browse/NUTCH-1341
Re: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1
Hi folks,

Thanks for all of your suggestions. Here are two tentative fixes suggested by my colleagues at work:

Fix 1: Within Nutch itself, in org.apache.nutch.crawl.DbUpdateReducer, change line 129 to:

    long modifiedTime = (modified == FetchSchedule.STATUS_MODIFIED)
        ? System.currentTimeMillis()
        : page.getModifiedTime();

Fix (or really workaround) 2: Alter the webpage table in MySQL to add a column:

    ALTER TABLE webpage
      ADD COLUMN update_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP;

Then define a trigger as follows:

    DELIMITER //
    CREATE TRIGGER updtrigger BEFORE UPDATE ON webpage
    FOR EACH ROW
    BEGIN
      -- <=> is the NULL-safe comparison, so a first signature
      -- (NULL -> value) also counts as a change
      IF NOT (NEW.signature <=> OLD.signature) THEN
        SET NEW.update_ts = NOW();
      END IF;
    END //
    DELIMITER ;

I think the first is the lesser of two evils, and it seems to work, but I don't know enough about Nutch to tell whether this abuses the semantics of the modifiedTime field. I'd love your $0.02.

Thanks,
jacob

On Tue, Nov 13, 2012 at 5:24 AM, Markus Jelsma wrote:

> In trunk the modified time is based on whether or not the signature has
> changed. It makes little sense to rely on HTTP headers because almost no
> CMS implements them correctly, and doing so interferes with an adaptive
> schedule (or allows it to be manipulated on purpose).
>
> https://issues.apache.org/jira/browse/NUTCH-1341
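The rule in Fix 1 can be tried out in isolation. The following is only a minimal sketch, not the Nutch reducer itself: the class name, the method name, and the numeric status values are stand-ins invented for illustration, and the current time is passed in as a parameter so the behaviour is easy to check.

```java
/** Illustrative sketch of the rule in Fix 1; this is not Nutch code. */
final class ModifiedTimeSketch {
    // Stand-ins for the FetchSchedule status constants; the values are assumptions.
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    /**
     * Picks the modified time to store for a page.
     *
     * @param modified             status reported for this fetch
     * @param previousModifiedTime value already stored on the page
     * @param nowMillis            current time in epoch milliseconds
     */
    static long resolveModifiedTime(int modified, long previousModifiedTime, long nowMillis) {
        // Only a fetch flagged as modified moves the timestamp forward;
        // otherwise the stored value is carried over unchanged.
        return (modified == STATUS_MODIFIED) ? nowMillis : previousModifiedTime;
    }
}
```

With this rule modifiedTime records when the crawler noticed a change, not when the source says the content changed, which is the semantic question raised above.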
RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1
In trunk the modified time is based on whether or not the signature has changed. It makes little sense to rely on HTTP headers because almost no CMS implements them correctly, and doing so interferes with an adaptive schedule (or allows it to be manipulated on purpose).

https://issues.apache.org/jira/browse/NUTCH-1341

-----Original message-----
> From: j.sulli...@thomsonreuters.com
> Sent: Tue 13-Nov-2012 11:13
> To: user@nutch.apache.org
> Subject: RE: How to find ids of pages that have been newly crawled or
> modified after a given date with Nutch 2.1
>
> I think modifiedTime comes from the HTTP headers if available; if not, it
> is left empty. In other words, it is the time the content was last modified
> according to the source, and it is left blank when the source does not
> provide one. Depending on what Jacob is trying to achieve, the one-line
> patch at https://issues.apache.org/jira/browse/NUTCH-1475 might be what he
> needs (or might not be).
>
> James
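Signature-based change detection, as described above, amounts to comparing the digest of the newly fetched content with the one stored from the previous fetch. Here is a rough sketch using the JDK's MD5 digest. The class and method names are made up for illustration, and treating a missing previous signature as "changed" is an assumption of this sketch, not necessarily what trunk does.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Illustrative sketch of signature-based change detection; this is not Nutch code. */
final class SignatureChangeSketch {

    /** Computes an MD5 digest of the page content. */
    static byte[] md5(String content) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /**
     * Reports whether the content changed between two fetches.
     * Assumption: a page without a previous signature counts as changed.
     */
    static boolean hasChanged(byte[] previousSignature, byte[] currentSignature) {
        return previousSignature == null
                || !Arrays.equals(previousSignature, currentSignature);
    }
}
```

A digest over the whole content flags any byte-level difference, so pages with rotating ads or embedded timestamps would look modified on every fetch.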
RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1
I think modifiedTime comes from the HTTP headers if available; if not, it is left empty. In other words, it is the time the content was last modified according to the source, and it is left blank when the source does not provide one. Depending on what Jacob is trying to achieve, the one-line patch at https://issues.apache.org/jira/browse/NUTCH-1475 might be what he needs (or might not be).

James

-----Original Message-----
From: Ferdy Galema [mailto:ferdy.gal...@kalooga.com]
Sent: Tuesday, November 13, 2012 6:31 PM
To: user@nutch.apache.org
Subject: Re: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

Hi,

There might be something wrong with the field modifiedTime. I'm not sure how well you can rely on this field (with the default or the adaptive scheduler).

If you want to get to the bottom of this, I suggest debugging or running small crawls to test the behaviour. In case something doesn't work as expected, please repost here or open a Jira.

Ferdy.
Re: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1
Hi,

There might be something wrong with the field modifiedTime. I'm not sure how well you can rely on this field (with the default or the adaptive scheduler).

If you want to get to the bottom of this, I suggest debugging or running small crawls to test the behaviour. In case something doesn't work as expected, please repost here or open a Jira.

Ferdy.

On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk wrote:

> Hi,
>
> If this question has already been answered please forgive me and point me
> to the appropriate thread.
>
> I'd like to be able to find the ids of all new pages crawled by Nutch or
> pages modified since a fixed point in the past.
>
> I'm using Nutch 2.1 with MySQL as the back-end and it seems like the
> appropriate back-end query should be something like:
>
>   select id from webpage where (prevFetchTime=null & fetchTime>="X") or
>   (modifiedTime >= "X")
>
> where "X" is some point in the past.
>
> What I've found is that modifiedTime is always null. I am using the
> adaptive scheduler and the default md5 signature class. I've tried both
> re-injecting seed URLs and not re-injecting them; it seems to make no
> difference. modifiedTime remains null.
>
> I am most grateful for any help or advice. If my nutch-site.xml file would
> help I can forward it along.
>
> Thanks,
> jacob
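Apart from the modifiedTime issue, the query in the original question needs two changes before MySQL will evaluate it as intended: a comparison with `= null` never matches, so the test has to be `IS NULL`, and `&` is a bitwise operator, so the conditions have to be joined with `AND`. A minimal sketch of the corrected statement as a parameterized string follows. The table and column names are taken from the question, the class and method names are hypothetical, and whether a never-fetched page really stores NULL in prevFetchTime is an assumption that should be checked against the actual table.

```java
/** Illustrative sketch of the corrected query; names other than the columns are hypothetical. */
final class ChangedPagesQuerySketch {

    /**
     * Returns SQL selecting pages first fetched, or modified, at or after a cutoff.
     * Both positional parameters are meant to be bound to the same cutoff value X.
     */
    static String changedSince() {
        return "SELECT id FROM webpage"
                + " WHERE (prevFetchTime IS NULL AND fetchTime >= ?)"
                + " OR modifiedTime >= ?";
    }
}
```

The string can be handed to a JDBC `PreparedStatement`, binding the cutoff to both parameters in whatever type the time columns use.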