RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-14 Thread j.sullivan
Markus, I was mistakenly thinking of a doc field with a similar name. Thanks 
for pointing that out.



Re: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Jacob Sisk
Hi folks,

Thanks for all of your suggestions.  Here are two tentative fixes suggested
by my colleagues at work:

Fix 1:
Within Nutch itself, in org.apache.nutch.crawl.DbUpdateReducer, change
line 129 to:

long modifiedTime = (modified == FetchSchedule.STATUS_MODIFIED) ?
System.currentTimeMillis() : page.getModifiedTime();

Fix (or really workaround) 2:
Alter the webpage table in MySQL to contain a column

update_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP

Define a trigger as follows:

DELIMITER //
CREATE TRIGGER updtrigger BEFORE UPDATE ON webpage
FOR EACH ROW
BEGIN
  -- Refresh the timestamp only when the content signature changes.
  IF NEW.signature <> OLD.signature THEN
    SET NEW.update_ts = NOW();
  END IF;
END
//
DELIMITER ;


I think the first is the lesser of two evils, and it seems to work,
but I don't know enough about Nutch to tell whether this abuses the
semantics of the modifiedTime field.  I'd love your $0.02.

Thanks,
jacob








RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Markus Jelsma
In trunk the modified time is based on whether or not the signature has
changed. It makes little sense to rely on HTTP headers, because almost no CMS
implements them correctly, and doing so interferes with an adaptive schedule
(or allows it to be manipulated on purpose).

https://issues.apache.org/jira/browse/NUTCH-1341
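
The signature-based rule described above can be sketched in plain Java. This
is only an illustration, not Nutch code: the class ModifiedTimeSketch and both
of its methods are hypothetical names, MD5 over the raw bytes stands in for
whichever signature class is configured, and treating a page with no previous
signature as modified is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, not part of Nutch.
public class ModifiedTimeSketch {

    // MD5 digest of the raw content; a stand-in for the configured signature class.
    public static byte[] signature(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keep the previous modified time unless the signature differs.
    // Assumption: a page with no previous signature counts as modified.
    public static long nextModifiedTime(byte[] oldSig, byte[] newSig,
                                        long prevModifiedTime, long fetchTime) {
        boolean changed = oldSig == null || !Arrays.equals(oldSig, newSig);
        return changed ? fetchTime : prevModifiedTime;
    }

    public static void main(String[] args) {
        byte[] v1 = signature("hello".getBytes(StandardCharsets.UTF_8));
        byte[] v2 = signature("hello, world".getBytes(StandardCharsets.UTF_8));
        System.out.println(nextModifiedTime(v1, v1, 1000L, 2000L)); // unchanged content
        System.out.println(nextModifiedTime(v1, v2, 1000L, 2000L)); // changed content
    }
}
```

With this rule the HTTP Last-Modified header plays no part: only a change in
the computed signature moves the modified time forward.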
 
 


RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread j.sullivan
I think modifiedTime comes from the HTTP headers if available; if not, it is
left empty.  In other words, it is the time the content was last modified
according to the source, and it is left blank when the source does not supply
one.  Depending on what Jacob is trying to achieve, the one-line patch at
https://issues.apache.org/jira/browse/NUTCH-1475 might be what he needs (or
might not be).

James



Re: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Ferdy Galema
Hi,

There might be something wrong with the field modifiedTime. I'm not sure
how well you can rely on this field (with the default or the adaptive
scheduler).

If you want to get to the bottom of this, I suggest debugging or running
small crawls to test the behaviour. In case something doesn't work as
expected, please repost here or open a Jira.

Ferdy.

On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk  wrote:

> Hi,
>
> If this question has already been answered please forgive me and point me
> to the appropriate thread.
>
> I'd like to be able to find the ids of all new pages crawled by nutch or
> pages modified since a fixed point in the past.
>
> I'm using Nutch 2.1 with MySQL as the back-end and it seems like the
> appropriate back-end query should be something like:
>
>  "select id from webpage where (prevFetchTime is null and fetchTime >= "X")
> or (modifiedTime >= "X")"
>
> where "X" is some point in the past.
>
> What I've found is that modifiedTime is always null.  I am using the
> adaptive scheduler and the default MD5 signature class.  I've tried both
> re-injecting seed URLs and not re-injecting them; it makes no difference:
> modifiedTime remains null.
>
> I am most grateful for any help or advice.  If my nutch-site.xml file would
> help I can forward it along.
>
> Thanks,
> jacob
>
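
For reference, the query from the question can be sketched as a parameterized
statement in Java. SQL tests for null with IS NULL and joins conditions with
AND, so the sketch uses those rather than "=null" and "&". The class
ChangedPagesQuery and its members are hypothetical names, the column names are
taken from the question as written, and the sketch assumes the time columns
hold epoch milliseconds, which should be checked against the actual table
definition.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper, not part of Nutch.
public class ChangedPagesQuery {

    // Both placeholders take the same cutoff value.
    // Column names come from the question; verify them against your schema.
    public static final String SQL =
        "SELECT id FROM webpage"
        + " WHERE (prevFetchTime IS NULL AND fetchTime >= ?)"
        + " OR (modifiedTime >= ?)";

    // Midnight UTC of an ISO date (yyyy-MM-dd) as epoch milliseconds.
    // Assumption: the time columns store epoch milliseconds.
    public static long cutoffMillis(String isoDate) {
        return LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(SQL);
        System.out.println(cutoffMillis("2012-11-01"));
    }
}
```

To run it against the database, prepare SQL with a JDBC PreparedStatement and
call setLong(1, cutoff) and setLong(2, cutoff) before executing. The second
branch only returns rows once modifiedTime is actually populated, which is the
problem discussed in this thread.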