Hi Danicela,

I can confirm that I can recreate this behaviour.

My example:

The robots.txt for the following domain
http://www.heraldscotland.com/robots.txt
has a Crawl-delay of 10ms.
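
For anyone following along, the directive in question takes the general
form below in a robots.txt (I'm showing the general shape of the rule
rather than quoting the site's file verbatim, which may have changed
since):

User-agent: *
Crawl-delay: 10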

With fetcher.verbose set to true and fetcher.max.crawl.delay set to -1 in
nutch-site.xml, my logs read:
2012-02-17 11:44:58,079 INFO  fetcher.Fetcher - fetching
http://www.heraldscotland.com/
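
For reference, those two overrides sit in my nutch-site.xml roughly like
this (just the relevant properties, trimmed from my full config):

<configuration>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>-1</value>
  </property>
</configuration>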
Once fetching of the segment finished, I dumped the segment fetch data:

lewis@lewis-01:~/ASF/trunk-test/runtime/local$ bin/nutch readseg -dump
segments/20120217115205 output -nocontent -nogenerate -noparse -noparsedata
-noparsetext
SegmentReader: dump segment: segments/20120217115205
SegmentReader: done

which looks like the following (I also added the Nutch site, just to
confirm that something is being fetched):

Recno:: 0
URL:: http://nutch.apache.org/

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Fri Feb 17 11:52:22 GMT 2012
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1329479517277Content-Type: text/html_pst_: success(1),
lastModified=0


Recno:: 1
URL:: http://www.heraldscotland.com/

CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Fri Feb 17 11:52:21 GMT 2012
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1329479517277_pst_: robots_denied(18), lastModified=0

So I deleted my crawldb and segments dir and tried again with
fetcher.max.crawl.delay set to 100. This time the page was fetched, I went
for a cup of tea, and everything was fine.
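
The only configuration change between the two runs is that one property,
which in nutch-site.xml looks something like this:

  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>100</value>
  </property>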

Off the top of my head, I think it would be really neat if we could grab
the crawl-delay value and display it next to the fetcher log output,
something like:
2012-02-17 11:44:58,079 INFO  fetcher.Fetcher - fetching
http://www.heraldscotland.com/ (crawl.delay=10ms)
or something along those lines... wdyt?
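
To make that concrete, I'm picturing something roughly like the following
in the fetcher thread. This is only a sketch of the idea, not a patch
against trunk: I'm assuming the RobotRules object for the URL is available
at the point where we log "fetching", that getCrawlDelay() reports
milliseconds, and the fit/protocol/LOG names are just the ones Fetcher
uses internally.

// Sketch: report the robots.txt crawl delay next to the "fetching" log line.
RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
long crawlDelay = rules.getCrawlDelay(); // delay parsed from robots.txt, -1 if none set
if (crawlDelay > 0) {
  LOG.info("fetching " + fit.url + " (crawl.delay=" + crawlDelay + "ms)");
} else {
  LOG.info("fetching " + fit.url);
}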

But unless there is something else at play here, it appears there is a
small problem with this property.

Lewis


On Thu, Feb 16, 2012 at 9:38 AM, Danicela nutch <[email protected]> wrote:

> I have sites with a crawl delay of 20, others at 720, but in both cases
> it should have fetched some pages, and it couldn't.
>
> ----- Original message -----
> From: Lewis John Mcgibbney
> Sent: 15.02.12 23:11
> To: [email protected]
> Subject: Re: Re: Re: fetcher.max.crawl.delay = -1 doesn't work?
>
> Another question I should have asked: how long is the crawl delay in
> robots.txt? If you read the fetcher.max.crawl.delay property description,
> it explicitly notes that the fetcher will wait however long robots.txt
> requires before it fetches the page. Do you have this information? Thanks
> On Wed, Feb 15, 2012 at 9:08 AM, Danicela nutch <[email protected]> wrote:
>
> > I don't think I configured such things, how can I be sure ?
> >
> > ----- Original message -----
> > From: Lewis John Mcgibbney
> > Sent: 14.02.12 19:18
> > To: [email protected]
> > Subject: Re: fetcher.max.crawl.delay = -1 doesn't work?
> >
> > Hi Danicela,
> >
> > Before I try this, have you configured any other overrides for
> > generating or fetching in nutch-site.xml? Thanks
> >
> > On Tue, Feb 14, 2012 at 3:10 PM, Danicela nutch
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I have in my nutch-site.xml the value fetcher.max.crawl.delay = -1.
> > >
> > > When I try to fetch a site with a robots.txt with a Crawl-Delay, it
> > > doesn't work.
> > >
> > > If I put fetcher.max.crawl.delay = 10000, it works.
> > >
> > > I use Nutch 1.2, but according to the changelog, nothing has been
> > > changed about that since then.
> > >
> > > Is this a Nutch bug or did I misuse something?
> > >
> > > Another thing: in hadoop.log, the pages which couldn't be fetched are
> > > still marked as "fetching". Is this normal? Shouldn't they be marked
> > > as "dropped" or something?
> > >
> > > Thanks.
> >
> > --
> > *Lewis*
>
> --
> *Lewis*
>



-- 
*Lewis*
