[ 
https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538725#comment-13538725
 ] 

Tejas Patil edited comment on NUTCH-1284 at 12/22/12 10:54 AM:
---------------------------------------------------------------

I searched for the relevant mail thread [0] to get an idea of why this bug was 
created. 
Quick recap of the issue: 
Although fetcher.max.crawl.delay was set to -1, Nutch was still marking the URL 
as ROBOTS_DENIED. With fetcher.max.crawl.delay = -1, the expected behavior is 
to wait for the amount of time given by the robots.txt Crawl-Delay, however 
long that might be.

Lewis was able to reproduce the issue. He suggested the change mentioned in the 
bug and hinted that there might be a problem with how that property was handled.

An additional condition was needed to prevent URLs from being marked DB_GONE 
when fetcher.max.crawl.delay = -1 (i.e. maxCrawlDelay = -1000 internally, in 
milliseconds). After this change, I tested the scenario mentioned in [0] and it 
worked fine.

[0]: 
http://lucene.472066.n3.nabble.com/Re-Re-Re-Re-fetcher-max-crawl-delay-1-doesn-t-work-tc3749639.html
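The extra condition described above can be sketched as follows. This is a 
hypothetical illustration, not the actual Nutch patch: the method name 
robotsDenied and the standalone class are my own, and it assumes the cap is 
already converted to milliseconds (so fetcher.max.crawl.delay = -1 becomes 
-1000).

```java
public class CrawlDelayCheck {

    /**
     * Returns true when the URL should be denied because the robots.txt
     * Crawl-Delay exceeds the configured cap. A negative cap (e.g. -1000,
     * from fetcher.max.crawl.delay = -1) means "no limit": the fetcher
     * should honor the Crawl-Delay instead of marking the URL DB_GONE.
     */
    static boolean robotsDenied(long crawlDelayMs, long maxCrawlDelayMs) {
        return maxCrawlDelayMs >= 0 && crawlDelayMs > maxCrawlDelayMs;
    }

    public static void main(String[] args) {
        // Cap of 30s, Crawl-Delay of 60s: denied.
        System.out.println(robotsDenied(60_000, 30_000)); // true
        // Cap disabled (-1 -> -1000): never denied, wait instead.
        System.out.println(robotsDenied(60_000, -1_000)); // false
    }
}
```

The key point is the `maxCrawlDelayMs >= 0` guard: without it, any positive 
Crawl-Delay is greater than -1000, so every such URL would wrongly be denied.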
                
> Add site fetcher.max.crawl.delay as log output by default.
> ----------------------------------------------------------
>
>                 Key: NUTCH-1284
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1284
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>             Fix For: 1.7
>
>         Attachments: NUTCH-1284.patch
>
>
> Currently, when manually scanning our log output we cannot infer which pages 
> are governed by a crawl delay between successive fetch attempts of any given 
> page within the site. The value should be made available as something like:
> {code}
> 2012-02-19 12:33:33,031 INFO  fetcher.Fetcher - fetching 
> http://nutch.apache.org/ (crawl.delay=XXXms)
> {code}
> This way we can easily and quickly determine whether the fetcher is having to 
> use this functionality or not. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
