[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558321#comment-13558321 ]
Tejas Patil commented on NUTCH-1042:
------------------------------------

linked with NUTCH-1284

> Fetcher.max.crawl.delay property not taken into account correctly when set to -1
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-1042
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1042
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.3
>            Reporter: Nutch User - 1
>            Assignee: Lewis John McGibbney
>             Fix For: 1.7, 2.2
>
>
> [Originally: (http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html).]
>
> From nutch-default.xml:
> "
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>30</value>
>   <description>
>   If the Crawl-Delay in robots.txt is set to greater than this value (in
>   seconds) then the fetcher will skip this page, generating an error report.
>   If set to -1 the fetcher will never skip such pages and will wait the
>   amount of time retrieved from robots.txt Crawl-Delay, however long that
>   might be.
>   </description>
> </property>
> "
>
> Fetcher.java: (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup).
>
> The line 554 in Fetcher.java:
> "this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;"
>
> The lines 615-616 in Fetcher.java:
> "
> if (rules.getCrawlDelay() > 0) {
>   if (rules.getCrawlDelay() > maxCrawlDelay) {
> "
>
> Now, the documentation states that, if fetcher.max.crawl.delay is set to
> -1, the crawler will always wait the amount of time the Crawl-Delay
> parameter specifies. However, as you can see, if it really is negative
> the condition on the line 616 is always true, which leads to skipping
> the page whose Crawl-Delay is set.
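
The reported behaviour can be reproduced outside Nutch with a small standalone sketch. The class and method names below (CrawlDelayCheck, toMillis, skipsAsReported, skipsWithGuard) are hypothetical and are not part of Fetcher.java. The first check mirrors the two nested conditions quoted from lines 615-616. The second adds one possible guard that treats a negative limit as "no limit"; it is shown only as an illustration and is not necessarily the change that was committed for this issue.

```java
// Standalone illustration of the comparison described in NUTCH-1042.
// All names here are hypothetical; nothing below is taken from Nutch itself
// except the shape of the two conditions quoted in the report.
class CrawlDelayCheck {

    // Mirrors "conf.getInt("fetcher.max.crawl.delay", 30) * 1000":
    // a configured value of -1 seconds becomes -1000 milliseconds.
    static long toMillis(int seconds) {
        return seconds * 1000L;
    }

    // The check as quoted in the report: skip when the robots.txt
    // Crawl-Delay is positive and larger than the configured maximum.
    static boolean skipsAsReported(long crawlDelayMs, long maxCrawlDelayMs) {
        return crawlDelayMs > 0 && crawlDelayMs > maxCrawlDelayMs;
    }

    // One possible guard (an assumption, not the committed fix):
    // a negative maximum disables the limit, as the property text describes.
    static boolean skipsWithGuard(long crawlDelayMs, long maxCrawlDelayMs) {
        return crawlDelayMs > 0
                && maxCrawlDelayMs >= 0
                && crawlDelayMs > maxCrawlDelayMs;
    }

    public static void main(String[] args) {
        long unlimited = toMillis(-1);   // -1000 ms
        long crawlDelay = 5000;          // Crawl-Delay: 5 from robots.txt, in ms

        // Any positive delay exceeds -1000, so the page is skipped.
        System.out.println(skipsAsReported(crawlDelay, unlimited)); // true

        // With the guard, a negative maximum never causes a skip.
        System.out.println(skipsWithGuard(crawlDelay, unlimited));  // false
    }
}
```

For a non-negative maximum both checks agree: with the default of 30 seconds, a 5-second Crawl-Delay is fetched and a 60-second one is skipped in either version.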