[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-385:
--------------------------------

    Summary: Improve description of thread related configuration for Fetcher  
(was: Server delay feature conflicts with maxThreadsPerHost)

> Improve description of thread related configuration for Fetcher
> ---------------------------------------------------------------
>
>                 Key: NUTCH-385
>                 URL: https://issues.apache.org/jira/browse/NUTCH-385
>             Project: Nutch
>          Issue Type: Bug
>          Components: documentation, fetcher
>            Reporter: Chris Schneider
>            Assignee: Julien Nioche
>             Fix For: 1.9
>
>         Attachments: NUTCH-385.patch
>
>
> For some time I've been puzzled by the interaction between two parameters that 
> control how often the fetcher can access a particular host:
> 1) The server delay, which comes back from the remote server during our 
> processing of the robots.txt file, and which can be limited by 
> fetcher.max.crawl.delay.
> 2) The fetcher.threads.per.host value, particularly when this is greater than 
> the default of 1.
> According to my (limited) understanding of the code in HttpBase.java:
> Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
> ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
> continuously. In other words, it never tries to point 3 at the host, and it 
> always points a second thread at the host before the first thread finishes 
> accessing it. Since HttpBase.unblockAddr never gets called with 
> (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
> host. Thus, the server delay will never be used at all. The fetcher will be 
> continuously retrieving pages from the host, often with 2 fetchers accessing 
> the host simultaneously.
> Suppose instead that the fetcher finally does allow the last thread to 
> complete before it gets around to pointing another thread at the target host. 
> When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
> host. This, in turn, will prevent any threads from accessing this host until 
> the delay is complete, even though zero threads are currently accessing the 
> host.
> I see this behavior as inconsistent. More importantly, the current 
> implementation certainly doesn't seem to answer my original question about 
> appropriate definitions for what appear to be conflicting parameters. 
> In a nutshell, how could we possibly honor the server delay if we allow more 
> than one fetcher thread to simultaneously access the host?
> It would be one thing if whenever (fetcher.threads.per.host > 1), this 
> trumped the server delay, causing the latter to be ignored completely. That 
> is certainly not the case in the current implementation, as it will wait for 
> server delay whenever the number of threads accessing a given host drops to 
> zero.
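
The two scenarios in the description can be reproduced with a small model. What follows is only a sketch built from the report's reading of HttpBase.java, not the Nutch source: the class HostThrottleSketch, its methods tryBlock, unblock and isDelayed, and the explicit nowMs clock argument are all hypothetical names introduced for illustration. It assumes the crawl delay is recorded only when the last active thread for a host releases it.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical, simplified model of the per-host bookkeeping described in
 * the report. It is NOT the Nutch HttpBase code; names and signatures are
 * assumptions made for illustration only.
 */
class HostThrottleSketch {
    // Number of threads currently fetching from each host.
    private final Map<String, Integer> threadsPerHostCount = new HashMap<>();
    // Earliest time (ms) at which a host may be accessed again.
    private final Map<String, Long> blockedAddrToTime = new HashMap<>();
    private final int maxThreadsPerHost;
    private final long crawlDelayMs;

    HostThrottleSketch(int maxThreadsPerHost, long crawlDelayMs) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Returns true if one more thread may start fetching from host at nowMs. */
    synchronized boolean tryBlock(String host, long nowMs) {
        Long unblockAt = blockedAddrToTime.get(host);
        if (unblockAt != null) {
            if (nowMs < unblockAt) {
                return false; // still inside the crawl delay
            }
            blockedAddrToTime.remove(host); // delay has expired
        }
        int active = threadsPerHostCount.getOrDefault(host, 0);
        if (active >= maxThreadsPerHost) {
            return false; // per-host thread limit reached
        }
        threadsPerHostCount.put(host, active + 1);
        return true;
    }

    /** Releases one thread; only the last thread to leave starts the delay. */
    synchronized void unblock(String host, long nowMs) {
        int active = threadsPerHostCount.getOrDefault(host, 0);
        if (active == 0) {
            return; // nothing to release
        }
        if (active == 1) {
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, nowMs + crawlDelayMs);
        } else {
            // Other threads are still on the host: no delay is recorded.
            threadsPerHostCount.put(host, active - 1);
        }
    }

    /** Returns true if host is currently inside a recorded crawl delay. */
    synchronized boolean isDelayed(String host, long nowMs) {
        Long unblockAt = blockedAddrToTime.get(host);
        return unblockAt != null && nowMs < unblockAt;
    }
}
```

With a limit of 2 threads per host, this model never records a delay while the threads keep overlapping. As soon as the count drops to zero, the host is blocked for the full delay even though nothing is accessing it. That is the inconsistency the report describes.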



--
This message was sent by Atlassian JIRA
(v6.2#6252)
