[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher

2014-06-27 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045721#comment-14045721
 ] 

Hudson commented on NUTCH-385:
--

SUCCESS: Integrated in Nutch-nutchgora #1064 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1064/])
NUTCH-385 Improve description of thread related configuration for Fetcher 
(jnioche: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1605979)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/nutch-default.xml


> Improve description of thread related configuration for Fetcher
> ---
>
> Key: NUTCH-385
> URL: https://issues.apache.org/jira/browse/NUTCH-385
> Project: Nutch
> Issue Type: Bug
> Components: documentation, fetcher
> Reporter: Chris Schneider
> Assignee: Julien Nioche
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-385.patch
>
>
> For some time I've been puzzled by the interaction between two parameters that 
> control how often the fetcher can access a particular host:
> 1) The server delay, which comes back from the remote server during our 
> processing of the robots.txt file, and which can be limited by 
> fetcher.max.crawl.delay.
> 2) The fetcher.threads.per.host value, particularly when this is greater than 
> the default of 1.
> According to my (limited) understanding of the code in HttpBase.java:
> Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
> ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
> continuously. In other words, it never tries to point 3 at the host, and it 
> always points a second thread at the host before the first thread finishes 
> accessing it. Since HttpBase.unblockAddr never gets called with 
> (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
> host. Thus, the server delay will never be used at all. The fetcher will be 
> continuously retrieving pages from the host, often with 2 fetchers accessing 
> the host simultaneously.
> Suppose instead that the fetcher finally does allow the last thread to 
> complete before it gets around to pointing another thread at the target host. 
> When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
> host. This, in turn, will prevent any threads from accessing this host until 
> the delay is complete, even though zero threads are currently accessing the 
> host.
> I see this behavior as inconsistent. More importantly, the current 
> implementation certainly doesn't seem to answer my original question about 
> appropriate definitions for what appear to be conflicting parameters. 
> In a nutshell, how could we possibly honor the server delay if we allow more 
> than one fetcher thread to simultaneously access the host?
> It would be one thing if whenever (fetcher.threads.per.host > 1), this 
> trumped the server delay, causing the latter to be ignored completely. That 
> is certainly not the case in the current implementation, as it will wait for 
> server delay whenever the number of threads accessing a given host drops to 
> zero.
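
The two scenarios in the quoted description can be reproduced with a small
single-threaded model. This is only a sketch of the bookkeeping as the reporter
describes it, not the actual HttpBase code: the class name HostBlockingSketch,
the method tryBlockAddr, the explicit nowMs clock parameter, the host
example.org and the 5000 ms delay are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the per-host bookkeeping described in the issue.
// It is NOT the real HttpBase code; names and the explicit clock are illustrative.
class HostBlockingSketch {
    private final Map<String, Integer> threadsPerHostCount = new HashMap<>();
    private final Map<String, Long> blockedAddrToTime = new HashMap<>();
    private final int maxThreadsPerHost;
    private final long crawlDelayMs;

    HostBlockingSketch(int maxThreadsPerHost, long crawlDelayMs) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelayMs = crawlDelayMs;
    }

    // Returns true if a thread may start fetching from the host at time nowMs.
    boolean tryBlockAddr(String host, long nowMs) {
        Long blockedUntil = blockedAddrToTime.get(host);
        if (blockedUntil != null) {
            if (nowMs < blockedUntil) {
                return false; // the recorded delay has not expired yet
            }
            blockedAddrToTime.remove(host);
        }
        int active = threadsPerHostCount.getOrDefault(host, 0);
        if (active >= maxThreadsPerHost) {
            return false; // per-host thread limit reached
        }
        threadsPerHostCount.put(host, active + 1);
        return true;
    }

    // Called when a thread finishes with the host at time nowMs.
    void unblockAddr(String host, long nowMs) {
        int active = threadsPerHostCount.getOrDefault(host, 0);
        if (active == 0) {
            return; // nothing to release
        }
        if (active == 1) {
            // Only the last thread leaving the host records the delay.
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, nowMs + crawlDelayMs);
        } else {
            threadsPerHostCount.put(host, active - 1);
        }
    }

    public static void main(String[] args) {
        // Scenario 1: threads overlap, so the count never drops from 1 to 0.
        HostBlockingSketch overlapping = new HostBlockingSketch(2, 5000L);
        overlapping.tryBlockAddr("example.org", 0L);   // thread A starts
        overlapping.tryBlockAddr("example.org", 100L); // thread B starts
        overlapping.unblockAddr("example.org", 200L);  // A finishes, B still active
        System.out.println(overlapping.tryBlockAddr("example.org", 250L)); // true: no delay applied

        // Scenario 2: the last thread finishes before another one starts.
        HostBlockingSketch single = new HostBlockingSketch(2, 5000L);
        single.tryBlockAddr("example.org", 0L);
        single.unblockAddr("example.org", 200L);       // delay recorded until 5200
        System.out.println(single.tryBlockAddr("example.org", 300L));      // false: host is delayed
    }
}
```

In this model the delay is enforced only when the active count for a host
reaches zero, which is the inconsistency the reporter points out.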



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher

2014-06-27 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045678#comment-14045678
 ] 

Hudson commented on NUTCH-385:
--

FAILURE: Integrated in Nutch-trunk #2676 (See 
[https://builds.apache.org/job/Nutch-trunk/2676/])
NUTCH-385 Improve description of thread related configuration for Fetcher 
(jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1605978)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml






[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher

2014-06-26 Thread lufeng (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045525#comment-14045525
 ] 

lufeng commented on NUTCH-385:
--

Hi Julien

In the description of "fetcher.threads.per.queue", we could add that setting 
"fetcher.threads.per.queue" to a value > 1 will also cause "fetcher.server.delay" 
to be ignored. 

Another issue: I think the property name "fetcher.max.crawl.delay" is not 
consistent with "fetcher.server.delay" and "fetcher.server.min.delay". Would 
renaming it to "fetcher.server.max.delay" be more suitable?
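
The first point can be written as a one-line rule. The sketch below assumes
that, when more than one thread per queue is allowed, the fetcher falls back to
"fetcher.server.min.delay" instead of "fetcher.server.delay"; the comment above
does not state this, and the class name, method name and delay values are
hypothetical.

```java
// Hypothetical helper, not part of Nutch: picks the delay between two
// successive requests to the same queue under the assumption stated above.
final class QueueDelaySketch {
    private QueueDelaySketch() {
    }

    static long effectiveDelayMs(int threadsPerQueue, long serverDelayMs, long minDelayMs) {
        // With a single thread per queue the server delay applies;
        // with more than one it is ignored (assumed fallback: the minimum delay).
        return threadsPerQueue > 1 ? minDelayMs : serverDelayMs;
    }

    public static void main(String[] args) {
        // Illustrative values only: 5000 ms server delay, 0 ms minimum delay.
        System.out.println(effectiveDelayMs(1, 5000L, 0L)); // 5000
        System.out.println(effectiveDelayMs(2, 5000L, 0L)); // 0
    }
}
```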


> Improve description of thread related configuration for Fetcher
> ---
>
> Key: NUTCH-385
> URL: https://issues.apache.org/jira/browse/NUTCH-385
> Project: Nutch
> Issue Type: Bug
> Components: documentation, fetcher
> Reporter: Chris Schneider
> Assignee: Julien Nioche
> Fix For: 1.9
>
> Attachments: NUTCH-385.patch



