[jira] [Commented] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-06-26 Thread Chris Schneider (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044743#comment-14044743
 ] 

Chris Schneider commented on NUTCH-385:
---

Hi Julien,

Thanks for the documentation changes and for investing your time in an issue I 
raised so long ago. Unfortunately (since I haven't used Nutch in the past 5 
years), it would be difficult for me to validate that your description of the 
fetcher behavior is correct and sufficient. I would recommend that you ask 
Andrzej (or perhaps Doug) to review them instead.

Best Regards,

Chris

 Server delay feature conflicts with maxThreadsPerHost
 -

 Key: NUTCH-385
 URL: https://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
 Issue Type: Bug
 Components: documentation, fetcher
 Reporter: Chris Schneider
 Assignee: Julien Nioche
 Attachments: NUTCH-385.patch


 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
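
 The overlapping-thread scenario above can be replayed with a small model. This is only a sketch under stated assumptions, not the actual HttpBase code: the class name HostBlockingSketch, the helper delayEverRecorded, the host name, and the 5000 ms delay are hypothetical, the maps are simplified stand-ins for THREADS_PER_HOST_COUNT and BLOCKED_ADDR_TO_TIME, and the clock is passed in explicitly so the run is deterministic. The per-host thread limit is left out because the scenario never exceeds it.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of the per-host bookkeeping described above.
class HostBlockingSketch {
    // Number of threads currently fetching from each host.
    static final Map<String, Integer> threadsPerHostCount = new HashMap<>();
    // Earliest time (ms) at which each host may be accessed again.
    static final Map<String, Long> blockedAddrToTime = new HashMap<>();

    // A thread starts fetching from the host.
    static void blockAddr(String host) {
        threadsPerHostCount.merge(host, 1, Integer::sum);
    }

    // A thread finishes. The crawl delay is recorded only if it was the last one.
    static void unblockAddr(String host, long nowMillis, long crawlDelayMillis) {
        int count = threadsPerHostCount.get(host);
        if (count == 1) {
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, nowMillis + crawlDelayMillis);
        } else {
            threadsPerHostCount.put(host, count - 1);
        }
    }

    // Replays the pattern where a second thread always starts before the
    // first one finishes, so unblockAddr only ever sees a count of 2.
    static boolean delayEverRecorded(int rounds) {
        threadsPerHostCount.clear();
        blockedAddrToTime.clear();
        String host = "example.com";
        blockAddr(host);                      // first thread arrives
        long now = 0;
        for (int i = 0; i < rounds; i++) {
            blockAddr(host);                  // second thread arrives (count 2)
            unblockAddr(host, now, 5000);     // older thread leaves (count 1)
            now += 100;
        }
        return blockedAddrToTime.containsKey(host);
    }

    public static void main(String[] args) {
        // prints "delay recorded: false"
        System.out.println("delay recorded: " + delayEverRecorded(1000));
    }
}
```

 However many rounds are replayed, the model never stores a delay for the host, which matches the observation that the server delay is never applied while the threads keep overlapping.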
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
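
 The second scenario, where the last thread leaves and the host is then blocked although nobody is accessing it, can be sketched the same way. Again this is an illustrative model rather than the real implementation: LastThreadDelaySketch, activeThreads, and canAccess are hypothetical names, the timestamps are made up, and the check against the per-host thread limit is omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: what happens once the last thread leaves a host.
class LastThreadDelaySketch {
    private final Map<String, Integer> threadsPerHostCount = new HashMap<>();
    private final Map<String, Long> blockedAddrToTime = new HashMap<>();

    // Threads currently fetching from the host.
    int activeThreads(String host) {
        return threadsPerHostCount.getOrDefault(host, 0);
    }

    // True if a new thread may start on the host at time nowMillis.
    boolean canAccess(String host, long nowMillis) {
        Long blockedUntil = blockedAddrToTime.get(host);
        return blockedUntil == null || nowMillis >= blockedUntil;
    }

    // A thread starts fetching from the host.
    void blockAddr(String host) {
        threadsPerHostCount.merge(host, 1, Integer::sum);
    }

    // A thread finishes. Only the last thread records now + crawlDelay.
    void unblockAddr(String host, long nowMillis, long crawlDelayMillis) {
        int count = threadsPerHostCount.get(host);
        if (count == 1) {
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, nowMillis + crawlDelayMillis);
        } else {
            threadsPerHostCount.put(host, count - 1);
        }
    }

    public static void main(String[] args) {
        LastThreadDelaySketch sketch = new LastThreadDelaySketch();
        String host = "example.com";
        sketch.blockAddr(host);
        sketch.unblockAddr(host, 1_000L, 5_000L);            // last thread leaves at t = 1000 ms
        System.out.println(sketch.activeThreads(host));      // prints 0
        System.out.println(sketch.canAccess(host, 2_000L));  // prints false
        System.out.println(sketch.canAccess(host, 6_000L));  // prints true
    }
}
```

 In this model the host has zero active threads at t = 2000 ms yet is still closed to new threads until t = 6000 ms, which is the asymmetry with the first scenario.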
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 server delay whenever the number of threads accessing a given host drops to 
 zero.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-04-07 Thread Chris Schneider (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962291#comment-13962291
 ] 

Chris Schneider commented on NUTCH-385:
---

Hi Julien,

I would imagine that Andrzej would be the best person to document how this 
works currently, as the rest of us are having trouble understanding his 
original intent. Doing so in the WIKI is a good idea, but I would also suggest 
that the documentation of these individual parameters within the Hadoop 
configuration files be updated and extended.

- Chris



[jira] [Commented] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-04-06 Thread Chris Schneider (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961562#comment-13961562
 ] 

Chris Schneider commented on NUTCH-385:
---

Hi Julien,

Actually, I believe the original bug report made two basic requests for 
improvement:

1) The behavior of these two configuration parameters should be changed to make 
them more consistent with one another.

2) The behavior of these two configuration parameters should be clearly 
documented in the configuration file, including any interactions between them 
(such as who trumps whom).

Since then, Andrzej has attempted to justify the current behavior, though there 
seem to be other opinions on how it really ought to work. Even if we decide not 
to change the current implementation, I think it certainly deserves better 
documentation.

Chris
