[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045721#comment-14045721 ]

Hudson commented on NUTCH-385:
--

SUCCESS: Integrated in Nutch-nutchgora #1064 (See [https://builds.apache.org/job/Nutch-nutchgora/1064/])
NUTCH-385 Improve description of thread related configuration for Fetcher (jnioche: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1605979)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/nutch-default.xml

> Improve description of thread related configuration for Fetcher
> ---
>
> Key: NUTCH-385
> URL: https://issues.apache.org/jira/browse/NUTCH-385
> Project: Nutch
> Issue Type: Bug
> Components: documentation, fetcher
> Reporter: Chris Schneider
> Assignee: Julien Nioche
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-385.patch
>
>
> For some time I've been puzzled by the interaction between two parameters that
> control how often the fetcher can access a particular host:
> 1) The server delay, which comes back from the remote server during our
> processing of the robots.txt file, and which can be limited by
> fetcher.max.crawl.delay.
> 2) The fetcher.threads.per.host value, particularly when this is greater than
> the default of 1.
> According to my (limited) understanding of the code in HttpBase.java:
> Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher
> ends up keeping either 1 or 2 fetcher threads pointing at a particular host
> continuously. In other words, it never tries to point 3 at the host, and it
> always points a second thread at the host before the first thread finishes
> accessing it. Since HttpBase.unblockAddr never gets called with
> (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the
> host. Thus, the server delay will never be used at all. The fetcher will be
> continuously retrieving pages from the host, often with 2 fetchers accessing
> the host simultaneously.
> Suppose instead that the fetcher finally does allow the last thread to
> complete before it gets around to pointing another thread at the target host.
> When the last fetcher thread calls HttpBase.unblockAddr, it will now put
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the
> host. This, in turn, will prevent any threads from accessing this host until
> the delay is complete, even though zero threads are currently accessing the
> host.
> I see this behavior as inconsistent. More importantly, the current
> implementation certainly doesn't seem to answer my original question about
> appropriate definitions for what appear to be conflicting parameters.
> In a nutshell, how could we possibly honor the server delay if we allow more
> than one fetcher thread to simultaneously access the host?
> It would be one thing if whenever (fetcher.threads.per.host > 1), this
> trumped the server delay, causing the latter to be ignored completely. That
> is certainly not the case in the current implementation, as it will wait for
> server delay whenever the number of threads accessing a given host drops to
> zero.

--
This message was sent by Atlassian JIRA (v6.2#6252)
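The two scenarios in the issue description can be made concrete with a small model. The sketch below is a simplified reconstruction based only on the description above, not the actual HttpBase source: the class name `HostBlockSketch`, the method `tryBlockAddr`, the helper `isDelayed`, and the explicit `nowMs` parameter (used instead of System.currentTimeMillis() so the behavior is deterministic) are all assumptions made for illustration. Only `unblockAddr` and the two map names come from the quoted text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the per-host bookkeeping described in the
// issue. This is NOT the Nutch implementation; it only mirrors the behavior
// the reporter describes for THREADS_PER_HOST_COUNT and BLOCKED_ADDR_TO_TIME.
class HostBlockSketch {
    private final Map<String, Integer> threadsPerHostCount = new HashMap<>();
    private final Map<String, Long> blockedAddrToTime = new HashMap<>();
    private final int maxThreadsPerHost; // stands in for fetcher.threads.per.host
    private final long crawlDelayMs;     // stands in for the server delay

    HostBlockSketch(int maxThreadsPerHost, long crawlDelayMs) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelayMs = crawlDelayMs;
    }

    // Returns true if one more thread may start fetching from host at nowMs.
    boolean tryBlockAddr(String host, long nowMs) {
        if (isDelayed(host, nowMs)) {
            return false; // a delay scheduled by the last thread is still in force
        }
        blockedAddrToTime.remove(host);
        int active = threadsPerHostCount.getOrDefault(host, 0);
        if (active >= maxThreadsPerHost) {
            return false; // per-host thread limit reached
        }
        threadsPerHostCount.put(host, active + 1);
        return true;
    }

    // Called when a thread finishes with host. Only the LAST active thread
    // (count == 1) schedules the delay; otherwise the count is just decremented.
    void unblockAddr(String host, long nowMs) {
        int active = threadsPerHostCount.getOrDefault(host, 0);
        if (active == 1) {
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, nowMs + crawlDelayMs);
        } else if (active > 1) {
            threadsPerHostCount.put(host, active - 1);
        }
    }

    // True while a previously scheduled delay for host has not yet expired.
    boolean isDelayed(String host, long nowMs) {
        Long until = blockedAddrToTime.get(host);
        return until != null && nowMs < until;
    }
}
```

With a limit of 2, a thread that finishes while another is still active leaves no delay behind, so a replacement thread is admitted immediately; once the count drops to zero, the host is blocked for the full delay even though nothing is accessing it. That is the inconsistency the reporter points out.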
[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045678#comment-14045678 ]

Hudson commented on NUTCH-385:
--

FAILURE: Integrated in Nutch-trunk #2676 (See [https://builds.apache.org/job/Nutch-trunk/2676/])
NUTCH-385 Improve description of thread related configuration for Fetcher (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1605978)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml

> Improve description of thread related configuration for Fetcher
> ---
>
> Key: NUTCH-385
> URL: https://issues.apache.org/jira/browse/NUTCH-385
> Project: Nutch
> Issue Type: Bug
> Components: documentation, fetcher
> Reporter: Chris Schneider
> Assignee: Julien Nioche
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-385.patch
[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045525#comment-14045525 ]

lufeng commented on NUTCH-385:
--

Hi Julien,

In the description of "fetcher.threads.per.queue", could we add that setting "fetcher.threads.per.queue" to a value > 1 will also cause "fetcher.server.delay" to be ignored?

Another issue: I think the property name "fetcher.max.crawl.delay" is not consistent with "fetcher.server.delay" and "fetcher.server.min.delay". Would renaming it to "fetcher.server.max.delay" be more suitable?

> Improve description of thread related configuration for Fetcher
> ---
>
> Key: NUTCH-385
> URL: https://issues.apache.org/jira/browse/NUTCH-385
> Project: Nutch
> Issue Type: Bug
> Components: documentation, fetcher
> Reporter: Chris Schneider
> Assignee: Julien Nioche
> Fix For: 1.9
>
> Attachments: NUTCH-385.patch