[jira] [Created] (NUTCH-1801) Fix chain of dependencies between ANT tasks
Julien Nioche created NUTCH-1801:

Summary: Fix chain of dependencies between ANT tasks
Key: NUTCH-1801
URL: https://issues.apache.org/jira/browse/NUTCH-1801
Project: Nutch
Issue Type: Bug
Components: build
Affects Versions: 1.8
Reporter: Julien Nioche
Fix For: 1.9

The chain of dependencies between ANT tasks needs fixing. The main issue is that dependencies with a 'test' scope in Ivy are not resolved properly; more precisely, the resolution task itself works fine but is not called from the upper-level 'test' tasks. This can easily be reproduced by marking the junit dependency in ivy.xml as conf="test->default". The 'test-core' task, for instance, relies on the 'job' task, which should not be the case. Ideally we'd want to have a separate lib dir for the test dependencies so that they do not get copied into the job file, where they are absolutely not needed.

-- This message was sent by Atlassian JIRA (v6.2#6252)
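The reproduction step hinges on Ivy's configuration mapping syntax. As a sketch (the `rev` value is illustrative, not taken from Nutch's actual ivy.xml):

```xml
<!-- Sketch: marking junit as a test-only dependency in ivy.xml.
     conf="test->default" maps the project's 'test' configuration to
     the artifact's 'default' configuration, so junit is only pulled
     in when the test configuration is resolved and therefore never
     lands in the job file. -->
<dependency org="junit" name="junit" rev="4.11" conf="test->default"/>
```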
[jira] [Updated] (NUTCH-1801) Fix chain of dependencies between ANT tasks
[ https://issues.apache.org/jira/browse/NUTCH-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1801:

Description: The chain of dependencies between ANT tasks needs fixing. The main issue is that dependencies with a 'test' scope in Ivy are not resolved properly; more precisely, the resolution task itself works fine but is not called from the upper-level 'test' tasks. This can easily be reproduced by marking the junit dependency in ivy.xml as conf="test->default". Ideally we'd want to have a separate lib dir for the test dependencies so that they do not get copied into the job file, where they are absolutely not needed.

was: The chain of dependencies between ANT tasks needs fixing. The main issue is that dependencies with a 'test' scope in Ivy are not resolved properly; more precisely, the resolution task itself works fine but is not called from the upper-level 'test' tasks. This can easily be reproduced by marking the junit dependency in ivy.xml as conf="test->default". The 'test-core' task, for instance, relies on the 'job' task, which should not be the case. Ideally we'd want to have a separate lib dir for the test dependencies so that they do not get copied into the job file, where they are absolutely not needed.
[jira] [Created] (NUTCH-1802) Move TestbedProxy to test environment
Julien Nioche created NUTCH-1802:

Summary: Move TestbedProxy to test environment
Key: NUTCH-1802
URL: https://issues.apache.org/jira/browse/NUTCH-1802
Project: Nutch
Issue Type: Sub-task
Components: build
Affects Versions: 1.8
Reporter: Julien Nioche

The proxy task relies on the test classpath but its code is in src/java/org/apache/nutch/tools/proxy. One of the benefits of moving it to the tests is that its dependencies would not be shipped in the job file, where they are not needed (e.g. servlet stuff). The Ant task would work as before.
[jira] [Created] (NUTCH-1803) Put test dependencies in a separate lib dir
Julien Nioche created NUTCH-1803:

Summary: Put test dependencies in a separate lib dir
Key: NUTCH-1803
URL: https://issues.apache.org/jira/browse/NUTCH-1803
Project: Nutch
Issue Type: Sub-task
Components: build
Affects Versions: 1.8
Reporter: Julien Nioche

See main issue NUTCH-1801. This would mean that these libs do not get included in the job file, and it provides a cleaner separation.
[jira] [Updated] (NUTCH-1803) Put test dependencies in a separate lib dir
[ https://issues.apache.org/jira/browse/NUTCH-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1803:

Attachment: NUTCH-1803.patch

Patch which keeps the test dependencies in a directory separate from the main deps and fixes the order of dependencies so that the test libs are resolved prior to testing.

Put test dependencies in a separate lib dir
Key: NUTCH-1803
Fix For: 1.9
Attachments: NUTCH-1803.patch
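The approach described for the patch can be sketched as Ant targets along these lines (target, property, and directory names here are hypothetical, not necessarily those used in the actual patch):

```xml
<!-- Hypothetical sketch: retrieve only the 'test' Ivy configuration
     into its own lib directory. The job packaging copies the main
     lib directory, so the test jars stay out of the job file. -->
<target name="resolve-test" depends="init">
  <ivy:retrieve pattern="${build.dir}/test/lib/[artifact]-[revision].[ext]"
                conf="test" sync="true"/>
</target>

<!-- Test targets then depend on the resolution target directly,
     instead of going through the 'job' target. -->
<target name="test-core" depends="compile-core-test, resolve-test">
  <!-- run junit with ${build.dir}/test/lib/*.jar on the classpath -->
</target>
```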
[jira] [Updated] (NUTCH-1801) Improve handling of test dependencies in ANT+Ivy
[ https://issues.apache.org/jira/browse/NUTCH-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1801:

Description: The chain of dependencies between ANT tasks needs fixing. The main issue is that dependencies with a 'test' scope in Ivy are not resolved properly; more precisely, the resolution task itself works fine but is not called from the upper-level 'test' tasks. This can easily be reproduced by marking the junit dependency in ivy.xml as conf="test->default". Ideally we'd want to have a separate lib dir for the test dependencies so that they do not get copied into the job file, where they are absolutely not needed.
[jira] [Created] (NUTCH-1804) Move JUnit dependency to test scope
Julien Nioche created NUTCH-1804:

Summary: Move JUnit dependency to test scope
Key: NUTCH-1804
URL: https://issues.apache.org/jira/browse/NUTCH-1804
Project: Nutch
Issue Type: Sub-task
Components: build
Affects Versions: 1.8
Reporter: Julien Nioche

Should work straight away with the core tests after applying NUTCH-1803, but it requires fixing the build for the plugins by either adding the main test dependencies to their classpath or forcing them to declare JUnit as a test dependency in their own ivy.xml. The latter is probably cleaner, but we need to make sure that the test dependencies do not get added to the built version of the plugin.
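The second option could look roughly like this in a plugin's ivy.xml (a sketch; the `rev` value is illustrative):

```xml
<!-- Hypothetical sketch of a plugin's ivy.xml declaring JUnit with a
     test-only configuration mapping, so the jar is available on the
     plugin's test classpath but not bundled with the built plugin. -->
<dependencies>
  <dependency org="junit" name="junit" rev="4.11" conf="test->default"/>
</dependencies>
```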
[jira] [Updated] (NUTCH-1801) Improve handling of test dependencies in ANT+Ivy
[ https://issues.apache.org/jira/browse/NUTCH-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1801:

Summary: Improve handling of test dependencies in ANT+Ivy (was: Fix chain of dependencies between ANT tasks)
[jira] [Commented] (NUTCH-1802) Move TestbedProxy to test environment
[ https://issues.apache.org/jira/browse/NUTCH-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044542#comment-14044542 ]

Julien Nioche commented on NUTCH-1802:

We'll just need to rename the main class to something not beginning with Test* so that we don't try to call the JUnit test bed on it.

Move TestbedProxy to test environment
Key: NUTCH-1802
Fix For: 1.9
[jira] [Created] (NUTCH-1805) Remove unnecessary transitive dependencies from Hadoop core
Julien Nioche created NUTCH-1805:

Summary: Remove unnecessary transitive dependencies from Hadoop core
Key: NUTCH-1805
URL: https://issues.apache.org/jira/browse/NUTCH-1805
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 1.8
Reporter: Julien Nioche

The Hadoop libs are not included in the job file, as a Hadoop cluster must already be available in order to use it; however, some of its transitive dependencies make it into the job file. We already prevent some but could extend that to:

<exclude org="org.mortbay.jetty"/>
<exclude org="com.sun.jersey"/>
<exclude org="tomcat"/>

Note that we need some of the Hadoop classes and dependencies in order to run Nutch in local mode. Alternatively we could have a separate Ivy profile only for Hadoop and store the dependencies in a separate location so that they do not get copied to the job jar; however, this is probably overkill if the dependencies above are not needed when running in local mode.
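The extended exclusions could be expressed in ivy.xml roughly as follows (a sketch; the Hadoop revision shown is illustrative only):

```xml
<!-- Sketch: module-wide excludes in ivy.xml keep these transitive
     Hadoop dependencies out of the resolved libs, and hence out of
     the job file. -->
<dependencies>
  <dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.0"
              conf="*->default"/>
  <exclude org="org.mortbay.jetty"/>
  <exclude org="com.sun.jersey"/>
  <exclude org="tomcat"/>
</dependencies>
```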
[jira] [Created] (NUTCH-1806) Delegate processing of URL domains to crawler commons
Julien Nioche created NUTCH-1806:

Summary: Delegate processing of URL domains to crawler commons
Key: NUTCH-1806
URL: https://issues.apache.org/jira/browse/NUTCH-1806
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.8
Reporter: Julien Nioche

We have code in src/java/org/apache/nutch/util/domain and a resource file conf/domain-suffixes.xml to handle URL domains. This is used mostly from URLUtil.getDomainName. The resource file is not necessarily up to date, and since crawler commons has similar functionality we should use it instead of having to maintain our own resources.
[jira] [Assigned] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reassigned NUTCH-385:

Assignee: Julien Nioche

Server delay feature conflicts with maxThreadsPerHost

Key: NUTCH-385
URL: https://issues.apache.org/jira/browse/NUTCH-385
Project: Nutch
Issue Type: Bug
Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche

For some time I've been puzzled by the interaction between two parameters that control how often the fetcher can access a particular host:

1) The server delay, which comes back from the remote server during our processing of the robots.txt file, and which can be limited by fetcher.max.crawl.delay.
2) The fetcher.threads.per.host value, particularly when this is greater than the default of 1.

According to my (limited) understanding of the code in HttpBase.java: Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher ends up keeping either 1 or 2 fetcher threads pointing at a particular host continuously. In other words, it never tries to point 3 at the host, and it always points a second thread at the host before the first thread finishes accessing it. Since HttpBase.unblockAddr never gets called with (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. Thus, the server delay will never be used at all. The fetcher will be continuously retrieving pages from the host, often with 2 fetchers accessing the host simultaneously.

Suppose instead that the fetcher finally does allow the last thread to complete before it gets around to pointing another thread at the target host. When the last fetcher thread calls HttpBase.unblockAddr, it will now put System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host.
This, in turn, will prevent any threads from accessing this host until the delay is complete, even though zero threads are currently accessing the host. I see this behavior as inconsistent. More importantly, the current implementation certainly doesn't seem to answer my original question about appropriate definitions for what appear to be conflicting parameters. In a nutshell, how could we possibly honor the server delay if we allow more than one fetcher thread to simultaneously access the host? It would be one thing if, whenever (fetcher.threads.per.host > 1), this trumped the server delay, causing the latter to be ignored completely. That is certainly not the case in the current implementation, as it will wait for the server delay whenever the number of threads accessing a given host drops to zero.
[jira] [Updated] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-385:

Attachment: NUTCH-385.patch

Improved description of the thread-related configuration for the Fetcher. The Fetcher implementation has changed a lot since this issue was opened, and the descriptions in nutch-default.xml reflect the current behaviour. [~schmed] OK with these changes? BTW, do you still use Nutch 8 years after opening the issue?
[jira] [Commented] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044743#comment-14044743 ]

Chris Schneider commented on NUTCH-385:

Hi Julien, Thanks for the documentation changes and for investing your time in an issue I raised so long ago. Unfortunately (since I haven't used Nutch in the past 5 years), it would be difficult for me to validate that your description of the fetcher behavior is correct and sufficient. I would recommend that you ask Andrzej (or perhaps Doug) to review them instead. Best Regards, Chris
[jira] [Updated] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-385:

Fix Version/s: 1.9
[jira] [Commented] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044767#comment-14044767 ]

Julien Nioche commented on NUTCH-385:

Will commit shortly unless someone objects or proposes a better formulation.
[jira] [Updated] (NUTCH-385) Improve description of thread related configuration for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-385:

Summary: Improve description of thread related configuration for Fetcher (was: Server delay feature conflicts with maxThreadsPerHost)
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045224#comment-14045224 ] Aaron Bedward commented on NUTCH-1798: -- Right... i have made a few observations (i may have misunderstood the architecture but please bare with me) I have managed to get 2.x indexing with ES by making the following changes to the crawl script Line 149: echo Indexing $CRAWL_ID on SOLR index - $SOLRURL -Line 150: $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID Line 150: $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID Example call: ./bin/crawl urls test http://localhost:9300 2 However i believe the script should use $bin/nutch index -D solr.server.url=$SOLRURL Hope this helps anybody trying to use ES, i will commit my source code MongoDB over the weekend Unable to get any documents to index in elastic search -- Key: NUTCH-1798 URL: https://issues.apache.org/jira/browse/NUTCH-1798 Project: Nutch Issue Type: Bug Affects Versions: 2.3 Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9 Reporter: Aaron Bedward Fix For: 2.3 Attachments: part-r-0 Hopefully this is something i am doing wrong. I have checked out 2.x as i would like to use the new metatag extraction features. I have then run ant runtime to build, I have updated the nutch-site.xml like so: property nameplugin.includes/name valueprotocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic/value descriptionRegular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. 
In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library. /description /property property nameelastic.cluster/name valueelasticsearch/value descriptionThe cluster name to discover. Either host and potr must be defined or cluster./description /property I have then created a folder called urls and added seed.txt. i ran the following commands bin/nutch inject urls bin/nutch generate -topN 1000 bin/nutch fetch -all bin/nutch parse -all bin/nutch updatedb bin/nutch index -all it runs no errors however no documents have been index i also tried setting up the following with solr and no documents are indexed Log: 2014-06-24 02:57:57,804 INFO parse.ParserJob - ParserJob: success 2014-06-24 02:57:57,805 INFO parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06 2014-06-24 02:57:59,823 INFO indexer.IndexingJob - IndexingJob: starting 2014-06-24 02:58:00,815 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100 2014-06-24 02:58:00,815 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter 2014-06-24 02:58:01,774 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter 2014-06-24 02:58:01,776 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off 2014-06-24 02:58:01,776 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2014-06-24 02:58:03,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2014-06-24 02:58:04,920 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter 2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z] 2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] initializing ... 
2014-06-24 02:58:05,377 INFO elasticsearch.plugins - [Silver] loaded [], sites []
2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] initialized
2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] starting ...
2014-06-24 02:58:08,431 INFO elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
2014-06-24 02:58:11,540 INFO cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master
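The script change suggested in the comment above can be sketched as a dry-run shell function: it only builds the proposed generic indexing command as a string, so it can be inspected without running Nutch. The argument names mirror the bin/crawl script's $SOLRURL and $CRAWL_ID variables; the helper function itself is hypothetical, not part of the script.

```shell
# Hypothetical helper: builds (but does not run) the indexing command the
# comment proposes, i.e. the generic "index" job with the endpoint passed
# via -D solr.server.url instead of the hard-coded solrindex invocation.
build_index_cmd() {
  local solrurl="$1" crawl_id="$2"
  echo "bin/nutch index -D solr.server.url=$solrurl -all -crawlId $crawl_id"
}

# Example: the ES transport address from the comment's example call.
build_index_cmd "http://localhost:9300" "test"
```

Going through the pluggable "index" job means the active indexer plugin (here indexer-elasticsearch) picks up the documents, rather than the Solr-specific path.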
[jira] [Comment Edited] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045224#comment-14045224 ] Aaron Bedward edited comment on NUTCH-1798 at 6/26/14 9:32 PM:
---

Right... I have made a few observations (I may have misunderstood the architecture, but please bear with me). I have managed to get 2.x indexing with ES by making the following changes to the crawl script:

Line 149: echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
-Line 150: $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
+Line 150: $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID

Example call: ./bin/crawl urls test http://localhost:9300 2

However, I believe the script should use $bin/nutch index -D solr.server.url=$SOLRURL. Hope this helps anybody trying to use ES; I will commit my source code for MongoDB over the weekend.

was (Author: mrbedward):
Right... i have made a few observations (i may have misunderstood the architecture but please bare with me) I have managed to get 2.x indexing with ES by making the following changes to the crawl script
Line 149: echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
-Line 150: $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
Line 150: $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID
Example call: ./bin/crawl urls test http://localhost:9300 2
However i believe the script should use $bin/nutch index -D solr.server.url=$SOLRURL Hope this helps anybody trying to use ES, i will commit my source code MongoDB over the weekend

Unable to get any documents to index in elastic search
--
Key: NUTCH-1798
URL: https://issues.apache.org/jira/browse/NUTCH-1798
Project: Nutch
Issue Type: Bug
Affects Versions: 2.3
Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
Reporter: Aaron Bedward
Fix For: 2.3
Attachments: part-r-0
[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045525#comment-14045525 ] lufeng commented on NUTCH-385:
--

Hi Julien. Looking at the description of fetcher.threads.per.queue, we could add that setting fetcher.threads.per.queue to a value greater than 1 will also cause fetcher.server.delay to be ignored. Another issue is that the property name fetcher.max.crawl.delay is not uniform with fetcher.server.delay and fetcher.server.min.delay. Would renaming it to fetcher.server.max.delay be more suitable?

Improve description of thread related configuration for Fetcher
---
Key: NUTCH-385
URL: https://issues.apache.org/jira/browse/NUTCH-385
Project: Nutch
Issue Type: Bug
Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche
Fix For: 1.9
Attachments: NUTCH-385.patch

For some time I've been puzzled by the interaction between two parameters that control how often the fetcher can access a particular host:

1) The server delay, which comes back from the remote server during our processing of the robots.txt file, and which can be limited by fetcher.max.crawl.delay.
2) The fetcher.threads.per.host value, particularly when this is greater than the default of 1.

According to my (limited) understanding of the code in HttpBase.java: Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher ends up keeping either 1 or 2 fetcher threads pointing at a particular host continuously. In other words, it never tries to point 3 at the host, and it always points a second thread at the host before the first thread finishes accessing it. Since HttpBase.unblockAddr never gets called with (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. Thus, the server delay will never be used at all. The fetcher will be continuously retrieving pages from the host, often with 2 fetcher threads accessing the host simultaneously.

Suppose instead that the fetcher finally does allow the last thread to complete before it gets around to pointing another thread at the target host. When the last fetcher thread calls HttpBase.unblockAddr, it will now put System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. This, in turn, will prevent any threads from accessing this host until the delay is complete, even though zero threads are currently accessing the host.

I see this behavior as inconsistent. More importantly, the current implementation certainly doesn't seem to answer my original question about appropriate definitions for what appear to be conflicting parameters. In a nutshell, how could we possibly honor the server delay if we allow more than one fetcher thread to simultaneously access the host? It would be one thing if, whenever fetcher.threads.per.host > 1, this trumped the server delay, causing the latter to be ignored completely. That is certainly not the case in the current implementation, as it will wait for the server delay whenever the number of threads accessing a given host drops to zero.

--
This message was sent by Atlassian JIRA (v6.2#6252)
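The blocking behavior described in the issue can be sketched as a small Java class. This is a simplified reconstruction from the comment's prose, not the actual HttpBase source: the map names (BLOCKED_ADDR_TO_TIME, THREADS_PER_HOST_COUNT) and method names (blockAddr, unblockAddr) come from the comment, while the method bodies and the explicit `now` timestamp parameter are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the per-host blocking logic the issue describes.
// Not the real org.apache.nutch.protocol.http.api.HttpBase; a model of it.
public class HostBlockingSketch {
    private final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
    private final Map<String, Integer> THREADS_PER_HOST_COUNT = new HashMap<>();
    private final int maxThreadsPerHost;  // models fetcher.threads.per.host
    private final long crawlDelay;        // models the server delay (ms)

    public HostBlockingSketch(int maxThreadsPerHost, long crawlDelay) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelay = crawlDelay;
    }

    // Returns true if the calling thread may fetch from the host now.
    public synchronized boolean blockAddr(String host, long now) {
        Long blockedUntil = BLOCKED_ADDR_TO_TIME.get(host);
        if (blockedUntil != null) {
            if (now < blockedUntil) return false;  // still inside the server delay
            BLOCKED_ADDR_TO_TIME.remove(host);
        }
        int count = THREADS_PER_HOST_COUNT.getOrDefault(host, 0);
        if (count >= maxThreadsPerHost) return false;  // thread cap reached
        THREADS_PER_HOST_COUNT.put(host, count + 1);
        return true;
    }

    // Called when a thread finishes with the host. The server delay is only
    // scheduled when the LAST thread leaves (count == 1) -- this is the
    // inconsistency the issue describes: with threads.per.host > 1 and
    // always-overlapping threads, the count never drops to one here, so
    // crawlDelay is never applied at all.
    public synchronized void unblockAddr(String host, long now) {
        int count = THREADS_PER_HOST_COUNT.get(host);
        if (count == 1) {
            THREADS_PER_HOST_COUNT.remove(host);
            BLOCKED_ADDR_TO_TIME.put(host, now + crawlDelay);
        } else {
            THREADS_PER_HOST_COUNT.put(host, count - 1);
        }
    }
}
```

With maxThreadsPerHost = 2, a second thread admitted before the first leaves keeps the count above one, so unblockAddr never schedules the delay; only when the count drops to zero does the full crawlDelay block kick in, even though no thread is touching the host.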