[jira] [Commented] (NUTCH-2806) Nutch can't parse links
[ https://issues.apache.org/jira/browse/NUTCH-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155785#comment-17155785 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-2806:
--------------------------------------------------------

Hi [~immobilier-dz], can you check the value of the {{db.ignore.external.links}} setting in your configuration? By default it is set to false, which means that Nutch should be able to at least detect/add the external links for crawling in a future crawl cycle. See [https://github.com/apache/nutch/blob/2.x/conf/nutch-default.xml#L498-L505].

Finally, keep in mind that it is normally best to send this type of inquiry to the user/developer mailing lists ([https://nutch.apache.org/mailing_lists.html]).

> Nutch can't parse links
> -----------------------
>
> Key: NUTCH-2806
> URL: https://issues.apache.org/jira/browse/NUTCH-2806
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4
> Reporter: lina dziri
> Priority: Major
> Fix For: 2.4
>
> Testing with the following site: [https://www.algeriahome.com|https://www.algeriahome.com/], Nutch only parses links that contain the base URL.
> I tried Tika as the parser, tried setting db.max.outlinks.per.page to -1, and followed practically every suggestion about detecting all links; I also suspected the urlfilter and regex-normalizer plugins and disabled them, but got the same results.
> Each time I rebuild Nutch and test the parser, it reports the same URL count, around 378.
> Can somebody help fix this?
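As a side note, the two properties mentioned above can also be inspected programmatically through the Hadoop {{Configuration}} API. The following is only an illustrative sketch (the class name is invented, the defaults shown are the ones shipped in nutch-default.xml); in practice the values are simply overridden in conf/nutch-site.xml:

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class OutlinkConfigCheck {
  public static void main(String[] args) {
    // Loads nutch-default.xml and nutch-site.xml from the classpath.
    Configuration conf = NutchConfiguration.create();

    // false (the default): outlinks to external hosts are kept for future crawl cycles.
    boolean ignoreExternal = conf.getBoolean("db.ignore.external.links", false);

    // 100 is the shipped default; -1 means "no limit" on outlinks stored per page.
    int maxOutlinks = conf.getInt("db.max.outlinks.per.page", 100);

    System.out.println("db.ignore.external.links = " + ignoreExternal);
    System.out.println("db.max.outlinks.per.page = " + maxOutlinks);
  }
}
{noformat}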
[jira] [Created] (NUTCH-2806) Nutch can't parse links
lina dziri created NUTCH-2806:
---------------------------------

Summary: Nutch can't parse links
Key: NUTCH-2806
URL: https://issues.apache.org/jira/browse/NUTCH-2806
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 2.4
Reporter: lina dziri
Fix For: 2.4


Testing with the following site: [https://www.algeriahome.com|https://www.algeriahome.com/], Nutch only parses links that contain the base URL.

I tried Tika as the parser, tried setting db.max.outlinks.per.page to -1, and followed practically every suggestion about detecting all links; I also suspected the urlfilter and regex-normalizer plugins and disabled them, but got the same results.

Each time I rebuild Nutch and test the parser, it reports the same URL count, around 378.

Can somebody help fix this?
[jira] [Commented] (NUTCH-2805) Rename plugin urlfilter-domainblacklist
[ https://issues.apache.org/jira/browse/NUTCH-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155728#comment-17155728 ]

Lewis John McGibbney commented on NUTCH-2805:
----------------------------------------------

Nice

> Rename plugin urlfilter-domainblacklist
> ----------------------------------------
>
> Key: NUTCH-2805
> URL: https://issues.apache.org/jira/browse/NUTCH-2805
> Project: Nutch
> Issue Type: Sub-task
> Components: plugin, urlfilter
> Reporter: Sebastian Nagel
> Assignee: Shashanka Balakuntala Srinivasa
> Priority: Major
> Fix For: 1.18
>
> As part of NUTCH-2802 the plugin {{urlfilter-domainblacklist}} should be renamed, including variable names in Java classes and the file names of configuration files.
[jira] [Commented] (NUTCH-2805) Rename plugin urlfilter-domainblacklist
[ https://issues.apache.org/jira/browse/NUTCH-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155687#comment-17155687 ]

Shashanka Balakuntala Srinivasa commented on NUTCH-2805:
---------------------------------------------------------

Creating a PR in a while.

> Rename plugin urlfilter-domainblacklist
> ----------------------------------------
>
> Key: NUTCH-2805
> URL: https://issues.apache.org/jira/browse/NUTCH-2805
> Project: Nutch
> Issue Type: Sub-task
> Components: plugin, urlfilter
> Reporter: Sebastian Nagel
> Assignee: Shashanka Balakuntala Srinivasa
> Priority: Major
> Fix For: 1.18
>
> As part of NUTCH-2802 the plugin {{urlfilter-domainblacklist}} should be renamed, including variable names in Java classes and the file names of configuration files.
[jira] [Assigned] (NUTCH-2805) Rename plugin urlfilter-domainblacklist
[ https://issues.apache.org/jira/browse/NUTCH-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashanka Balakuntala Srinivasa reassigned NUTCH-2805:
-------------------------------------------------------

Assignee: Shashanka Balakuntala Srinivasa

> Rename plugin urlfilter-domainblacklist
> ----------------------------------------
>
> Key: NUTCH-2805
> URL: https://issues.apache.org/jira/browse/NUTCH-2805
> Project: Nutch
> Issue Type: Sub-task
> Components: plugin, urlfilter
> Reporter: Sebastian Nagel
> Assignee: Shashanka Balakuntala Srinivasa
> Priority: Major
> Fix For: 1.18
>
> As part of NUTCH-2802 the plugin {{urlfilter-domainblacklist}} should be renamed, including variable names in Java classes and the file names of configuration files.
[jira] [Commented] (NUTCH-2803) Rename property http.robot.rules.whitelist
[ https://issues.apache.org/jira/browse/NUTCH-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155668#comment-17155668 ]

ASF GitHub Bot commented on NUTCH-2803:
----------------------------------------

lewismc opened a new pull request #539:
URL: https://github.com/apache/nutch/pull/539

This PR addresses https://issues.apache.org/jira/browse/NUTCH-2803. Feels good to be contributing to Nutch again!
On the use of language here: **white** is being replaced with **allow**.

> Rename property http.robot.rules.whitelist
> -------------------------------------------
>
> Key: NUTCH-2803
> URL: https://issues.apache.org/jira/browse/NUTCH-2803
> Project: Nutch
> Issue Type: Sub-task
> Components: configuration, robots
> Reporter: Sebastian Nagel
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.18
>
> As part of NUTCH-2802 the property {{http.robot.rules.whitelist}} should be renamed.
> See the [definition of http.robot.rules.whitelist|http://nutch.apache.org/apidocs/apidocs-1.17/resources/nutch-default.xml#http.robot.rules.whitelist]:
> bq. Comma separated list of hostnames or IP addresses to ignore robot rules parsing for. Use with care and only if you are explicitly allowed by the site owner to ignore the site's robots.txt!
[GitHub] [nutch] lewismc opened a new pull request #539: NUTCH-2803 Rename property http.robot.rules.whitelist
lewismc opened a new pull request #539:
URL: https://github.com/apache/nutch/pull/539

This PR addresses https://issues.apache.org/jira/browse/NUTCH-2803. Feels good to be contributing to Nutch again!
On the use of language here: **white** is being replaced with **allow**.
[jira] [Assigned] (NUTCH-2803) Rename property http.robot.rules.whitelist
[ https://issues.apache.org/jira/browse/NUTCH-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-2803:
--------------------------------------------

Assignee: Lewis John McGibbney

> Rename property http.robot.rules.whitelist
> -------------------------------------------
>
> Key: NUTCH-2803
> URL: https://issues.apache.org/jira/browse/NUTCH-2803
> Project: Nutch
> Issue Type: Sub-task
> Components: configuration, robots
> Reporter: Sebastian Nagel
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.18
>
> As part of NUTCH-2802 the property {{http.robot.rules.whitelist}} should be renamed.
> See the [definition of http.robot.rules.whitelist|http://nutch.apache.org/apidocs/apidocs-1.17/resources/nutch-default.xml#http.robot.rules.whitelist]:
> bq. Comma separated list of hostnames or IP addresses to ignore robot rules parsing for. Use with care and only if you are explicitly allowed by the site owner to ignore the site's robots.txt!
[jira] [Created] (NUTCH-2805) Rename plugin urlfilter-domainblacklist
Sebastian Nagel created NUTCH-2805:
--------------------------------------

Summary: Rename plugin urlfilter-domainblacklist
Key: NUTCH-2805
URL: https://issues.apache.org/jira/browse/NUTCH-2805
Project: Nutch
Issue Type: Sub-task
Components: plugin, urlfilter
Reporter: Sebastian Nagel
Fix For: 1.18


As part of NUTCH-2802 the plugin {{urlfilter-domainblacklist}} should be renamed, including variable names in Java classes and the file names of configuration files.
[jira] [Created] (NUTCH-2804) Rename blacklist/whitelist in configuration of subcollection plugin
Sebastian Nagel created NUTCH-2804:
--------------------------------------

Summary: Rename blacklist/whitelist in configuration of subcollection plugin
Key: NUTCH-2804
URL: https://issues.apache.org/jira/browse/NUTCH-2804
Project: Nutch
Issue Type: Sub-task
Components: indexer, plugin
Reporter: Sebastian Nagel
Fix For: 1.18


As part of NUTCH-2802 the element names ("blacklist"/"whitelist") in the file {{conf/subcollection.xml}}, which configure the inclusion/exclusion of documents by URL into a "subcollection", should be renamed. Also, variable names in the Java classes should reflect this change of terminology.
[jira] [Created] (NUTCH-2803) Rename property http.robot.rules.whitelist
Sebastian Nagel created NUTCH-2803:
--------------------------------------

Summary: Rename property http.robot.rules.whitelist
Key: NUTCH-2803
URL: https://issues.apache.org/jira/browse/NUTCH-2803
Project: Nutch
Issue Type: Sub-task
Components: configuration, robots
Reporter: Sebastian Nagel
Fix For: 1.18


As part of NUTCH-2802 the property {{http.robot.rules.whitelist}} should be renamed.

See the [definition of http.robot.rules.whitelist|http://nutch.apache.org/apidocs/apidocs-1.17/resources/nutch-default.xml#http.robot.rules.whitelist]:

bq. Comma separated list of hostnames or IP addresses to ignore robot rules parsing for. Use with care and only if you are explicitly allowed by the site owner to ignore the site's robots.txt!
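For illustration only (this is not the actual {{HttpRobotRulesParser}} code; the class name and the matching logic are invented), a sketch of how a comma-separated host list such as this property can be read and consulted:

{noformat}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class RobotsAllowlistSketch {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();

    // Comma-separated hostnames/IPs for which robots.txt rules would be ignored.
    // Empty by default; only to be used with explicit permission of the site owner.
    String[] entries = conf.getTrimmedStrings("http.robot.rules.whitelist");
    Set<String> allowed = new HashSet<>(Arrays.asList(entries));

    String host = (args.length > 0) ? args[0] : "www.example.com";
    if (allowed.contains(host)) {
      System.out.println(host + ": robots.txt rules are ignored");
    } else {
      System.out.println(host + ": robots.txt rules apply");
    }
  }
}
{noformat}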
[jira] [Commented] (NUTCH-2782) protocol-http / lib-http: support TLSv1.3
[ https://issues.apache.org/jira/browse/NUTCH-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155641#comment-17155641 ]

ASF GitHub Bot commented on NUTCH-2782:
----------------------------------------

balashashanka opened a new pull request #538:
URL: https://github.com/apache/nutch/pull/538

1. Added TLSv1.3 and the supported cipher suites
2. Fixed indentation using eclipse-codeformat.xml

> protocol-http / lib-http: support TLSv1.3
> ------------------------------------------
>
> Key: NUTCH-2782
> URL: https://issues.apache.org/jira/browse/NUTCH-2782
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, protocol
> Affects Versions: 1.16
> Reporter: Sebastian Nagel
> Assignee: Shashanka Balakuntala Srinivasa
> Priority: Major
> Labels: help-wanted
> Fix For: 1.18
>
> [TLSv1.3|https://en.wikipedia.org/wiki/Transport_Layer_Security#TLS_1.3] (since 2018) is not included in the list of supported protocols in lib-http ([HttpBase.java, line 311|https://github.com/apache/nutch/blob/dcbb0f2bf450c6bec6f45125c68f5c7a0f061474/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L311]). It should be added. Also, the list of supported ciphers needs to be updated accordingly.
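Independent of the actual changes in pull request #538, here is a small standalone JSSE sketch of what enabling TLSv1.3 amounts to at the socket level; the host and the protocol list are illustrative, and TLSv1.3 requires a JRE that ships it (Java 11+, or a recent Java 8 update):

{noformat}
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class Tls13Probe {
  public static void main(String[] args) throws Exception {
    SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
    try (SSLSocket socket = (SSLSocket) factory.createSocket("example.com", 443)) {
      // Offer only the protocols the crawler is supposed to support.
      socket.setEnabledProtocols(new String[] { "TLSv1.3", "TLSv1.2" });
      socket.startHandshake();
      System.out.println("Negotiated protocol: " + socket.getSession().getProtocol());
      System.out.println("Cipher suite:        " + socket.getSession().getCipherSuite());
    }
  }
}
{noformat}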
[GitHub] [nutch] balashashanka opened a new pull request #538: NUTCH-2782: protocol-http / lib-http: support TLSv1.3
balashashanka opened a new pull request #538:
URL: https://github.com/apache/nutch/pull/538

1. Added TLSv1.3 and the supported cipher suites
2. Fixed indentation using eclipse-codeformat.xml
[jira] [Commented] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology
[ https://issues.apache.org/jira/browse/NUTCH-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155638#comment-17155638 ]

Sebastian Nagel commented on NUTCH-2802:
-----------------------------------------

Thanks, [~lewismc]! But let's open sub-tasks instead (I'll open them within the next few minutes). That way every change eventually shows up in the change log.

> Replace blacklist/whitelist by more inclusive and precise terminology
> ----------------------------------------------------------------------
>
> Key: NUTCH-2802
> URL: https://issues.apache.org/jira/browse/NUTCH-2802
> Project: Nutch
> Issue Type: Improvement
> Components: configuration, plugin
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.18
>
> The terms blacklist and whitelist should be replaced by a more inclusive and more precise terminology, see the proposal and discussion on the @dev mailing list ([1|https://lists.apache.org/thread.html/r43789859e45e6c961c4838f27f84f1e487691dbbbcb0a633deeb9fdb%40%3Cdev.nutch.apache.org%3E], [2|https://lists.apache.org/thread.html/r8f8341b53a02c141dbcecbcf9a4c1988d89f461cba1f8b0019bc7192%40%3Cdev.nutch.apache.org%3E]).
> This is an umbrella issue, subtasks to be opened for individual plugins and configuration properties.
[jira] [Commented] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology
[ https://issues.apache.org/jira/browse/NUTCH-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155635#comment-17155635 ]

Lewis John McGibbney commented on NUTCH-2802:
----------------------------------------------

[~snagel] thanks for opening this one. I'll go ahead and create a PR shortly.

> Replace blacklist/whitelist by more inclusive and precise terminology
> ----------------------------------------------------------------------
>
> Key: NUTCH-2802
> URL: https://issues.apache.org/jira/browse/NUTCH-2802
> Project: Nutch
> Issue Type: Improvement
> Components: configuration, plugin
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.18
>
> The terms blacklist and whitelist should be replaced by a more inclusive and more precise terminology, see the proposal and discussion on the @dev mailing list ([1|https://lists.apache.org/thread.html/r43789859e45e6c961c4838f27f84f1e487691dbbbcb0a633deeb9fdb%40%3Cdev.nutch.apache.org%3E], [2|https://lists.apache.org/thread.html/r8f8341b53a02c141dbcecbcf9a4c1988d89f461cba1f8b0019bc7192%40%3Cdev.nutch.apache.org%3E]).
> This is an umbrella issue, subtasks to be opened for individual plugins and configuration properties.
[jira] [Assigned] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology
[ https://issues.apache.org/jira/browse/NUTCH-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-2802:
--------------------------------------------

Assignee: Lewis John McGibbney

> Replace blacklist/whitelist by more inclusive and precise terminology
> ----------------------------------------------------------------------
>
> Key: NUTCH-2802
> URL: https://issues.apache.org/jira/browse/NUTCH-2802
> Project: Nutch
> Issue Type: Improvement
> Components: configuration, plugin
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.18
>
> The terms blacklist and whitelist should be replaced by a more inclusive and more precise terminology, see the proposal and discussion on the @dev mailing list ([1|https://lists.apache.org/thread.html/r43789859e45e6c961c4838f27f84f1e487691dbbbcb0a633deeb9fdb%40%3Cdev.nutch.apache.org%3E], [2|https://lists.apache.org/thread.html/r8f8341b53a02c141dbcecbcf9a4c1988d89f461cba1f8b0019bc7192%40%3Cdev.nutch.apache.org%3E]).
> This is an umbrella issue, subtasks to be opened for individual plugins and configuration properties.
[jira] [Created] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology
Sebastian Nagel created NUTCH-2802:
--------------------------------------

Summary: Replace blacklist/whitelist by more inclusive and precise terminology
Key: NUTCH-2802
URL: https://issues.apache.org/jira/browse/NUTCH-2802
Project: Nutch
Issue Type: Improvement
Components: configuration, plugin
Reporter: Lewis John McGibbney
Fix For: 1.18


The terms blacklist and whitelist should be replaced by a more inclusive and more precise terminology, see the proposal and discussion on the @dev mailing list ([1|https://lists.apache.org/thread.html/r43789859e45e6c961c4838f27f84f1e487691dbbbcb0a633deeb9fdb%40%3Cdev.nutch.apache.org%3E], [2|https://lists.apache.org/thread.html/r8f8341b53a02c141dbcecbcf9a4c1988d89f461cba1f8b0019bc7192%40%3Cdev.nutch.apache.org%3E]).

This is an umbrella issue, subtasks to be opened for individual plugins and configuration properties.
[jira] [Commented] (NUTCH-2801) RobotsRulesParser command-line checker to use http.robots.agents as fall-back
[ https://issues.apache.org/jira/browse/NUTCH-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155507#comment-17155507 ]

ASF GitHub Bot commented on NUTCH-2801:
----------------------------------------

sebastian-nagel commented on a change in pull request #537:
URL: https://github.com/apache/nutch/pull/537#discussion_r452865637

File path: src/java/org/apache/nutch/protocol/RobotRulesParser.java

@@ -376,13 +379,18 @@ public int run(String[] args) {
    */
   private static class TestRobotRulesParser extends RobotRulesParser {

-    public TestRobotRulesParser(Configuration conf) {
+    public void setConf(Configuration conf) {
       // make sure that agent name is set so that setConf() does not complain,

Review comment:
Thanks. You're right, the comment wasn't up to date. It would have been simpler to drop the command-line overwrite of the checked agent names and rely only on properties, but I didn't want to break the existing behavior.

> RobotsRulesParser command-line checker to use http.robots.agents as fall-back
> ------------------------------------------------------------------------------
>
> Key: NUTCH-2801
> URL: https://issues.apache.org/jira/browse/NUTCH-2801
> Project: Nutch
> Issue Type: Bug
> Components: checker, robots
> Affects Versions: 1.17
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.18
>
> The RobotsRulesParser command-line tool, used to check a list of URLs against one robots.txt file, should use the value of the property {{http.robots.agents}} as fall-back if no user agent names are explicitly given as command-line argument. In this case it should behave the same as the robots.txt parser, looking first for {{http.agent.name}}, then for other names listed in {{http.robots.agents}}, finally picking the rules for {{User-agent: *}}
> {noformat}
> $> cat robots.txt
> User-agent: Nutch
> Allow: /
>
> User-agent: *
> Disallow: /
>
> $> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
>      -Dhttp.agent.name=mybot \
>      -Dhttp.robots.agents='nutch,goodbot' \
>      robots.txt urls.txt
> Testing robots.txt for agent names: mybot,nutch,goodbot
> not allowed:    https://www.example.com/
> {noformat}
> The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading. Only the name "mybot" is actually checked.
[GitHub] [nutch] sebastian-nagel commented on a change in pull request #537: [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
sebastian-nagel commented on a change in pull request #537:
URL: https://github.com/apache/nutch/pull/537#discussion_r452865637

File path: src/java/org/apache/nutch/protocol/RobotRulesParser.java

@@ -376,13 +379,18 @@ public int run(String[] args) {
    */
   private static class TestRobotRulesParser extends RobotRulesParser {

-    public TestRobotRulesParser(Configuration conf) {
+    public void setConf(Configuration conf) {
       // make sure that agent name is set so that setConf() does not complain,

Review comment:
Thanks. You're right, the comment wasn't up to date. It would have been simpler to drop the command-line overwrite of the checked agent names and rely only on properties, but I didn't want to break the existing behavior.
[GitHub] [nutch] balashashanka commented on a change in pull request #537: [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
balashashanka commented on a change in pull request #537:
URL: https://github.com/apache/nutch/pull/537#discussion_r452858141

File path: src/java/org/apache/nutch/protocol/RobotRulesParser.java

@@ -376,13 +379,18 @@ public int run(String[] args) {
    */
   private static class TestRobotRulesParser extends RobotRulesParser {

-    public TestRobotRulesParser(Configuration conf) {
+    public void setConf(Configuration conf) {
       // make sure that agent name is set so that setConf() does not complain,

Review comment:
Hi @sebastian-nagel, a small question: is this comment still valid, since if the user agent is not provided we will check against robots.agents and the other names, right?
[jira] [Commented] (NUTCH-2801) RobotsRulesParser command-line checker to use http.robots.agents as fall-back
[ https://issues.apache.org/jira/browse/NUTCH-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155490#comment-17155490 ]

ASF GitHub Bot commented on NUTCH-2801:
----------------------------------------

balashashanka commented on a change in pull request #537:
URL: https://github.com/apache/nutch/pull/537#discussion_r452858141

File path: src/java/org/apache/nutch/protocol/RobotRulesParser.java

@@ -376,13 +379,18 @@ public int run(String[] args) {
    */
   private static class TestRobotRulesParser extends RobotRulesParser {

-    public TestRobotRulesParser(Configuration conf) {
+    public void setConf(Configuration conf) {
       // make sure that agent name is set so that setConf() does not complain,

Review comment:
Hi @sebastian-nagel, a small question: is this comment still valid, since if the user agent is not provided we will check against robots.agents and the other names, right?

> RobotsRulesParser command-line checker to use http.robots.agents as fall-back
> ------------------------------------------------------------------------------
>
> Key: NUTCH-2801
> URL: https://issues.apache.org/jira/browse/NUTCH-2801
> Project: Nutch
> Issue Type: Bug
> Components: checker, robots
> Affects Versions: 1.17
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.18
>
> The RobotsRulesParser command-line tool, used to check a list of URLs against one robots.txt file, should use the value of the property {{http.robots.agents}} as fall-back if no user agent names are explicitly given as command-line argument. In this case it should behave the same as the robots.txt parser, looking first for {{http.agent.name}}, then for other names listed in {{http.robots.agents}}, finally picking the rules for {{User-agent: *}}
> {noformat}
> $> cat robots.txt
> User-agent: Nutch
> Allow: /
>
> User-agent: *
> Disallow: /
>
> $> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
>      -Dhttp.agent.name=mybot \
>      -Dhttp.robots.agents='nutch,goodbot' \
>      robots.txt urls.txt
> Testing robots.txt for agent names: mybot,nutch,goodbot
> not allowed:    https://www.example.com/
> {noformat}
> The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading. Only the name "mybot" is actually checked.
[jira] [Commented] (NUTCH-2801) RobotsRulesParser command-line checker to use http.robots.agents as fall-back
[ https://issues.apache.org/jira/browse/NUTCH-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155472#comment-17155472 ]

ASF GitHub Bot commented on NUTCH-2801:
----------------------------------------

sebastian-nagel opened a new pull request #537:
URL: https://github.com/apache/nutch/pull/537

- if no agent names are given as command-line arguments, use the values of http.agent.name and http.robots.agents as the agent names to be checked
- update the command-line help

```
$> nutch org.apache.nutch.protocol.RobotRulesParser \
     -Dhttp.agent.name='mybot' \
     -Dhttp.robots.agents='nutch,goodbot' \
     robots.txt urls.txt
Testing robots.txt for agent names: mybot,nutch,goodbot
allowed:    https://www.example.com/

# command-line overwrite:
$> nutch org.apache.nutch.protocol.RobotRulesParser \
     -Dhttp.agent.name='mybot' \
     -Dhttp.robots.agents='nutch,goodbot' \
     robots.txt urls.txt \
     badbot,anybot
Testing robots.txt for agent names: badbot,anybot
not allowed:    https://www.example.com/
```

> RobotsRulesParser command-line checker to use http.robots.agents as fall-back
> ------------------------------------------------------------------------------
>
> Key: NUTCH-2801
> URL: https://issues.apache.org/jira/browse/NUTCH-2801
> Project: Nutch
> Issue Type: Bug
> Components: checker, robots
> Affects Versions: 1.17
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.18
>
> The RobotsRulesParser command-line tool, used to check a list of URLs against one robots.txt file, should use the value of the property {{http.robots.agents}} as fall-back if no user agent names are explicitly given as command-line argument. In this case it should behave the same as the robots.txt parser, looking first for {{http.agent.name}}, then for other names listed in {{http.robots.agents}}, finally picking the rules for {{User-agent: *}}
> {noformat}
> $> cat robots.txt
> User-agent: Nutch
> Allow: /
>
> User-agent: *
> Disallow: /
>
> $> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
>      -Dhttp.agent.name=mybot \
>      -Dhttp.robots.agents='nutch,goodbot' \
>      robots.txt urls.txt
> Testing robots.txt for agent names: mybot,nutch,goodbot
> not allowed:    https://www.example.com/
> {noformat}
> The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading. Only the name "mybot" is actually checked.
[GitHub] [nutch] sebastian-nagel opened a new pull request #537: [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
sebastian-nagel opened a new pull request #537:
URL: https://github.com/apache/nutch/pull/537

- if no agent names are given as command-line arguments, use the values of http.agent.name and http.robots.agents as the agent names to be checked
- update the command-line help

```
$> nutch org.apache.nutch.protocol.RobotRulesParser \
     -Dhttp.agent.name='mybot' \
     -Dhttp.robots.agents='nutch,goodbot' \
     robots.txt urls.txt
Testing robots.txt for agent names: mybot,nutch,goodbot
allowed:    https://www.example.com/

# command-line overwrite:
$> nutch org.apache.nutch.protocol.RobotRulesParser \
     -Dhttp.agent.name='mybot' \
     -Dhttp.robots.agents='nutch,goodbot' \
     robots.txt urls.txt \
     badbot,anybot
Testing robots.txt for agent names: badbot,anybot
not allowed:    https://www.example.com/
```
[jira] [Created] (NUTCH-2801) RobotsRulesParser command-line checker to use http.robots.agents as fall-back
Sebastian Nagel created NUTCH-2801:
--------------------------------------

Summary: RobotsRulesParser command-line checker to use http.robots.agents as fall-back
Key: NUTCH-2801
URL: https://issues.apache.org/jira/browse/NUTCH-2801
Project: Nutch
Issue Type: Bug
Components: checker, robots
Affects Versions: 1.17
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.18


The RobotsRulesParser command-line tool, used to check a list of URLs against one robots.txt file, should use the value of the property {{http.robots.agents}} as fall-back if no user agent names are explicitly given as command-line argument. In this case it should behave the same as the robots.txt parser, looking first for {{http.agent.name}}, then for other names listed in {{http.robots.agents}}, finally picking the rules for {{User-agent: *}}

{noformat}
$> cat robots.txt
User-agent: Nutch
Allow: /

User-agent: *
Disallow: /

$> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
     -Dhttp.agent.name=mybot \
     -Dhttp.robots.agents='nutch,goodbot' \
     robots.txt urls.txt
Testing robots.txt for agent names: mybot,nutch,goodbot
not allowed:    https://www.example.com/
{noformat}

The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading. Only the name "mybot" is actually checked.
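A compact sketch of the described fall-back behaviour (illustrative only, not the code from pull request #537; the class name is invented): if no agent names are passed on the command line, derive them from {{http.agent.name}} and {{http.robots.agents}}.

{noformat}
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class AgentNameFallbackSketch {
  /** Command-line agent names win; otherwise fall back to
   *  http.agent.name followed by the names in http.robots.agents. */
  static String resolveAgentNames(Configuration conf, String cmdLineAgents) {
    if (cmdLineAgents != null && !cmdLineAgents.isEmpty()) {
      return cmdLineAgents;
    }
    Set<String> names = new LinkedHashSet<>();
    String agentName = conf.get("http.agent.name", "").trim();
    if (!agentName.isEmpty()) {
      names.add(agentName);
    }
    for (String name : conf.getTrimmedStrings("http.robots.agents")) {
      names.add(name);
    }
    // If nothing is configured, only the wildcard rules would apply.
    return names.isEmpty() ? "*" : String.join(",", names);
  }

  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    String agents = resolveAgentNames(conf, args.length > 0 ? args[0] : null);
    System.out.println("Testing robots.txt for agent names: " + agents);
  }
}
{noformat}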