[jira] [Commented] (NUTCH-2806) Nutch can't parse links

2020-07-10 Thread Jorge Luis Betancourt Gonzalez (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155785#comment-17155785
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2806:
---

Hi [~immobilier-dz], can you check the value of the {{db.ignore.external.links}} 
setting in your configuration? By default it is set to false, which means that 
Nutch should at least detect the external links and add them for crawling in 
a future crawl. See 
[https://github.com/apache/nutch/blob/2.x/conf/nutch-default.xml#L498-L505]
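
To make the effect of this setting concrete, here is a minimal Java sketch of what an "ignore external links" switch does to a page's outlinks. It is only an illustration under stated assumptions: the class {{ExternalLinkFilterSketch}} and the method {{filterOutlinks}} are hypothetical names, not Nutch code, and the comparison is a plain host match.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Nutch's actual implementation.
class ExternalLinkFilterSketch {

  /**
   * Returns the outlinks to keep for a page. When ignoreExternalLinks is
   * false (the default mentioned above), every outlink is kept. When it is
   * true, only links on the same host as the source page are kept.
   */
  static List<String> filterOutlinks(String pageUrl, List<String> outlinks,
      boolean ignoreExternalLinks) {
    if (!ignoreExternalLinks) {
      return new ArrayList<>(outlinks);
    }
    String pageHost = URI.create(pageUrl).getHost();
    List<String> kept = new ArrayList<>();
    for (String link : outlinks) {
      String host = URI.create(link).getHost();
      // equalsIgnoreCase(null) is false, so a page without a host keeps nothing
      if (host != null && host.equalsIgnoreCase(pageHost)) {
        kept.add(link);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> links = Arrays.asList(
        "https://www.algeriahome.com/contact",
        "https://www.example.org/partner");
    // false: both links are kept
    System.out.println(filterOutlinks("https://www.algeriahome.com/", links, false));
    // true: only the link on www.algeriahome.com is kept
    System.out.println(filterOutlinks("https://www.algeriahome.com/", links, true));
  }
}
```

If only links containing the base URL show up, a setting that behaves like the {{true}} branch here is one thing worth ruling out.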

Finally, keep in mind that it is normally best to send this type of inquiry to 
the users/developers mailing lists 
([https://nutch.apache.org/mailing_lists.html]).

> Nutch can't parse links 
> 
>
> Key: NUTCH-2806
> URL: https://issues.apache.org/jira/browse/NUTCH-2806
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4
>Reporter: lina dziri
>Priority: Major
> Fix For: 2.4
>
>
> Testing with the following site: 
> [https://www.algeriahome.com|https://www.algeriahome.com/], Nutch only parses 
> links that contain the base URL. 
>  I tried Tika as the parser, tried setting db.max.outlinks.per.page to -1, and tried 
> practically every suggestion about detecting all the links. I suspected urlfilter 
> or regex-normalizer, so I disabled them, but got the same results. 
>  Each time I rebuild Nutch and test the parser, it gives the same URL count, 
> around 378. 
>  Can somebody help fix this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (NUTCH-2806) Nutch can't parse links

2020-07-10 Thread lina dziri (Jira)
lina dziri created NUTCH-2806:
-

 Summary: Nutch can't parse links 
 Key: NUTCH-2806
 URL: https://issues.apache.org/jira/browse/NUTCH-2806
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.4
Reporter: lina dziri
 Fix For: 2.4


Testing with the following site: 
[https://www.algeriahome.com|https://www.algeriahome.com/], Nutch only parses 
links that contain the base URL. 
 I tried Tika as the parser, tried setting db.max.outlinks.per.page to -1, and tried 
practically every suggestion about detecting all the links. I suspected urlfilter or 
regex-normalizer, so I disabled them, but got the same results. 
 Each time I rebuild Nutch and test the parser, it gives the same URL count, 
around 378. 
 Can somebody help fix this?





[jira] [Commented] (NUTCH-2805) Rename plugin urlfilter-domainblacklist

2020-07-10 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155728#comment-17155728
 ] 

Lewis John McGibbney commented on NUTCH-2805:
-

Nice

> Rename plugin urlfilter-domainblacklist
> ---
>
> Key: NUTCH-2805
> URL: https://issues.apache.org/jira/browse/NUTCH-2805
> Project: Nutch
>  Issue Type: Sub-task
>  Components: plugin, urlfilter
>Reporter: Sebastian Nagel
>Assignee: Shashanka Balakuntala Srinivasa
>Priority: Major
> Fix For: 1.18
>
>
> As part of NUTCH-2802 the plugin {{urlfilter-domainblacklist}} should be 
> renamed including variable names in Java classes and the file names of 
> configuration files.





[jira] [Commented] (NUTCH-2805) Rename plugin urlfilter-domainblacklist

2020-07-10 Thread Shashanka Balakuntala Srinivasa (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155687#comment-17155687
 ] 

Shashanka Balakuntala Srinivasa commented on NUTCH-2805:


Creating PR in a while



[jira] [Assigned] (NUTCH-2805) Rename plugin urlfilter-domainblacklist

2020-07-10 Thread Shashanka Balakuntala Srinivasa (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shashanka Balakuntala Srinivasa reassigned NUTCH-2805:
--

Assignee: Shashanka Balakuntala Srinivasa



[jira] [Commented] (NUTCH-2803) Rename property http.robot.rules.whitelist

2020-07-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155668#comment-17155668
 ] 

ASF GitHub Bot commented on NUTCH-2803:
---

lewismc opened a new pull request #539:
URL: https://github.com/apache/nutch/pull/539


   This issue addresses https://issues.apache.org/jira/browse/NUTCH-2803
   
   Feels good to be contributing to Nutch again ... !!!
   
   Use of language here... **white** being replaced with **allow**
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Rename property http.robot.rules.whitelist
> --
>
> Key: NUTCH-2803
> URL: https://issues.apache.org/jira/browse/NUTCH-2803
> Project: Nutch
>  Issue Type: Sub-task
>  Components: configuration, robots
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.18
>
>
> As part of NUTCH-2802 the property {{http.robot.rules.whitelist}} should be 
> renamed.
> See the [definition of 
> http.robot.rules.whitelist|http://nutch.apache.org/apidocs/apidocs-1.17/resources/nutch-default.xml#http.robot.rules.whitelist]:
> bq. Comma separated list of hostnames or IP addresses to ignore robot rules 
> parsing for. Use with care and only if you are explicitly allowed by the site 
> owner to ignore the site's robots.txt!





[GitHub] [nutch] lewismc opened a new pull request #539: NUTCH-2803 Rename property http.robot.rules.whitelist

2020-07-10 Thread GitBox


lewismc opened a new pull request #539:
URL: https://github.com/apache/nutch/pull/539


   This issue addresses https://issues.apache.org/jira/browse/NUTCH-2803
   
   Feels good to be contributing to Nutch again ... !!!
   
   Use of language here... **white** being replaced with **allow**
   







[jira] [Assigned] (NUTCH-2803) Rename property http.robot.rules.whitelist

2020-07-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2803:
---

Assignee: Lewis John McGibbney



[jira] [Created] (NUTCH-2805) Rename plugin urlfilter-domainblacklist

2020-07-10 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2805:
--

 Summary: Rename plugin urlfilter-domainblacklist
 Key: NUTCH-2805
 URL: https://issues.apache.org/jira/browse/NUTCH-2805
 Project: Nutch
  Issue Type: Sub-task
  Components: plugin, urlfilter
Reporter: Sebastian Nagel
 Fix For: 1.18


As part of NUTCH-2802 the plugin {{urlfilter-domainblacklist}} should be 
renamed including variable names in Java classes and the file names of 
configuration files.





[jira] [Created] (NUTCH-2804) Rename blacklist/whitelist in configuration of subcollection plugin

2020-07-10 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2804:
--

 Summary: Rename blacklist/whitelist in configuration of 
subcollection plugin
 Key: NUTCH-2804
 URL: https://issues.apache.org/jira/browse/NUTCH-2804
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer, plugin
Reporter: Sebastian Nagel
 Fix For: 1.18


As part of NUTCH-2802 the element names ("blacklist"/"whitelist") in the file 
{{conf/subcollection.xml}}, which configure inclusion/exclusion of documents by URL 
into a "subcollection", should be renamed. Variable names in the Java 
classes should also reflect this change of terminology.





[jira] [Created] (NUTCH-2803) Rename property http.robot.rules.whitelist

2020-07-10 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2803:
--

 Summary: Rename property http.robot.rules.whitelist
 Key: NUTCH-2803
 URL: https://issues.apache.org/jira/browse/NUTCH-2803
 Project: Nutch
  Issue Type: Sub-task
  Components: configuration, robots
Reporter: Sebastian Nagel
 Fix For: 1.18


As part of NUTCH-2802 the property {{http.robot.rules.whitelist}} should be 
renamed.

See the [definition of 
http.robot.rules.whitelist|http://nutch.apache.org/apidocs/apidocs-1.17/resources/nutch-default.xml#http.robot.rules.whitelist]:

bq. Comma separated list of hostnames or IP addresses to ignore robot rules 
parsing for. Use with care and only if you are explicitly allowed by the site 
owner to ignore the site's robots.txt!
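
As a rough illustration of how a comma-separated host list like this one can be consumed, the following Java sketch parses the value into a set and checks a host against it. The class and method names ({{RobotRulesHostListSketch}}, {{parseHostList}}, {{ignoresRobotRules}}) are hypothetical and chosen for this example; this is not the Nutch implementation, and matching is assumed to be an exact, case-insensitive comparison.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; names and matching rules are assumptions.
class RobotRulesHostListSketch {

  /** Splits a comma-separated property value into a set of lower-cased hosts. */
  static Set<String> parseHostList(String value) {
    Set<String> hosts = new LinkedHashSet<>();
    if (value == null) {
      return hosts;
    }
    for (String item : value.split(",")) {
      String host = item.trim().toLowerCase(Locale.ROOT);
      if (!host.isEmpty()) { // skip empty entries such as "a,,b"
        hosts.add(host);
      }
    }
    return hosts;
  }

  /** True if robots.txt rules would be ignored for the given host or IP address. */
  static boolean ignoresRobotRules(Set<String> hosts, String host) {
    return host != null && hosts.contains(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    Set<String> hosts = parseHostList("example.com, 192.0.2.10 ,,");
    System.out.println(hosts);
    System.out.println(ignoresRobotRules(hosts, "Example.com"));
    System.out.println(ignoresRobotRules(hosts, "other.org"));
  }
}
```

Whatever the property ends up being called, the parsing logic itself would not need to change with the rename.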





[jira] [Commented] (NUTCH-2782) protocol-http / lib-http: support TLSv1.3

2020-07-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155641#comment-17155641
 ] 

ASF GitHub Bot commented on NUTCH-2782:
---

balashashanka opened a new pull request #538:
URL: https://github.com/apache/nutch/pull/538


   1. Added the TLSv1.3 and supported cipher suites
   2. Fixed indentation using eclipse-codeformat.xml
   
   





> protocol-http / lib-http: support TLSv1.3
> -
>
> Key: NUTCH-2782
> URL: https://issues.apache.org/jira/browse/NUTCH-2782
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Shashanka Balakuntala Srinivasa
>Priority: Major
>  Labels: help-wanted
> Fix For: 1.18
>
>
> [TLSv1.3| https://en.wikipedia.org/wiki/Transport_Layer_Security#TLS_1.3] 
> (since 2018) is not included in the list of supported protocols in lib-http 
> ([HttpBase.java, line 
> 311|https://github.com/apache/nutch/blob/dcbb0f2bf450c6bec6f45125c68f5c7a0f061474/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L311]).
>  It should be added. Also the list of supported ciphers needs to be updated 
> accordingly.
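
A minimal Java sketch of the general idea, assuming a hypothetical preference list: intersect the protocols you would like to enable with those the running JVM actually supports, so that TLSv1.3 is used where available without failing on older runtimes. The class {{TlsProtocolSketch}} and the {{PREFERRED}} array are assumptions made for this example and are not taken from HttpBase.java; only the standard {{javax.net.ssl}} calls are real API.

```java
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

// Hypothetical illustration only; not the code from lib-http.
class TlsProtocolSketch {

  // Assumed preference order, newest protocol first.
  static final String[] PREFERRED = {"TLSv1.3", "TLSv1.2"};

  /** Keeps the preferred protocols that also appear in the supported list. */
  static String[] enabledProtocols(String[] preferred, String[] supported) {
    List<String> supportedList = Arrays.asList(supported);
    List<String> result = new ArrayList<>();
    for (String protocol : preferred) {
      if (supportedList.contains(protocol)) {
        result.add(protocol);
      }
    }
    return result.toArray(new String[0]);
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    // Protocols supported by the default SSLContext of this JVM
    String[] supported =
        SSLContext.getDefault().getSupportedSSLParameters().getProtocols();
    // The result could be passed to SSLSocket.setEnabledProtocols(String[])
    System.out.println(Arrays.toString(enabledProtocols(PREFERRED, supported)));
  }
}
```

The printed list depends on the JVM: runtimes without TLSv1.3 support would simply omit it.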





[GitHub] [nutch] balashashanka opened a new pull request #538: NUTCH-2782: protocol-http / lib-http: support TLSv1.3

2020-07-10 Thread GitBox


balashashanka opened a new pull request #538:
URL: https://github.com/apache/nutch/pull/538


   1. Added the TLSv1.3 and supported cipher suites
   2. Fixed indentation using eclipse-codeformat.xml
   
   







[jira] [Commented] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology

2020-07-10 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155638#comment-17155638
 ] 

Sebastian Nagel commented on NUTCH-2802:


Thanks, [~lewismc]! But we should rather open sub-tasks (I'll open them within the 
next few minutes). That way every change ends up in the change log.

> Replace blacklist/whitelist by more inclusive and precise terminology
> -
>
> Key: NUTCH-2802
> URL: https://issues.apache.org/jira/browse/NUTCH-2802
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.18
>
>
> The terms blacklist and whitelist should be replaced by a more inclusive and 
> more precise terminology, see the proposal and discussion on the @dev mailing 
> list 
> ([1|https://lists.apache.org/thread.html/r43789859e45e6c961c4838f27f84f1e487691dbbbcb0a633deeb9fdb%40%3Cdev.nutch.apache.org%3E],
>  
> [2|https://lists.apache.org/thread.html/r8f8341b53a02c141dbcecbcf9a4c1988d89f461cba1f8b0019bc7192%40%3Cdev.nutch.apache.org%3E]).
> This is an umbrella issue, subtasks to be opened for individual plugins and 
> configuration properties.





[jira] [Commented] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology

2020-07-10 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155635#comment-17155635
 ] 

Lewis John McGibbney commented on NUTCH-2802:
-

[~snagel] thanks for opening this one. I'll go ahead and create a PR shortly. 



[jira] [Assigned] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology

2020-07-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2802:
---

Assignee: Lewis John McGibbney



[jira] [Created] (NUTCH-2802) Replace blacklist/whitelist by more inclusive and precise terminology

2020-07-10 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2802:
--

 Summary: Replace blacklist/whitelist by more inclusive and precise 
terminology
 Key: NUTCH-2802
 URL: https://issues.apache.org/jira/browse/NUTCH-2802
 Project: Nutch
  Issue Type: Improvement
  Components: configuration, plugin
Reporter: Lewis John McGibbney
 Fix For: 1.18


The terms blacklist and whitelist should be replaced by a more inclusive and 
more precise terminology, see the proposal and discussion on the @dev mailing 
list 
([1|https://lists.apache.org/thread.html/r43789859e45e6c961c4838f27f84f1e487691dbbbcb0a633deeb9fdb%40%3Cdev.nutch.apache.org%3E],
 
[2|https://lists.apache.org/thread.html/r8f8341b53a02c141dbcecbcf9a4c1988d89f461cba1f8b0019bc7192%40%3Cdev.nutch.apache.org%3E]).

This is an umbrella issue, subtasks to be opened for individual plugins and 
configuration properties.





[jira] [Commented] (NUTCH-2801) RobotsRulesParser command-line checker to use http.robots.agents as fall-back

2020-07-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155507#comment-17155507
 ] 

ASF GitHub Bot commented on NUTCH-2801:
---

sebastian-nagel commented on a change in pull request #537:
URL: https://github.com/apache/nutch/pull/537#discussion_r452865637



##
File path: src/java/org/apache/nutch/protocol/RobotRulesParser.java
##
@@ -376,13 +379,18 @@ public int run(String[] args) {
*/
   private static class TestRobotRulesParser extends RobotRulesParser {
 
-public TestRobotRulesParser(Configuration conf) {
+public void setConf(Configuration conf) {
   // make sure that agent name is set so that setConf() does not complain,

Review comment:
   Thanks. You're right, the comment wasn't up to date. It would have been 
simpler to drop the command-line override of the checked agent names and rely 
only on properties, but I didn't want to break the existing behavior.







> RobotsRulesParser command-line checker to use http.robots.agents as fall-back
> -
>
> Key: NUTCH-2801
> URL: https://issues.apache.org/jira/browse/NUTCH-2801
> Project: Nutch
>  Issue Type: Bug
>  Components: checker, robots
>Affects Versions: 1.17
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.18
>
>
> The RobotRulesParser command-line tool, used to check a list of URLs against 
> one robots.txt file, should use the value of the property 
> {{http.robots.agents}} as fall-back if no user agent names are explicitly 
> given as command-line arguments. In this case it should behave the same as the 
> robots.txt parser, looking first for {{http.agent.name}}, then for other 
> names listed in {{http.robots.agents}}, finally picking the rules for 
> {{User-agent: *}}.
> {noformat}
> $> cat robots.txt
> User-agent: Nutch
> Allow: /
> User-agent: *
> Disallow: /
> $> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
>   -Dhttp.agent.name=mybot \
>   -Dhttp.robots.agents='nutch,goodbot' \
>   robots.txt urls.txt 
> Testing robots.txt for agent names: mybot,nutch,goodbot
> not allowed:https://www.example.com/
> {noformat}
> The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading. 
> Only the name "mybot" is actually checked.





[GitHub] [nutch] sebastian-nagel commented on a change in pull request #537: [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back

2020-07-10 Thread GitBox


sebastian-nagel commented on a change in pull request #537:
URL: https://github.com/apache/nutch/pull/537#discussion_r452865637



##
File path: src/java/org/apache/nutch/protocol/RobotRulesParser.java
##
@@ -376,13 +379,18 @@ public int run(String[] args) {
*/
   private static class TestRobotRulesParser extends RobotRulesParser {
 
-public TestRobotRulesParser(Configuration conf) {
+public void setConf(Configuration conf) {
   // make sure that agent name is set so that setConf() does not complain,

Review comment:
   Thanks. You're right, the comment wasn't up to date. It would have been 
simpler to drop the command-line override of the checked agent names and rely 
only on properties, but I didn't want to break the existing behavior.









[GitHub] [nutch] balashashanka commented on a change in pull request #537: [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back

2020-07-10 Thread GitBox


balashashanka commented on a change in pull request #537:
URL: https://github.com/apache/nutch/pull/537#discussion_r452858141



##
File path: src/java/org/apache/nutch/protocol/RobotRulesParser.java
##
@@ -376,13 +379,18 @@ public int run(String[] args) {
*/
   private static class TestRobotRulesParser extends RobotRulesParser {
 
-public TestRobotRulesParser(Configuration conf) {
+public void setConf(Configuration conf) {
   // make sure that agent name is set so that setConf() does not complain,

Review comment:
   Hi @sebastian-nagel, a small question: is this comment still valid? If the 
user agent is not provided, we will check with http.robots.agents and the 
others, right? 









[jira] [Commented] (NUTCH-2801) RobotsRulesParser command-line checker to use http.robots.agents as fall-back

2020-07-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155490#comment-17155490
 ] 

ASF GitHub Bot commented on NUTCH-2801:
---

balashashanka commented on a change in pull request #537:
URL: https://github.com/apache/nutch/pull/537#discussion_r452858141



##
File path: src/java/org/apache/nutch/protocol/RobotRulesParser.java
##
@@ -376,13 +379,18 @@ public int run(String[] args) {
*/
   private static class TestRobotRulesParser extends RobotRulesParser {
 
-public TestRobotRulesParser(Configuration conf) {
+public void setConf(Configuration conf) {
   // make sure that agent name is set so that setConf() does not complain,

Review comment:
   Hi @sebastian-nagel, a small question: is this comment still valid? If the 
user agent is not provided, we will check with http.robots.agents and the 
others, right? 







[jira] [Commented] (NUTCH-2801) RobotsRulesParser command-line checker to use http.robots.agents as fall-back

2020-07-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155472#comment-17155472
 ] 

ASF GitHub Bot commented on NUTCH-2801:
---

sebastian-nagel opened a new pull request #537:
URL: https://github.com/apache/nutch/pull/537


   - if no agent names are given as command-line arguments, use the values 
of http.agent.name and http.robots.agents as the agent names to be checked
   - update command-line help
   
   ```
   $> nutch org.apache.nutch.protocol.RobotRulesParser \
 -Dhttp.agent.name='mybot' \
 -Dhttp.robots.agents='nutch,goodbot' \
 robots.txt urls.txt 
   Testing robots.txt for agent names: mybot,nutch,goodbot
   allowed:https://www.example.com/
   
   # command-line overwrite:
   $> nutch org.apache.nutch.protocol.RobotRulesParser \
 -Dhttp.agent.name='mybot' \
 -Dhttp.robots.agents='nutch,goodbot' \
 robots.txt urls.txt \
 badbot,anybot
   Testing robots.txt for agent names: badbot,anybot
   not allowed:https://www.example.com/
   ```
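
The selection of agent names described in this pull request can be sketched in a few lines of Java. This is only an illustration, assuming the behavior shown in the two example runs above: names given on the command line win, otherwise {{http.agent.name}} comes first, followed by the names in {{http.robots.agents}}. The class {{AgentNameSelectionSketch}} and its methods are hypothetical and are not the code changed in the pull request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code from RobotRulesParser.java.
class AgentNameSelectionSketch {

  /** Splits a comma-separated list, trimming names and dropping empties and repeats. */
  static List<String> splitNames(String commaSeparated) {
    List<String> names = new ArrayList<>();
    if (commaSeparated == null) {
      return names;
    }
    for (String part : commaSeparated.split(",")) {
      String name = part.trim();
      if (!name.isEmpty() && !names.contains(name)) {
        names.add(name);
      }
    }
    return names;
  }

  /**
   * Command-line names take precedence. Without them, fall back to
   * http.agent.name followed by the names in http.robots.agents.
   */
  static List<String> agentNamesToCheck(String cliAgentNames,
      String httpAgentName, String httpRobotsAgents) {
    List<String> fromCli = splitNames(cliAgentNames);
    if (!fromCli.isEmpty()) {
      return fromCli;
    }
    List<String> names = splitNames(httpAgentName);
    for (String name : splitNames(httpRobotsAgents)) {
      if (!names.contains(name)) {
        names.add(name);
      }
    }
    return names;
  }

  public static void main(String[] args) {
    // prints "mybot,nutch,goodbot"
    System.out.println(String.join(",",
        agentNamesToCheck(null, "mybot", "nutch,goodbot")));
    // prints "badbot,anybot"
    System.out.println(String.join(",",
        agentNamesToCheck("badbot,anybot", "mybot", "nutch,goodbot")));
  }
}
```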





[GitHub] [nutch] sebastian-nagel opened a new pull request #537: [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back

2020-07-10 Thread GitBox


sebastian-nagel opened a new pull request #537:
URL: https://github.com/apache/nutch/pull/537


   - if no agent names are given as command-line arguments, use the values 
of http.agent.name and http.robots.agents as the agent names to be checked
   - update command-line help
   
   ```
   $> nutch org.apache.nutch.protocol.RobotRulesParser \
 -Dhttp.agent.name='mybot' \
 -Dhttp.robots.agents='nutch,goodbot' \
 robots.txt urls.txt 
   Testing robots.txt for agent names: mybot,nutch,goodbot
   allowed:https://www.example.com/
   
   # command-line overwrite:
   $> nutch org.apache.nutch.protocol.RobotRulesParser \
 -Dhttp.agent.name='mybot' \
 -Dhttp.robots.agents='nutch,goodbot' \
 robots.txt urls.txt \
 badbot,anybot
   Testing robots.txt for agent names: badbot,anybot
   not allowed:https://www.example.com/
   ```







[jira] [Created] (NUTCH-2801) RobotsRulesParser command-line checker to use http.robots.agents as fall-back

2020-07-10 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2801:
--

 Summary: RobotsRulesParser command-line checker to use 
http.robots.agents as fall-back
 Key: NUTCH-2801
 URL: https://issues.apache.org/jira/browse/NUTCH-2801
 Project: Nutch
  Issue Type: Bug
  Components: checker, robots
Affects Versions: 1.17
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.18


The RobotRulesParser command-line tool, used to check a list of URLs against 
one robots.txt file, should use the value of the property 
{{http.robots.agents}} as fall-back if no user agent names are explicitly given 
as command-line arguments. In this case it should behave the same as the robots.txt 
parser, looking first for {{http.agent.name}}, then for other names listed in 
{{http.robots.agents}}, finally picking the rules for {{User-agent: *}}.

{noformat}
$> cat robots.txt
User-agent: Nutch
Allow: /
User-agent: *
Disallow: /

$> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
  -Dhttp.agent.name=mybot \
  -Dhttp.robots.agents='nutch,goodbot' \
  robots.txt urls.txt 
Testing robots.txt for agent names: mybot,nutch,goodbot
not allowed:https://www.example.com/
{noformat}

The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading. Only 
the name "mybot" is actually checked.
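
The difference between checking only "mybot" and checking the full fall-back list can be reproduced with a small Java sketch. It is a deliberately simplified illustration, not the robots.txt parser Nutch uses: the class {{RobotsAgentFallbackSketch}} is hypothetical, agent names are matched by exact case-insensitive comparison, each {{User-agent}} line starts its own group, and the first rule whose path prefix matches decides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified illustration; not the parser used by Nutch.
class RobotsAgentFallbackSketch {

  /** Maps each lower-cased User-agent name to the Allow/Disallow rules of its group. */
  static Map<String, List<String>> parseGroups(String robotsTxt) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    List<String> current = null;
    for (String raw : robotsTxt.split("\n")) {
      String line = raw.trim();
      int colon = line.indexOf(':');
      if (colon < 0) {
        continue; // not a "key: value" line
      }
      String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();
      if (key.equals("user-agent")) {
        current = groups.computeIfAbsent(
            value.toLowerCase(Locale.ROOT), k -> new ArrayList<>());
      } else if (current != null
          && (key.equals("allow") || key.equals("disallow"))) {
        current.add(key + ":" + value);
      }
    }
    return groups;
  }

  /** Returns the first agent name that has a group, else "*" if present, else null. */
  static String selectGroup(Map<String, List<String>> groups,
      List<String> agentNames) {
    for (String name : agentNames) {
      String key = name.toLowerCase(Locale.ROOT);
      if (groups.containsKey(key)) {
        return key;
      }
    }
    return groups.containsKey("*") ? "*" : null;
  }

  /** First rule with a matching path prefix decides; no matching rule means allowed. */
  static boolean isAllowed(List<String> rules, String path) {
    if (rules == null) {
      return true;
    }
    for (String rule : rules) {
      int colon = rule.indexOf(':');
      String type = rule.substring(0, colon);
      String prefix = rule.substring(colon + 1);
      if (!prefix.isEmpty() && path.startsWith(prefix)) {
        return type.equals("allow");
      }
    }
    return true;
  }

  public static void main(String[] args) {
    String robots = "User-agent: Nutch\nAllow: /\nUser-agent: *\nDisallow: /\n";
    Map<String, List<String>> groups = parseGroups(robots);
    // Only "mybot" checked: falls through to "*", so "/" is not allowed
    String onlyMyBot = selectGroup(groups, Arrays.asList("mybot"));
    System.out.println(onlyMyBot + " -> " + isAllowed(groups.get(onlyMyBot), "/"));
    // Full fall-back list: "nutch" matches, so "/" is allowed
    String withFallback = selectGroup(groups, Arrays.asList("mybot", "nutch", "goodbot"));
    System.out.println(withFallback + " -> " + isAllowed(groups.get(withFallback), "/"));
  }
}
```

With the robots.txt from the example above, the first call selects the {{*}} group and reports "/" as not allowed, while the second selects the {{nutch}} group and reports it as allowed.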


