[jira] [Created] (NUTCH-1987) Make bin/crawl indexer agnostic
Michael Joyce created NUTCH-1987: Summary: Make bin/crawl indexer agnostic Key: NUTCH-1987 URL: https://issues.apache.org/jira/browse/NUTCH-1987 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Michael Joyce Fix For: 1.10 The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance, when I want to use the indexer-elastic plugin I still need to call the crawler script with a fake Solr URL, otherwise it will skip the indexing step altogether. {code} bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1 {code} It would be nice to keep configuration for the Solr indexer in the conf files (to mirror the Elasticsearch indexer conf and others) and to make the indexing parameter simply toggle whether indexing does or doesn't occur, instead of also trying to configure the indexer at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings
[ https://issues.apache.org/jira/browse/NUTCH-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496366#comment-14496366 ] ASF GitHub Bot commented on NUTCH-1986: --- GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/17 NUTCH-1986 - Update and clarify default Elasticsearch conf values - Host value is now defaulted to 'localhost'. - Update port description to make it apparent that 9300 is more likely the value you want to use. This should keep people from setting it to the more commonly seen 9200 and breaking their connections. - Set the cluster default value to the default Elasticsearch cluster name of 'elasticsearch'. Also updated the description to make it evident that this value still needs to be changed if you're connecting via host/port and your cluster name is something other than the default. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MJJoyce/nutch NUTCH-1986 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/17.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17 commit f2e30595a450ae788f0b996899b06193d15fd2d7 Author: Michael Joyce mltjo...@gmail.com Date: 2015-04-15T15:24:27Z NUTCH-1986 - Update and clarify default Elasticsearch conf values - Host value is now defaulted to 'localhost'. - Update port description to make it apparent that 9300 is more likely the value you want to use. This should keep people from setting it to the more commonly seen 9200 and breaking their connections. - Set the cluster default value to the default Elasticsearch cluster name of 'elasticsearch'. Also updated the description to make it evident that this value still needs to be changed if you're connecting via host/port and your cluster name is something other than the default. Clarify Elastic Search Indexer Plugin Settings -- Key: NUTCH-1986 URL: https://issues.apache.org/jira/browse/NUTCH-1986 Project: Nutch Issue Type: Improvement Components: documentation, indexer, plugin Affects Versions: 1.9 Reporter: Michael Joyce Fix For: 1.10 I was working on getting indexing into Elasticsearch working and realized that the majority of my difficulties were simply me misunderstanding what the config needed. Patch incoming to hopefully clarify what is needed by default, what each option does, and add any helpful defaults. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
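For readers who hit the 9200-vs-9300 confusion this patch is addressing, a quick local sanity check can tell the two ports apart (generic Elasticsearch behaviour, assuming a node on localhost with default settings; not part of the patch itself):

{code}
# 9200 is the HTTP/REST port: it answers plain HTTP and reports the cluster name.
curl http://localhost:9200/
# 9300 is the binary transport port used by the Java TransportClient that the
# indexer-elastic plugin connects with; it does not speak HTTP, so the plugin's
# port setting should point at 9300, not 9200.
{code}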
[Nutch Wiki] Update of WhiteListRobots by ChrisMattmann
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The WhiteListRobots page has been changed by ChrisMattmann: https://wiki.apache.org/nutch/WhiteListRobots Comment: - initial page New page: Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list for robots.txt]] capability that can be used to selectively turn robots.txt parsing on or off on a per-host and/or per-IP basis. Read on to find out how to use it.
[jira] [Commented] (NUTCH-1972) Dockerfile for Nutch 1.x
[ https://issues.apache.org/jira/browse/NUTCH-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496257#comment-14496257 ] Michael Joyce commented on NUTCH-1972: -- Awesome, thanks for merging [~chrismattmann]!! Dockerfile for Nutch 1.x Key: NUTCH-1972 URL: https://issues.apache.org/jira/browse/NUTCH-1972 Project: Nutch Issue Type: Improvement Components: deployment Reporter: Michael Joyce Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.10 Attachments: Joyce-NUTCH-1792-patch.txt Hi folks, I noticed that there was a Dockerfile for Nutch 2.x but I didn't see anything for 1.x. I figured I would throw something up real quick. Note that this currently doesn't install Solr. I didn't need it at the time when I was making this, but I'll work on getting it added before too long. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings
Michael Joyce created NUTCH-1986: Summary: Clarify Elastic Search Indexer Plugin Settings Key: NUTCH-1986 URL: https://issues.apache.org/jira/browse/NUTCH-1986 Project: Nutch Issue Type: Improvement Components: documentation, indexer, plugin Affects Versions: 1.9 Reporter: Michael Joyce Fix For: 1.10 I was working on getting indexing into Elasticsearch working and realized that the majority of my difficulties were simply me misunderstanding what the config needed. Patch incoming to hopefully clarify what is needed by default, what each option does, and add any helpful defaults. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426 ] Michael Joyce commented on NUTCH-1987: -- Hi folks, I'll have a patch up in a bit for this. I think my current plan to minimize the number of changes that I'm shoving into a single patch is to: * Add solr.server.url to nutch-default and set the value to some sane default (http://127.0.0.1:8983/solr/) * Make the 'index' calls in the bin/nutch script generic and slightly change the call format. * Update some variable names and echoes in the bin/crawl script so it doesn't only mention Solr and confuse people. I envision a call being something similar to this after these changes: {code} # Run the indexer bin/crawl urls/ crawl/ run_indexer 1 # Don't run the indexer bin/crawl urls/ crawl/ 1 {code} I don't think this is necessarily the ideal solution but it minimizes changes to the calling format for people with existing setups and only really requires that a single configuration value is added/updated. Note, this change obviously requires some/many documentation updates. I'm more than happy to help with those as well but I wasn't including them in this ticket. Thoughts? Make bin/crawl indexer agnostic --- Key: NUTCH-1987 URL: https://issues.apache.org/jira/browse/NUTCH-1987 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Michael Joyce Fix For: 1.10 The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance, when I want to use the indexer-elastic plugin I still need to call the crawler script with a fake Solr URL, otherwise it will skip the indexing step altogether. {code} bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1 {code} It would be nice to keep configuration for the Solr indexer in the conf files (to mirror the Elasticsearch indexer conf and others) and to make the indexing parameter simply toggle whether indexing does or doesn't occur, instead of also trying to configure the indexer at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426 ] Michael Joyce edited comment on NUTCH-1987 at 4/15/15 3:54 PM: --- Hi folks, I'll have a patch up in a bit for this. I think my current plan to minimize the number of changes that I'm shoving into a single patch is to: * Add solr.server.url to nutch-default and set the value to some sane default (http://127.0.0.1:8983/solr/) * Make the 'index' calls in the bin/nutch script generic and slightly change the call format. * Update some variable names and echoes in the bin/crawl script so it doesn't only mention Solr and confuse people. I envision a call being something similar to this after these changes: {code} # Run the indexer bin/crawl urls/ crawl/ run_indexer 1 # Don't run the indexer bin/crawl urls/ crawl/ 1 {code} I don't think this is necessarily the ideal solution but it minimizes call format changes for people with existing setups and only really requires that a single configuration value is added/updated if you want to keep using Solr on an existing setup. Note, this change obviously requires documentation updates. I'm more than happy to help with those as well but I wasn't including them in this ticket. Thoughts? was (Author: mjoyce): Hi folks, I'll have a patch up in a bit for this. I think my current plan to minimize the number of changes that I'm shoving into a single patch is to: * Add solr.server.url to nutch-default and set the value to some sane default (http://127.0.0.1:8983/solr/) * Make the 'index' calls in the bin/nutch script generic and slightly change the call format. * Update some variable names and echoes in the bin/crawl script so it doesn't only mention Solr and confuse people. I envision a call being something similar to this after these changes: {code} # Run the indexer bin/crawl urls/ crawl/ run_indexer 1 # Don't run the indexer bin/crawl urls/ crawl/ 1 {code} I don't think this is necessarily the ideal solution but it minimizes changes to the calling format for people with existing setups and only really requires that a single configuration value is added/updated. Note, this change obviously requires some/many documentation updates. I'm more than happy to help with those as well but I wasn't including them in this ticket. Thoughts? Make bin/crawl indexer agnostic --- Key: NUTCH-1987 URL: https://issues.apache.org/jira/browse/NUTCH-1987 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Michael Joyce Fix For: 1.10 The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance, when I want to use the indexer-elastic plugin I still need to call the crawler script with a fake Solr URL, otherwise it will skip the indexing step altogether. {code} bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1 {code} It would be nice to keep configuration for the Solr indexer in the conf files (to mirror the Elasticsearch indexer conf and others) and to make the indexing parameter simply toggle whether indexing does or doesn't occur, instead of also trying to configure the indexer at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496621#comment-14496621 ] ASF GitHub Bot commented on NUTCH-1987: --- GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/18 NUTCH-1987 - Make bin/crawl indexer agnostic - Add solr.server.url property to nutch-default and set it to a value consistent with the URL used in the Nutch Tutorial. - Change SOLRURL references to INDEXFLAG for consistency. - Update all occurrences of crawl usage strings to no longer reference solrURL and instead mention an optional run_indexer string. - Update the indexer section to no longer set the Solr URL property and remove Solr references from prints. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MJJoyce/nutch NUTCH-1987 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/18.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18 commit a39de23453a6f8ea2a9ab2a94872af3305f16021 Author: Michael Joyce mltjo...@gmail.com Date: 2015-04-15T17:41:36Z NUTCH-1987 - Make bin/crawl indexer agnostic - Add solr.server.url property to nutch-default and set it to a value consistent with the URL used in the Nutch Tutorial. - Change SOLRURL references to INDEXFLAG for consistency. - Update all occurrences of crawl usage strings to no longer reference solrURL and instead mention an optional run_indexer string. - Update the indexer section to no longer set the Solr URL property and remove Solr references from prints. Make bin/crawl indexer agnostic --- Key: NUTCH-1987 URL: https://issues.apache.org/jira/browse/NUTCH-1987 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Michael Joyce Fix For: 1.10 The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance, when I want to use the indexer-elastic plugin I still need to call the crawler script with a fake Solr URL, otherwise it will skip the indexing step altogether. {code} bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1 {code} It would be nice to keep configuration for the Solr indexer in the conf files (to mirror the Elasticsearch indexer conf and others) and to make the indexing parameter simply toggle whether indexing does or doesn't occur, instead of also trying to configure the indexer at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
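For orientation, a minimal sketch of what the reworked indexing step in bin/crawl could look like after a change along these lines; the variable names (INDEXFLAG, CRAWL_PATH, SEGMENT) and messages are illustrative assumptions, not the literal contents of the pull request:

{code}
# Only index if the caller passed the run_indexer flag (sketch, not the actual patch).
if [ "$INDEXFLAG" = "run_indexer" ]; then
  echo "Indexing $SEGMENT with the configured indexing plugins"
  # The back-end (Solr, Elasticsearch, ...) is now selected purely via
  # plugin.includes and the indexer properties in nutch-site.xml; no URL
  # is passed on the command line any more.
  "$bin/nutch" index "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
else
  echo "Skipping indexing: run_indexer flag not set"
fi
{code}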
[jira] [Created] (NUTCH-1988) Make nested output directory dump optional
Michael Joyce created NUTCH-1988: Summary: Make nested output directory dump optional Key: NUTCH-1988 URL: https://issues.apache.org/jira/browse/NUTCH-1988 Project: Nutch Issue Type: Improvement Components: dumpers Affects Versions: 1.9 Reporter: Michael Joyce Fix For: 1.10 NUTCH-1957 added nested directories to the bin/nutch dump output to help avoid naming conflicts in output files. It would be nice to be able to specify that you want the older flat directory output as an optional parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional
[ https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496755#comment-14496755 ] Michael Joyce commented on NUTCH-1988: -- Hi folks. Here's an example output run of this. {code} [mjjoyce@machine local]$ bin/nutch dump -outputDir ./foodir -segment ../local_elasticsearch_testt/crawl/segments/ [mjjoyce@machine local]$ bin/nutch dump -flatdir -outputDir ./foodir2 -segment ../local_elasticsearch_testt/crawl/segments/ [mjjoyce@machine local]$ ls -R foodir foodir: 8f f8 foodir/8f: a7 foodir/8f/a7: 8d84f847f7310620a9edc4327bbfc133_.html foodir/f8: df foodir/f8/df: fec7849283af7a0adc77eddefb242b6e_.html [mjjoyce@machine local]$ ls -R foodir2 foodir2: 8d84f847f7310620a9edc4327bbfc133_.html fec7849283af7a0adc77eddefb242b6e_.html [mjjoyce@machine local]$ {code} Make nested output directory dump optional -- Key: NUTCH-1988 URL: https://issues.apache.org/jira/browse/NUTCH-1988 Project: Nutch Issue Type: Improvement Components: dumpers Affects Versions: 1.9 Reporter: Michael Joyce Priority: Minor Fix For: 1.10 NUTCH-1957 added nested directories to the bin/nutch dump output to help avoid naming conflicts in output files. It would be nice to be able to specify that you want the older flat directory output as an optional parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional
[ https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496751#comment-14496751 ] ASF GitHub Bot commented on NUTCH-1988: --- GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/19 NUTCH-1988 - Add optional flat directory flag to dump command - Add optional flatdir flag to dump command so that a user can dump their crawl data to a flat directory instead of the nested structure added in NUTCH-1957. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MJJoyce/nutch NUTCH-1988 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/19.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19 commit 40ca3e576781328b9b5afc22548a93bfd3df75bd Author: Michael Joyce mltjo...@gmail.com Date: 2015-04-15T19:19:07Z NUTCH-1988 - Add optional flat directory flag to dump command - Add optional flatdir flag to dump command so that a user can dump their crawl data to a flat directory instead of the nested structure added in NUTCH-1957. Make nested output directory dump optional -- Key: NUTCH-1988 URL: https://issues.apache.org/jira/browse/NUTCH-1988 Project: Nutch Issue Type: Improvement Components: dumpers Affects Versions: 1.9 Reporter: Michael Joyce Priority: Minor Fix For: 1.10 NUTCH-1957 added nested directories to the bin/nutch dump output to help avoid naming conflicts in output files. It would be nice to be able to specify that you want the older flat directory output as an optional parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1988) Make nested output directory dump optional
[ https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-1988: - Priority: Minor (was: Major) Make nested output directory dump optional -- Key: NUTCH-1988 URL: https://issues.apache.org/jira/browse/NUTCH-1988 Project: Nutch Issue Type: Improvement Components: dumpers Affects Versions: 1.9 Reporter: Michael Joyce Priority: Minor Fix For: 1.10 NUTCH-1957 added nested directories to the bin/nutch dump output to help avoid naming conflicts in output files. It would be nice to be able to specify that you want the older flat directory output as an optional parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[Nutch Wiki] Update of WhiteListRobots by ChrisMattmann
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The WhiteListRobots page has been changed by ChrisMattmann: https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=2&rev2=3 Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list for robots.txt]] capability that can be used to selectively turn robots.txt parsing on or off on a per-host and/or per-IP basis. Read on to find out how to use it. - = List hostnames and/or IP addresses in Nutch conf = + == List hostnames and/or IP addresses in Nutch conf == In the Nutch configuration directory (conf/), edit nutch-default.xml (and/or nutch-site.xml) and add the following information: @@ -28, +28 @@ </property> }}} - = Testing the configuration = + == Testing the configuration == Create a sample URLs file to test your whitelist. For example, create a file, call it "url" (without the quotes), and store each URL on a line: @@ -44, +44 @@ Disallow: / }}} - = Build the Nutch runtime and execute RobotRulesParser = + == Build the Nutch runtime and execute RobotRulesParser == Now, build the Nutch runtime, e.g., by running ```ant runtime```. From your ```runtime/local/``` directory, run this command:
[Nutch Wiki] Update of WhiteListRobots by ChrisMattmann
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The WhiteListRobots page has been changed by ChrisMattmann: https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=1&rev2=2 Comment: - Add example docs + = White List for Robots.txt = + Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list for robots.txt]] capability that can be used to selectively turn robots.txt parsing on or off on a per-host and/or per-IP basis. Read on to find out how to use it. + = List hostnames and/or IP addresses in Nutch conf = + + In the Nutch configuration directory (conf/), edit nutch-default.xml (and/or nutch-site.xml) and add the following information: + + {{{ + <property> + <name>robot.rules.whitelist</name> + <value></value> + <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description> + </property> + }}} + + For example, try this to whitelist the host baron.pagemewhen.com: + + {{{ + <property> + <name>robot.rules.whitelist</name> + <value>baron.pagemewhen.com</value> + <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description> + </property> + }}} + + = Testing the configuration = + + Create a sample URLs file to test your whitelist. For example, create a file, call it "url" (without the quotes), and store each URL on a line: + + {{{ + http://baron.pagemewhen.com/~chris/foo1.txt + http://baron.pagemewhen.com/~chris/ + }}} + + Create a sample robots.txt file, e.g., "robots.txt" (without the quotes): + + {{{ + User-agent: * + Disallow: / + }}} + + = Build the Nutch runtime and execute RobotRulesParser = + + Now, build the Nutch runtime, e.g., by running ```ant runtime```. + From your ```runtime/local/``` directory, run this command: + + {{{ + java -cp build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.1.1.jar org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler + }}} + + You should see the following output: + + {{{ + Robots: whitelist: [baron.pagemewhen.com] + Apr 14, 2015 11:53:20 PM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed + INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored + allowed: http://baron.pagemewhen.com/~chris/foo1.txt + Apr 14, 2015 11:53:20 PM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed + INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored + allowed: http://baron.pagemewhen.com/~chris/ + }}} +
[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497152#comment-14497152 ] Chris A. Mattmann commented on NUTCH-1927: -- Added some documentation here: https://wiki.apache.org/nutch/WhiteListRobots Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing --- Key: NUTCH-1927 URL: https://issues.apache.org/jira/browse/NUTCH-1927 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Labels: available, patch Fix For: 1.10 Attachments: NUTCH-1927.Mattmann.041115.patch.txt, NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt Based on discussion on the dev list, to support some valid security-research use cases for Nutch (DDoS, DNS, and other testing), I am going to create a patch that allows a whitelist: {code:xml} <property> <name>robot.rules.whitelist</name> <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value> <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description> </property> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional
[ https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497149#comment-14497149 ] Sebastian Nagel commented on NUTCH-1988: +1 Alternatively, this could be {{-dirlevels n}}, where n=0 would be equivalent to {{-flatdir}}. Make nested output directory dump optional -- Key: NUTCH-1988 URL: https://issues.apache.org/jira/browse/NUTCH-1988 Project: Nutch Issue Type: Improvement Components: dumpers Affects Versions: 1.9 Reporter: Michael Joyce Priority: Minor Fix For: 1.10 NUTCH-1957 added nested directories to the bin/nutch dump output to help avoid naming conflicts in output files. It would be nice to be able to specify that you want the older flat directory output as an optional parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
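For illustration, invocations under that proposal might look like the following; {{-dirlevels}} is only a suggestion at this point, not an implemented option:

{code}
# Hypothetical usage if -dirlevels were added (not yet implemented):
bin/nutch dump -outputDir ./foodir -segment crawl/segments/ -dirlevels 2   # nested output, as today
bin/nutch dump -outputDir ./foodir -segment crawl/segments/ -dirlevels 0   # flat output, same as -flatdir
{code}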
[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497251#comment-14497251 ] Sebastian Nagel commented on NUTCH-1927: Hi Chris, the class WhiteListRobotRules still seems overly complex to me. It should be possible to keep the cache as is and only put a reference to a light-weight singleton RobotRules object (such as the one created by the default constructor of WhiteListRobotRules) in case a host is whitelisted. Also, I do not understand why getCrawlDelay() needs to store the last URL: the Crawl-Delay specified in the robots.txt can be used to override the default delay/interval when a robot/crawler accesses the same host successively; it's a fixed value and does not depend on any previous fetches. Don't know whether this is a problem: we (almost) everywhere use org.slf4j.Logger and not java.util.logging.Logger. Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing --- Key: NUTCH-1927 URL: https://issues.apache.org/jira/browse/NUTCH-1927 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Labels: available, patch Fix For: 1.10 Attachments: NUTCH-1927.Mattmann.041115.patch.txt, NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt Based on discussion on the dev list, to support some valid security-research use cases for Nutch (DDoS, DNS, and other testing), I am going to create a patch that allows a whitelist: {code:xml} <property> <name>robot.rules.whitelist</name> <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value> <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description> </property> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 33112: NUTCH-1927: Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33112/ --- (Updated April 16, 2015, 2:19 a.m.) Review request for nutch. Bugs: NUTCH-1927 https://issues.apache.org/jira/browse/NUTCH-1927 Repository: nutch Description --- Based on discussion on the dev list, to support some valid security-research use cases for Nutch (DDoS, DNS, and other testing), I am going to create a patch that allows a whitelist: <property> <name>robot.rules.whitelist</name> <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value> <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description> </property> Diffs (updated) - ./trunk/CHANGES.txt 1673623 ./trunk/conf/nutch-default.xml 1673623 ./trunk/src/java/org/apache/nutch/protocol/RobotRules.java 1673623 ./trunk/src/java/org/apache/nutch/protocol/RobotRulesParser.java 1673623 ./trunk/src/java/org/apache/nutch/protocol/WhiteListRobotRules.java PRE-CREATION ./trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java 1673623 ./trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java 1673623 Diff: https://reviews.apache.org/r/33112/diff/ Testing --- Tested using: RobotRulesParser in the o.a.n.protocol package against my home server. Robots.txt looks like: [chipotle:~/src/nutch] mattmann% more robots.txt User-agent: * Disallow: / [chipotle:~/src/nutch] mattmann% urls file: [chipotle:~/src/nutch] mattmann% more urls http://baron.pagemewhen.com/~chris/foo1.txt http://baron.pagemewhen.com/~chris/ [chipotle:~/src/nutch] mattmann% [chipotle:~/src/nutch] mattmann% java -cp build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.1.1.jar org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler Apr 12, 2015 9:22:50 AM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored allowed:http://baron.pagemewhen.com/~chris/foo1.txt Apr 12, 2015 9:22:50 AM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored allowed:http://baron.pagemewhen.com/~chris/ [chipotle:~/src/nutch] mattmann% Thanks, Chris Mattmann
[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497476#comment-14497476 ] Chris A. Mattmann commented on NUTCH-1927: -- Hi Seb! Comments: bq. Hi Chris, the class WhiteListRobotRules still seems overly complex to me. It should be possible to keep the cache as is and only put a reference to a light-weight singleton RobotRules object (such as the one created by the default constructor of WhiteListRobotRules) in case a host is whitelisted. I don't understand this. Can you please reply with code? For example, WhiteListRobotRules *does* in fact simply store a singleton reference to a RobotRules object, under the premises for which it's constructed (no longer in the Fetcher but really only in the Protocol layers by way of the RobotRulesParser base class). I did add a constructor for constructing blank WhiteListRobotRules in which it does construct a new default RobotRules instance - is that what you are objecting to? Do you want me to remove the constructor that takes no parameters? bq. Also, I do not understand why getCrawlDelay() needs to store the last URL: the Crawl-Delay specified in the robots.txt can be used to override the default delay/interval when a robot/crawler accesses the same host successively; it's a fixed value and does not depend on any previous fetches. Right - all I'm doing is ensuring that when it's first called in Fetcher.java#L722 to get a WhiteListRobotsRule decorator from the CACHE, then in Fetcher.java#L735 (where the URL isn't passed again) it remembers the URL it was constructed with (when it was created in the cache in RobotsRuleParser.java#L179 in my patch). bq. Don't know whether this is a problem: we (almost) everywhere use org.slf4j.Logger and not java.util.logging.Logger. Happy to change this. So, new patch to change to the slf4j Logger; other than that are we OK? Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing --- Key: NUTCH-1927 URL: https://issues.apache.org/jira/browse/NUTCH-1927 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Labels: available, patch Fix For: 1.10 Attachments: NUTCH-1927.Mattmann.041115.patch.txt, NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt Based on discussion on the dev list, to support some valid security-research use cases for Nutch (DDoS, DNS, and other testing), I am going to create a patch that allows a whitelist: {code:xml} <property> <name>robot.rules.whitelist</name> <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value> <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description> </property> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496986#comment-14496986 ] Sebastian Nagel commented on NUTCH-1987: Agreed: it's time to drop the Solr URL parameter because we support alternative indexing back-ends. And it's good to add a default Solr URL to nutch-default.xml and document the property this way. Whether or not to run the indexer is an option. Instead of still relying on a magic positional parameter, wouldn't it be more natural to do this with command-line options: {code:none} # -i index crawled content # -D property=value passed to Nutch commands/tools bin/crawl -i -D solr.server.url=http://.../solr/ urls/ crawl/ 3 # equivalent if solr.server.url is default or defined in nutch-site.xml: bin/crawl -i urls/ crawl/ 3 # does no harm to keep this for backward compatibility: bin/crawl urls/ crawl/ http://.../solr/ 3 {code} This would make the options extensible and allow adding new ones, e.g., to enable/disable link inversion or webgraph creation. Make bin/crawl indexer agnostic --- Key: NUTCH-1987 URL: https://issues.apache.org/jira/browse/NUTCH-1987 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Michael Joyce Fix For: 1.10 The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance, when I want to use the indexer-elastic plugin I still need to call the crawler script with a fake Solr URL, otherwise it will skip the indexing step altogether. {code} bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1 {code} It would be nice to keep configuration for the Solr indexer in the conf files (to mirror the Elasticsearch indexer conf and others) and to make the indexing parameter simply toggle whether indexing does or doesn't occur, instead of also trying to configure the indexer at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
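A minimal sketch of how such option handling could be wired into bin/crawl while keeping the existing positional arguments; this is illustrative only (variable names are assumptions), not a proposed patch:

{code}
INDEX=false
JAVA_PROPERTIES=""
while [ $# -gt 0 ]; do
  case "$1" in
    -i) INDEX=true; shift ;;
    -D) JAVA_PROPERTIES="$JAVA_PROPERTIES -D$2"; shift 2 ;;
    *)  break ;;   # remaining positional arguments: <seedDir> <crawlDir> <numberOfRounds>
  esac
done
SEEDDIR="$1"; CRAWL_PATH="$2"; LIMIT="$3"
# Later: prepend $JAVA_PROPERTIES to every bin/nutch call and run the indexing
# step only when $INDEX is true.
{code}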
[jira] [Commented] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings
[ https://issues.apache.org/jira/browse/NUTCH-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497000#comment-14497000 ] Sebastian Nagel commented on NUTCH-1986: +1, those are the default values you have to start with. Clarify Elastic Search Indexer Plugin Settings -- Key: NUTCH-1986 URL: https://issues.apache.org/jira/browse/NUTCH-1986 Project: Nutch Issue Type: Improvement Components: documentation, indexer, plugin Affects Versions: 1.9 Reporter: Michael Joyce Fix For: 1.10 I was working on getting indexing into Elasticsearch working and realized that the majority of my difficulties were simply me misunderstanding what the config needed. Patch incoming to hopefully clarify what is needed by default, what each option does, and add any helpful defaults. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497027#comment-14497027 ] Michael Joyce commented on NUTCH-1987: -- Hey Sebastian, thanks for the feedback. I agree the positional argument handling is a bit daft. I was aiming for a quick intermediate solution that didn't disrupt too much while getting this functionality in. I'm happy to update this patch with nicer argument handling, or to wait and do a quick follow-on patch if this gets merged. Whatever works for everyone is fine with me. Make bin/crawl indexer agnostic --- Key: NUTCH-1987 URL: https://issues.apache.org/jira/browse/NUTCH-1987 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Michael Joyce Fix For: 1.10 The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance, when I want to use the indexer-elastic plugin I still need to call the crawler script with a fake Solr URL, otherwise it will skip the indexing step altogether. {code} bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1 {code} It would be nice to keep configuration for the Solr indexer in the conf files (to mirror the Elasticsearch indexer conf and others) and to make the indexing parameter simply toggle whether indexing does or doesn't occur, instead of also trying to configure the indexer at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)