[jira] [Created] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-1987:


 Summary: Make bin/crawl indexer agnostic
 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


The crawl script makes it a bit challenging to use an indexer that isn't Solr. 
For instance, when I want to use the indexer-elastic plugin I still need to 
call the crawl script with a fake Solr URL; otherwise it will skip the 
indexing step altogether.

{code}
bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
{code}

It would be nice to keep configuration for the Solr indexer in the conf files 
(to mirror the Elasticsearch indexer conf and others) and to make the indexing 
parameter simply toggle whether indexing does or doesn't occur, instead of also 
trying to configure the indexer at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings

2015-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496366#comment-14496366
 ] 

ASF GitHub Bot commented on NUTCH-1986:
---

GitHub user MJJoyce opened a pull request:

https://github.com/apache/nutch/pull/17

NUTCH-1986 - Update and clarify default Elasticsearch conf values

- Host value is now defaulted to 'localhost'.
- Update port description to make it apparent that 9300 is more likely
  the value you want to use. This should keep people from setting this
  to the potentially more commonly seen 9200 and messing up connections.
- Set the cluster default value to the default Elasticsearch cluster
  name of 'elasticsearch'. Also updated the description to make it
  evident that this value still needs to be changed if you're connecting
  via host/port and your cluster name is something other than the
  default.
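
For reference, the defaults described above would look roughly like the 
following in the plugin configuration. The property names (elastic.host, 
elastic.port, elastic.cluster) are assumed from the indexer-elastic plugin 
and are not quoted from the patch itself:
{code:xml}
<!-- Sketch of the clarified defaults; property names are assumed,
     not copied from the patch. -->
<property>
  <name>elastic.host</name>
  <value>localhost</value>
  <description>The hostname to send documents to.</description>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value>
  <description>The port to connect to. Note this is the Elasticsearch
  transport port (usually 9300), not the HTTP/REST port (usually 9200).
  </description>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>The cluster name to discover. Change this if you connect
  via host/port and your cluster name is not the default.</description>
</property>
{code}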

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MJJoyce/nutch NUTCH-1986

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/17.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17


commit f2e30595a450ae788f0b996899b06193d15fd2d7
Author: Michael Joyce mltjo...@gmail.com
Date:   2015-04-15T15:24:27Z

NUTCH-1986 - Update and clarify default Elasticsearch conf values

- Host value is now defaulted to 'localhost'.
- Update port description to make it apparent that 9300 is more likely
  the value you want to use. This should keep people from setting this
  to the potentially more commonly seen 9200 and messing up connections.
- Set the cluster default value to the default Elasticsearch cluster
  name of 'elasticsearch'. Also updated the description to make it
  evident that this value still needs to be changed if you're connecting
  via host/port and your cluster name is something other than the
  default.




 Clarify Elastic Search Indexer Plugin Settings
 --

 Key: NUTCH-1986
 URL: https://issues.apache.org/jira/browse/NUTCH-1986
 Project: Nutch
  Issue Type: Improvement
  Components: documentation, indexer, plugin
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


 Was working on getting indexing into Elasticsearch working and realized that 
 the majority of my difficulties were simply me misunderstanding what the 
 config needed. Patch incoming to hopefully clarify what is needed by default, 
 what each option does, and add any helpful defaults.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[Nutch Wiki] Update of WhiteListRobots by ChrisMattmann

2015-04-15 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The WhiteListRobots page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/WhiteListRobots

Comment:
- initial page

New page:
Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list 
for robots.txt]] capability that can be used to turn robots.txt parsing on or 
off selectively, on a per-host and/or per-IP basis. Read on to find out how to 
use it.


[jira] [Commented] (NUTCH-1972) Dockerfile for Nutch 1.x

2015-04-15 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496257#comment-14496257
 ] 

Michael Joyce commented on NUTCH-1972:
--

Awesome, thanks for merging [~chrismattmann]!!

 Dockerfile for Nutch 1.x
 

 Key: NUTCH-1972
 URL: https://issues.apache.org/jira/browse/NUTCH-1972
 Project: Nutch
  Issue Type: Improvement
  Components: deployment
Reporter: Michael Joyce
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.10

 Attachments: Joyce-NUTCH-1792-patch.txt


 Hi folks,
 I noticed that there was a Docker file for Nutch 2.x but I didn't see 
 anything for 1.x. I figured I would throw something up real quick. Note that 
 this currently doesn't install Solr. I didn't need it at the time when I was 
 making this, but I'll work on getting it added before too long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings

2015-04-15 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-1986:


 Summary: Clarify Elastic Search Indexer Plugin Settings
 Key: NUTCH-1986
 URL: https://issues.apache.org/jira/browse/NUTCH-1986
 Project: Nutch
  Issue Type: Improvement
  Components: documentation, indexer, plugin
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


Was working on getting indexing into Elasticsearch working and realized that 
the majority of my difficulties were simply me misunderstanding what the config 
needed. Patch incoming to hopefully clarify what is needed by default, what 
each option does, and add any helpful defaults.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426
 ] 

Michael Joyce commented on NUTCH-1987:
--

Hi folks,

I'll have a patch up in a bit for this. I think my current plan to minimize the 
number of changes that I'm shoving into a single patch is to:

* Add solr.server.url to nutch-default and set the value to some sane default 
(http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change 
the call format.
* Update some variable names and echoes in the bin/crawl script so it doesn't 
only mention Solr and confuse people

I envision a call looking something like this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ run_indexer 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution but it minimizes calling 
formats for people with existing setups and only really requires that a single 
configuration value is added/updated. Note, this change obviously requires 
some/many documentation updates. I'm more than happy to help with those as well 
but I wasn't including them in this ticket.

Thoughts?

 Make bin/crawl indexer agnostic
 ---

 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


 The crawl script makes it a bit challenging to use an indexer that isn't 
 Solr. For instance, when I want to use the indexer-elastic plugin I still 
 need to call the crawl script with a fake Solr URL; otherwise it will skip 
 the indexing step altogether.
 {code}
 bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
 {code}
 It would be nice to keep configuration for the Solr indexer in the conf files 
 (to mirror the Elasticsearch indexer conf and others) and to make the 
 indexing parameter simply toggle whether indexing does or doesn't occur, 
 instead of also trying to configure the indexer at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426
 ] 

Michael Joyce edited comment on NUTCH-1987 at 4/15/15 3:54 PM:
---

Hi folks,

I'll have a patch up in a bit for this. I think my current plan to minimize the 
number of changes that I'm shoving into a single patch is to:

* Add solr.server.url to nutch-default and set the value to some sane default 
(http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change 
the call format.
* Update some variable names and echoes in the bin/crawl script so it doesn't 
only mention Solr and confuse people

I envision a call looking something like this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ run_indexer 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution but it minimizes call 
format changes for people with existing setups and only really requires that a 
single configuration value is added/updated if you want to keep using Solr on 
an existing setup. Note, this change obviously requires documentation updates. 
I'm more than happy to help with those as well but I wasn't including them in 
this ticket.

Thoughts?


was (Author: mjoyce):
Hi folks,

I'll have a patch up in a bit for this. I think my current plan to minimize the 
number of changes that I'm shoving into a single patch is to:

* Add solr.server.url to nutch-default and set the value to some sane default 
(http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change 
the call format.
* Update some variable names and echoes in the bin/crawl script so it doesn't 
only mention Solr and confuse people

I envision a call looking something like this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ run_indexer 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution but it minimizes calling 
formats for people with existing setups and only really requires that a single 
configuration value is added/updated. Note, this change obviously requires 
some/many documentation updates. I'm more than happy to help with those as well 
but I wasn't including them in this ticket.

Thoughts?

 Make bin/crawl indexer agnostic
 ---

 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


 The crawl script makes it a bit challenging to use an indexer that isn't 
 Solr. For instance, when I want to use the indexer-elastic plugin I still 
 need to call the crawl script with a fake Solr URL; otherwise it will skip 
 the indexing step altogether.
 {code}
 bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
 {code}
 It would be nice to keep configuration for the Solr indexer in the conf files 
 (to mirror the Elasticsearch indexer conf and others) and to make the 
 indexing parameter simply toggle whether indexing does or doesn't occur, 
 instead of also trying to configure the indexer at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496621#comment-14496621
 ] 

ASF GitHub Bot commented on NUTCH-1987:
---

GitHub user MJJoyce opened a pull request:

https://github.com/apache/nutch/pull/18

NUTCH-1987 - Make bin/crawl indexer agnostic

- Add solr.server.url property to nutch-default and set to value
  consistent with URL used in the Nutch Tutorial.
- Change SOLRURL references to INDEXFLAG for consistency.
- Update all occurrences of crawl usage strings to no longer reference
  solrURL and instead mention an optional string run_indexer.
- Update indexer section to no longer set Solr URL property and remove
  Solr references from prints.
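
For reference, the first bullet above amounts to roughly the following 
nutch-default.xml entry. The default value is an assumption based on the 
earlier comment (http://127.0.0.1:8983/solr/), not quoted from the patch:
{code:xml}
<!-- Sketch of the proposed solr.server.url default; the value is an
     assumption, not copied from the patch. -->
<property>
  <name>solr.server.url</name>
  <value>http://127.0.0.1:8983/solr/</value>
  <description>The Solr server URL used by the indexer-solr plugin when
  indexing is enabled.</description>
</property>
{code}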

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MJJoyce/nutch NUTCH-1987

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/18.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18


commit a39de23453a6f8ea2a9ab2a94872af3305f16021
Author: Michael Joyce mltjo...@gmail.com
Date:   2015-04-15T17:41:36Z

NUTCH-1987 - Make bin/crawl indexer agnostic

- Add solr.server.url property to nutch-default and set to value
  consistent with URL used in the Nutch Tutorial.
- Change SOLRURL references to INDEXFLAG for consistency.
- Update all occurrences of crawl usage strings to no longer reference
  solrURL and instead mention an optional string run_indexer.
- Update indexer section to no longer set Solr URL property and remove
  Solr references from prints.




 Make bin/crawl indexer agnostic
 ---

 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


 The crawl script makes it a bit challenging to use an indexer that isn't 
 Solr. For instance, when I want to use the indexer-elastic plugin I still 
 need to call the crawl script with a fake Solr URL; otherwise it will skip 
 the indexing step altogether.
 {code}
 bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
 {code}
 It would be nice to keep configuration for the Solr indexer in the conf files 
 (to mirror the Elasticsearch indexer conf and others) and to make the 
 indexing parameter simply toggle whether indexing does or doesn't occur, 
 instead of also trying to configure the indexer at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-1988) Make nested output directory dump optional

2015-04-15 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-1988:


 Summary: Make nested output directory dump optional
 Key: NUTCH-1988
 URL: https://issues.apache.org/jira/browse/NUTCH-1988
 Project: Nutch
  Issue Type: Improvement
  Components: dumpers
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


NUTCH-1957 added nested directories to the bin/nutch dump output to help avoid 
naming conflicts in output files. It would be nice to be able to specify that 
you want the older flat directory output as an optional parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional

2015-04-15 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496755#comment-14496755
 ] 

Michael Joyce commented on NUTCH-1988:
--

Hi folks. Here's an example output run of this.

{code}
[mjjoyce@machine local]$ bin/nutch dump -outputDir ./foodir -segment 
../local_elasticsearch_testt/crawl/segments/
[mjjoyce@machine local]$ bin/nutch dump -flatdir -outputDir ./foodir2 -segment 
../local_elasticsearch_testt/crawl/segments/
[mjjoyce@machine local]$ ls -R foodir
foodir:
8f  f8

foodir/8f:
a7

foodir/8f/a7:
8d84f847f7310620a9edc4327bbfc133_.html

foodir/f8:
df

foodir/f8/df:
fec7849283af7a0adc77eddefb242b6e_.html
[mjjoyce@machine local]$ ls -R foodir2
foodir2:
8d84f847f7310620a9edc4327bbfc133_.html  fec7849283af7a0adc77eddefb242b6e_.html
[mjjoyce@machine local]$ 
{code}

 Make nested output directory dump optional
 --

 Key: NUTCH-1988
 URL: https://issues.apache.org/jira/browse/NUTCH-1988
 Project: Nutch
  Issue Type: Improvement
  Components: dumpers
Affects Versions: 1.9
Reporter: Michael Joyce
Priority: Minor
 Fix For: 1.10


 NUTCH-1957 added nested directories to the bin/nutch dump output to help 
 avoid naming conflicts in output files. It would be nice to be able to 
 specify that you want the older flat directory output as an optional 
 parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional

2015-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496751#comment-14496751
 ] 

ASF GitHub Bot commented on NUTCH-1988:
---

GitHub user MJJoyce opened a pull request:

https://github.com/apache/nutch/pull/19

NUTCH-1988 - Add optional flat directory flag to dump command

- Add optional flatdir flag to dump command so that a user can dump
  their crawl data to a flat directory instead of the nested structure
  added in NUTCH-1957.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MJJoyce/nutch NUTCH-1988

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/19.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19


commit 40ca3e576781328b9b5afc22548a93bfd3df75bd
Author: Michael Joyce mltjo...@gmail.com
Date:   2015-04-15T19:19:07Z

NUTCH-1988 - Add optional flat directory flag to dump command

- Add optional flatdir flag to dump command so that a user can dump
  their crawl data to a flat directory instead of the nested structure
  added in NUTCH-1957.




 Make nested output directory dump optional
 --

 Key: NUTCH-1988
 URL: https://issues.apache.org/jira/browse/NUTCH-1988
 Project: Nutch
  Issue Type: Improvement
  Components: dumpers
Affects Versions: 1.9
Reporter: Michael Joyce
Priority: Minor
 Fix For: 1.10


 NUTCH-1957 added nested directories to the bin/nutch dump output to help 
 avoid naming conflicts in output files. It would be nice to be able to 
 specify that you want the older flat directory output as an optional 
 parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1988) Make nested output directory dump optional

2015-04-15 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1988:
-
Priority: Minor  (was: Major)

 Make nested output directory dump optional
 --

 Key: NUTCH-1988
 URL: https://issues.apache.org/jira/browse/NUTCH-1988
 Project: Nutch
  Issue Type: Improvement
  Components: dumpers
Affects Versions: 1.9
Reporter: Michael Joyce
Priority: Minor
 Fix For: 1.10


 NUTCH-1957 added nested directories to the bin/nutch dump output to help 
 avoid naming conflicts in output files. It would be nice to be able to 
 specify that you want the older flat directory output as an optional 
 parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[Nutch Wiki] Update of WhiteListRobots by ChrisMattmann

2015-04-15 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The WhiteListRobots page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=2&rev2=3

  
  Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list 
for robots.txt]] capability that can be used to turn robots.txt parsing on or 
off selectively, on a per-host and/or per-IP basis. Read on to find out how to 
use it.
  
- = List hostnames and/or IP addresses in Nutch conf = 
+ == List hostnames and/or IP addresses in Nutch conf ==
  
  In the Nutch configuration directory (conf/), edit nutch-default.xml (and/or 
nutch-site.xml) and add the following information:
  
@@ -28, +28 @@

  </property>
  }}}
  
- = Testing the configuration =
+ == Testing the configuration ==
  
  Create a sample URLs file to test your whitelist. For example, create a file, 
call it "url" (without the quotes) and store each URL on a line:
  
@@ -44, +44 @@

  Disallow: /
  }}}
  
- = Build the Nutch runtime and execute RobotRulesParser =
+ == Build the Nutch runtime and execute RobotRulesParser ==
  
  Now, build the Nutch runtime, e.g., by running ```ant runtime```.
  From your ```runtime/local/``` directory, run this command:


[Nutch Wiki] Update of WhiteListRobots by ChrisMattmann

2015-04-15 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The WhiteListRobots page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=1&rev2=2

Comment:
- Add example docs

+ = White List for Robots.txt =
+ 
  Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list 
for robots.txt]] capability that can be used to turn robots.txt parsing on or 
off selectively, on a per-host and/or per-IP basis. Read on to find out how to 
use it.
  
+ = List hostnames and/or IP addresses in Nutch conf = 
+ 
+ In the Nutch configuration directory (conf/), edit nutch-default.xml (and/or 
nutch-site.xml) and add the following information:
+ 
+ {{{
+ <property>
+   <name>robot.rules.whitelist</name>
+   <value></value>
+   <description>Comma separated list of hostnames or IP addresses to ignore 
robot rules parsing for.
+   </description>
+ </property>
+ }}}
+ 
+ For example, to whitelist the host baron.pagemewhen.com:
+ 
+ {{{
+ <property>
+   <name>robot.rules.whitelist</name>
+   <value>baron.pagemewhen.com</value>
+   <description>Comma separated list of hostnames or IP addresses to ignore 
robot rules parsing for.
+   </description>
+ </property>
+ }}}
+ 
+ = Testing the configuration =
+ 
+ Create a sample URLs file to test your whitelist. For example, create a file, 
call it "url" (without the quotes) and store each URL on a line:
+ 
+ {{{
+ http://baron.pagemewhen.com/~chris/foo1.txt
+ http://baron.pagemewhen.com/~chris/
+ }}}
+ 
+ Create a sample robots.txt file, e.g., "robots.txt" (without the quotes):
+ 
+ {{{
+ User-agent: *
+ Disallow: /
+ }}}
+ 
+ = Build the Nutch runtime and execute RobotRulesParser =
+ 
+ Now, build the Nutch runtime, e.g., by running ```ant runtime```.
+ From your ```runtime/local/``` directory, run this command:
+ 
+ {{{
+ java -cp 
build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.1.1.jar
 org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler
+ }}}
+ 
+ You should see the following output:
+ 
+ {{{
+ Robots: whitelist: [baron.pagemewhen.com]
+ Apr 14, 2015 11:53:20 PM org.apache.nutch.protocol.WhiteListRobotRules 
isWhiteListed
+ INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules 
parsing will be ignored
+ allowed:  http://baron.pagemewhen.com/~chris/foo1.txt
+ Apr 14, 2015 11:53:20 PM org.apache.nutch.protocol.WhiteListRobotRules 
isWhiteListed
+ INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules 
parsing will be ignored
+ allowed:  http://baron.pagemewhen.com/~chris/
+ }}}
+ 


[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497152#comment-14497152
 ] 

Chris A. Mattmann commented on NUTCH-1927:
--

Added some documentation here:
https://wiki.apache.org/nutch/WhiteListRobots

 Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
 ---

 Key: NUTCH-1927
 URL: https://issues.apache.org/jira/browse/NUTCH-1927
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: available, patch
 Fix For: 1.10

 Attachments: NUTCH-1927.Mattmann.041115.patch.txt, 
 NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt


 Based on discussion on the dev list, to use Nutch for some security research 
 valid use cases (DDoS; DNS and other testing), I am going to create a patch 
 that allows a whitelist:
 {code:xml}
 <property>
   <name>robot.rules.whitelist</name>
   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
   <description>Comma separated list of hostnames or IP addresses to ignore 
 robot rules parsing for.
   </description>
 </property>
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional

2015-04-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497149#comment-14497149
 ] 

Sebastian Nagel commented on NUTCH-1988:


+1
Alternatively, this could be {{-dirlevels n}}, where n=0 would be equivalent 
to {{-flatdir}}.
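
To illustrate, here is a hypothetical invocation if {{-dirlevels}} were 
adopted; the flag does not exist yet ({{-flatdir}} is what the pull request 
proposes):
{code}
# Hypothetical: n=0 would behave like -flatdir
bin/nutch dump -outputDir ./foodir -segment crawl/segments/ -dirlevels 0
# Hypothetical: keep two levels of nesting, as introduced by NUTCH-1957
bin/nutch dump -outputDir ./foodir -segment crawl/segments/ -dirlevels 2
{code}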

 Make nested output directory dump optional
 --

 Key: NUTCH-1988
 URL: https://issues.apache.org/jira/browse/NUTCH-1988
 Project: Nutch
  Issue Type: Improvement
  Components: dumpers
Affects Versions: 1.9
Reporter: Michael Joyce
Priority: Minor
 Fix For: 1.10


 NUTCH-1957 added nested directories to the bin/nutch dump output to help 
 avoid naming conflicts in output files. It would be nice to be able to 
 specify that you want the older flat directory output as an optional 
 parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497251#comment-14497251
 ] 

Sebastian Nagel commented on NUTCH-1927:


Hi Chris, the class WhiteListRobotRules seems to me still overly complex. It 
should be possible to keep the cache as is and only put a reference to a 
light-weight singleton RobotRules object (such as the one created by the 
default constructor of WhiteListRobotRules) in case a host is whitelisted.
Also I do not understand why getCrawlDelay() needs to store the last URL: the 
Crawl-Delay specified in the robots.txt can be used to override the default 
delay/interval when a robot/crawler accesses the same host successively: it's a 
fixed value and does not depend on any previous fetches.
Don't know whether this is a problem: we (almost) everywhere use 
org.slf4j.Logger and not java.util.logging.Logger.
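
A minimal sketch of the singleton suggestion above (method and helper names 
are hypothetical; only RobotRules, WhiteListRobotRules, and the CACHE come 
from the patch under review):
{code:java}
// Hypothetical sketch, not code from the patch: one stateless, shared
// RobotRules instance is reused for every whitelisted host.
private static final RobotRules WHITELISTED_RULES = new WhiteListRobotRules();

protected RobotRules getRobotRulesSet(String host) {
  if (whitelist.contains(host)) {
    // Keep the cache as is; just point whitelisted hosts at the singleton,
    // so no per-host or per-URL state is carried around.
    CACHE.put(host, WHITELISTED_RULES);
    return WHITELISTED_RULES;
  }
  // ... otherwise fetch and parse robots.txt as before (hypothetical helper)
  return fetchAndParseRobotsTxt(host);
}
{code}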

 Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
 ---

 Key: NUTCH-1927
 URL: https://issues.apache.org/jira/browse/NUTCH-1927
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: available, patch
 Fix For: 1.10

 Attachments: NUTCH-1927.Mattmann.041115.patch.txt, 
 NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt


 Based on discussion on the dev list, to use Nutch for some security research 
 valid use cases (DDoS; DNS and other testing), I am going to create a patch 
 that allows a whitelist:
 {code:xml}
 <property>
   <name>robot.rules.whitelist</name>
   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
   <description>Comma separated list of hostnames or IP addresses to ignore 
 robot rules parsing for.
   </description>
 </property>
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 33112: NUTCH-1927: Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-15 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33112/
---

(Updated April 16, 2015, 2:19 a.m.)


Review request for nutch.


Bugs: NUTCH-1927
https://issues.apache.org/jira/browse/NUTCH-1927


Repository: nutch


Description
---

Based on discussion on the dev list, to use Nutch for some security research 
valid use cases (DDoS; DNS and other testing), I am going to create a patch 
that allows a whitelist:
<property>
  <name>robot.rules.whitelist</name>
  <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
  <description>Comma separated list of hostnames or IP addresses to ignore 
robot rules parsing for.
  </description>
</property>


Diffs (updated)
-

  ./trunk/CHANGES.txt 1673623 
  ./trunk/conf/nutch-default.xml 1673623 
  ./trunk/src/java/org/apache/nutch/protocol/RobotRules.java 1673623 
  ./trunk/src/java/org/apache/nutch/protocol/RobotRulesParser.java 1673623 
  ./trunk/src/java/org/apache/nutch/protocol/WhiteListRobotRules.java 
PRE-CREATION 
  
./trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
 1673623 
  
./trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java
 1673623 

Diff: https://reviews.apache.org/r/33112/diff/


Testing
---

Tested using: RobotRulesParser in the o.a.n.protocol package against my home 
server. Robots.txt looks like:

[chipotle:~/src/nutch] mattmann% more robots.txt 
User-agent: *
Disallow: /
[chipotle:~/src/nutch] mattmann% 

urls file:

[chipotle:~/src/nutch] mattmann% more urls 
http://baron.pagemewhen.com/~chris/foo1.txt
http://baron.pagemewhen.com/~chris/
[chipotle:~/src/nutch] mattmann% 

[chipotle:~/src/nutch] mattmann% java -cp 
build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.1.1.jar
 org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler
Apr 12, 2015 9:22:50 AM org.apache.nutch.protocol.WhiteListRobotRules 
isWhiteListed
INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing 
will be ignored
allowed:  http://baron.pagemewhen.com/~chris/foo1.txt
Apr 12, 2015 9:22:50 AM org.apache.nutch.protocol.WhiteListRobotRules 
isWhiteListed
INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing 
will be ignored
allowed:  http://baron.pagemewhen.com/~chris/
[chipotle:~/src/nutch] mattmann%


Thanks,

Chris Mattmann



[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497476#comment-14497476
 ] 

Chris A. Mattmann commented on NUTCH-1927:
--

Hi Seb!

Comments:

bq. Hi Chris, the class WhiteListRobotRules seems to me still overly complex. 
It should be possible to keep the cache as is and only put a reference to a 
light-weight singleton RobotRules object (such as the one created by the 
default constructor of WhiteListRobotRules) in case a host is whitelisted.

I don't understand this. Can you please reply with code? For example, 
WhiteListRobotRules *does* in fact simply store a singleton reference to a 
RobotRules object, under the premises for which it's constructed (no longer in 
the Fetcher but really only in the Protocol Layers by way of the 
RobotRulesParser base class). I did add a constructor for constructing blank 
WhiteListRobotRules in which it does construct a new default RobotRules 
instance - is that what you are objecting to? Do you want me to remove the 
constructor that takes no parameters?

bq. Also I do not understand why getCrawlDelay() needs to store the last URL: 
the Crawl-Delay specified in the robots.txt can be used to override the default 
delay/interval when a robot/crawler accesses the same host successively: it's a 
fixed value and does not depend on any previous fetches.

Right - all I'm doing is ensuring that the WhiteListRobotRules decorator 
fetched from the CACHE at Fetcher.java#L722 remembers the URL it was 
constructed with (when it was created in the cache in 
RobotRulesParser.java#L179 in my patch), so that the later call at 
Fetcher.java#L735 (which doesn't pass the URL again) still has it.

bq. Don't know whether this is a problem: we (almost) everywhere use 
org.slf4j.Logger and not java.util.logging.Logger.

Happy to change this.

So, new patch coming to change to the slf4j Logger; other than that, are we OK?

 Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
 ---

 Key: NUTCH-1927
 URL: https://issues.apache.org/jira/browse/NUTCH-1927
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: available, patch
 Fix For: 1.10

 Attachments: NUTCH-1927.Mattmann.041115.patch.txt, 
 NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt


 Based on discussion on the dev list, to use Nutch for some security research 
 valid use cases (DDoS; DNS and other testing), I am going to create a patch 
 that allows a whitelist:
 {code:xml}
 <property>
   <name>robot.rules.whitelist</name>
   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
   <description>Comma separated list of hostnames or IP addresses to ignore 
 robot rules parsing for.
   </description>
 </property>
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496986#comment-14496986
 ] 

Sebastian Nagel commented on NUTCH-1987:


Agreed: it's time to skip the Solr-URL because we support alternative indexing 
back-ends. And it's good to add a default Solr-URL to nutch-default.xml and 
document the property this way.
Whether or not to run the indexer is then an option. Instead of still relying 
on a magic positional parameter, wouldn't it be more natural to do this with 
command-line options:
{code:none}
# -i  index crawled content
# -D  property=value  passed to Nutch commands/tools
bin/crawl -i -D solr.server.url=http://.../solr/  urls/ crawl/ 3
# equivalent if solr.server.url is default or defined in nutch-site.xml:
bin/crawl -i urls/ crawl/ 3
# does not harm to keep this for back-ward compatibility:
bin/crawl urls/ crawl/ http://.../solr/ 3
{code}
This would make the options extensible and allow adding new ones, e.g., to 
enable/disable link inversion or webgraph creation.

 Make bin/crawl indexer agnostic
 ---

 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


 The crawl script makes it a bit challenging to use an indexer that isn't 
 Solr. For instance, when I want to use the indexer-elastic plugin I still 
 need to call the crawl script with a fake Solr URL; otherwise it will skip 
 the indexing step altogether.
 {code}
 bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
 {code}
 It would be nice to keep configuration for the Solr indexer in the conf files 
 (to mirror the Elasticsearch indexer conf and others) and to make the 
 indexing parameter simply toggle whether indexing does or doesn't occur, 
 instead of also trying to configure the indexer at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings

2015-04-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497000#comment-14497000
 ] 

Sebastian Nagel commented on NUTCH-1986:


+1, those are the default values you have to start with.

 Clarify Elastic Search Indexer Plugin Settings
 --

 Key: NUTCH-1986
 URL: https://issues.apache.org/jira/browse/NUTCH-1986
 Project: Nutch
  Issue Type: Improvement
  Components: documentation, indexer, plugin
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


 Was working on getting indexing into Elasticsearch working and realized that 
 the majority of my difficulties were simply me misunderstanding what the 
 config needed. Patch incoming to hopefully clarify what is needed by default, 
 what each option does, and add any helpful defaults.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497027#comment-14497027
 ] 

Michael Joyce commented on NUTCH-1987:
--

Hey Sebastian, thanks for the feedback.

I agree the positional argument handling is a bit daft. I was aiming more for a 
quick intermediate solution that didn't disrupt too much while getting this 
functionality in there. I'm happy to update this patch with nicer argument 
handling, or to wait and do a quick follow-on patch if this gets merged. 
Whatever works for everyone is fine with me.

 Make bin/crawl indexer agnostic
 ---

 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


 The crawl script makes it a bit challenging to use an indexer that isn't 
 Solr. For instance, when I want to use the indexer-elastic plugin I still 
 need to call the crawl script with a fake Solr URL; otherwise it will skip 
 the indexing step altogether.
 {code}
 bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
 {code}
 It would be nice to keep configuration for the Solr indexer in the conf files 
 (to mirror the Elasticsearch indexer conf and others) and to make the 
 indexing parameter simply toggle whether indexing does or doesn't occur, 
 instead of also trying to configure the indexer at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)