[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-14 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495687#comment-14495687
 ] 

Asitang Mishra commented on NUTCH-1854:
---

Okay, done Lewis.

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, 
> NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.
>   </description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger 
> to the crawl script which would check for crawl_parse for a given segment and 
> then skip parsing if this is present.
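The check proposed above can be sketched roughly as follows. This is an illustrative stand-alone Java sketch (the class and method names are hypothetical, and the real fix would live in the shell crawl script or use the Hadoop FileSystem API): skip the parse step when the segment already contains a crawl_parse directory.

```java
import java.io.File;

// Hypothetical sketch: detect whether a Nutch segment has already been
// parsed by checking for its crawl_parse output directory.
public class SegmentParseCheck {

    // Returns true if the segment directory already contains parse output.
    public static boolean alreadyParsed(File segmentDir) {
        return new File(segmentDir, "crawl_parse").exists();
    }

    public static void main(String[] args) {
        // Example segment path; pass a real one as the first argument.
        File segment = new File(args.length > 0 ? args[0] : "crawl/segments/20150414000000");
        if (alreadyParsed(segment)) {
            System.out.println("Segment already parsed, skipping parse step");
        } else {
            System.out.println("Parsing segment " + segment);
        }
    }
}
```

With such a check in place, the crawl script could log a clear skip message instead of surfacing the raw IOException.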



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-14 Thread Asitang Mishra (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asitang Mishra updated NUTCH-1854:
--
Attachment: NUTCH-1854ver4.patch

Added NUTCH-1854ver4.patch: a reformatted version of NUTCH-1854ver3.patch.

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, 
> NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.
>   </description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger 
> to the crawl script which would check for crawl_parse for a given segment and 
> then skip parsing if this is present.





[Nutch Wiki] Update of "FrontPage" by ChrisMattmann

2015-04-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "FrontPage" page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/FrontPage?action=diff&rev1=296&rev2=297

Comment:
- whitelist tutorial

   * [[NutchMavenSupport|Using Nutch as a Maven dependency]]
   * GoogleSummerOfCode - An area dedicated to GSoC projects and student/mentor 
development/documentation sandbox.
   * AdvancedAjaxInteraction - Discussion centered on enabling Nutch to not 
only fetch, but also interact with JavaScript
+  * WhiteListRobots - User guide for the new host robots.txt whitelist 
capability
  
  == Nutch 2.x ==
   * Nutch2Crawling - A description of the crawling jobs and field to database 
mappings.


Re: Review Request 33112: NUTCH-1927: Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-14 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33112/
---

(Updated April 15, 2015, 3:56 a.m.)


Review request for nutch.


Bugs: NUTCH-1927
https://issues.apache.org/jira/browse/NUTCH-1927


Repository: nutch


Description
---

Based on discussion on the dev list, to support some valid security-research 
use cases for Nutch (DDoS, DNS, and other testing), I am going to create a 
patch that allows a whitelist:

  <property>
    <name>robot.rules.whitelist</name>
    <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
    <description>Comma separated list of hostnames or IP addresses to ignore
    robot rules parsing for.</description>
  </property>

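As a rough illustration of how such a whitelist could be consulted (the class and method names below are hypothetical, not the actual WhiteListRobotRules implementation), the comma-separated property value can be parsed into a set and queried per host:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of whitelist lookup for robot.rules.whitelist.
public class WhitelistSketch {

    private final Set<String> whitelist = new HashSet<>();

    // Parse the comma-separated configuration value into a lookup set.
    public WhitelistSketch(String confValue) {
        for (String entry : confValue.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                whitelist.add(trimmed.toLowerCase());
            }
        }
    }

    // True if the host (or IP) appears in the whitelist; matching is
    // case-insensitive on the hostname.
    public boolean isWhiteListed(String host) {
        return whitelist.contains(host.toLowerCase());
    }

    public static void main(String[] args) {
        WhitelistSketch w =
            new WhitelistSketch("132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov");
        System.out.println(w.isWhiteListed("hostname.apache.org"));
        System.out.println(w.isWhiteListed("example.com"));
    }
}
```

A whitelisted host would then bypass robots.txt parsing entirely and receive allow-all rules, matching the logged behavior shown in the Testing section below.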


Diffs (updated)
-

  ./trunk/CHANGES.txt 1673623 
  ./trunk/conf/nutch-default.xml 1673623 
  ./trunk/src/java/org/apache/nutch/protocol/RobotRules.java 1673623 
  ./trunk/src/java/org/apache/nutch/protocol/RobotRulesParser.java 1673623 
  ./trunk/src/java/org/apache/nutch/protocol/WhiteListRobotRules.java 
PRE-CREATION 
  
./trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
 1673623 
  
./trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java
 1673623 

Diff: https://reviews.apache.org/r/33112/diff/


Testing
---

Tested using: RobotRulesParser in the o.a.n.protocol package against my home 
server. Robots.txt looks like:

[chipotle:~/src/nutch] mattmann% more robots.txt 
User-agent: *
Disallow: /
[chipotle:~/src/nutch] mattmann% 

urls file:

[chipotle:~/src/nutch] mattmann% more urls 
http://baron.pagemewhen.com/~chris/foo1.txt
http://baron.pagemewhen.com/~chris/
[chipotle:~/src/nutch] mattmann% 

[chipotle:~/src/nutch] mattmann% java -cp 
build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.1.1.jar
 org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler
Apr 12, 2015 9:22:50 AM org.apache.nutch.protocol.WhiteListRobotRules 
isWhiteListed
INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing 
will be ignored
allowed:http://baron.pagemewhen.com/~chris/foo1.txt
Apr 12, 2015 9:22:50 AM org.apache.nutch.protocol.WhiteListRobotRules 
isWhiteListed
INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing 
will be ignored
allowed:http://baron.pagemewhen.com/~chris/
[chipotle:~/src/nutch] mattmann%


Thanks,

Chris Mattmann



[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-14 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495620#comment-14495620
 ] 

Chris A. Mattmann commented on NUTCH-1927:
--

Let me know what you guys think. Tested, works fine. I would like to commit in 
the next 24 hours.

> Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
> ---
>
> Key: NUTCH-1927
> URL: https://issues.apache.org/jira/browse/NUTCH-1927
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: available, patch
> Fix For: 1.10
>
> Attachments: NUTCH-1927.Mattmann.041115.patch.txt, 
> NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt
>
>
> Based on discussion on the dev list, to support some valid security-research 
> use cases for Nutch (DDoS, DNS, and other testing), I am going to create a 
> patch that allows a whitelist:
> {code:xml}
> <property>
>   <name>robot.rules.whitelist</name>
>   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
>   <description>Comma separated list of hostnames or IP addresses to ignore
>   robot rules parsing for.</description>
> </property>
> {code}





[jira] [Updated] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1927:
-
Attachment: NUTCH-1927.Mattmann.041415.patch.txt

Updated patch addresses comments from Lewis and Seb.

> Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
> ---
>
> Key: NUTCH-1927
> URL: https://issues.apache.org/jira/browse/NUTCH-1927
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: available, patch
> Fix For: 1.10
>
> Attachments: NUTCH-1927.Mattmann.041115.patch.txt, 
> NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt
>
>
> Based on discussion on the dev list, to support some valid security-research 
> use cases for Nutch (DDoS, DNS, and other testing), I am going to create a 
> patch that allows a whitelist:
> {code:xml}
> <property>
>   <name>robot.rules.whitelist</name>
>   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
>   <description>Comma separated list of hostnames or IP addresses to ignore
>   robot rules parsing for.</description>
> </property>
> {code}





[jira] [Updated] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-14 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1985:
--
Attachment: NUTCH-1985.patch

> Adding a main() method to the MimeTypeIndexingFilter
> 
>
> Key: NUTCH-1985
> URL: https://issues.apache.org/jira/browse/NUTCH-1985
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, metadata, plugin
>Affects Versions: 1.10
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: features, patch, test
> Fix For: 1.10
>
> Attachments: NUTCH-1985.patch
>
>
> This makes it very easy to test different rules files and check the 
> expressions used to filter content based on the detected MIME type. Until 
> now the only way to check this was to run test crawls and inspect the stored 
> data in Solr/Elasticsearch. 
> This allows invoking the filter using the {{bin/nutch plugin}} command, 
> something like:
> {{bin/nutch plugin mimetype-filter 
> org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}
> Two options are accepted: {{-h, --help}} to show the help, and {{-rules}} to 
> specify a rules file; this makes it easy to experiment with different rules 
> files until you get the desired behavior. 
> After invoking the class, enter one valid MIME type per line; the output is 
> the same MIME type with a {{+}} or {{-}} sign at the beginning, indicating 
> whether the given MIME type is allowed or denied, respectively.





[jira] [Created] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-14 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-1985:
-

 Summary: Adding a main() method to the MimeTypeIndexingFilter
 Key: NUTCH-1985
 URL: https://issues.apache.org/jira/browse/NUTCH-1985
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, metadata, plugin
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
 Fix For: 1.10


This makes it very easy to test different rules files and check the 
expressions used to filter content based on the detected MIME type. Until 
now the only way to check this was to run test crawls and inspect the stored 
data in Solr/Elasticsearch. 

This allows invoking the filter using the {{bin/nutch plugin}} command, 
something like:

{{bin/nutch plugin mimetype-filter 
org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}

Two options are accepted: {{-h, --help}} to show the help, and {{-rules}} to 
specify a rules file; this makes it easy to experiment with different rules 
files until you get the desired behavior. 

After invoking the class, enter one valid MIME type per line; the output is 
the same MIME type with a {{+}} or {{-}} sign at the beginning, indicating 
whether the given MIME type is allowed or denied, respectively.
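The interactive loop described above can be sketched roughly as follows. This is an illustrative stand-alone example (the class name and the single deny rule are hypothetical, not the actual MimeTypeIndexingFilter code, which reads its rules from a file):

```java
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical sketch of the MIME-type check loop: read one MIME type per
// line and echo it back prefixed with + (allowed) or - (denied).
public class MimeFilterRepl {

    // Illustrative rule standing in for a rules file: deny all
    // application/* types, allow everything else.
    private static final Pattern DENY = Pattern.compile("^application/.*");

    // Prefix the MIME type with "-" if denied, "+" if allowed.
    public static String check(String mimeType) {
        return (DENY.matcher(mimeType).matches() ? "-" : "+") + mimeType;
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        while (in.hasNextLine()) {
            System.out.println(check(in.nextLine().trim()));
        }
    }
}
```

Under this sketch's rule, entering text/html would print +text/html and entering application/pdf would print -application/pdf, mirroring the output format the issue describes.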





[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-14 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494778#comment-14494778
 ] 

Lewis John McGibbney commented on NUTCH-1854:
-

[~asitang] can you please use the following template to format your code.
http://svn.apache.org/repos/asf/nutch/branches/2.x/eclipse-codeformat.xml
These patches are grand.

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, 
> NUTCH-1854ver3.patch
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.
>   </description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger 
> to the crawl script which would check for crawl_parse for a given segment and 
> then skip parsing if this is present.





[Nutch Wiki] Update of "SumanSaurabh/GSoC2015Nutch" by SumanSaurabh

2015-04-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "SumanSaurabh/GSoC2015Nutch" page has been changed by SumanSaurabh:
https://wiki.apache.org/nutch/SumanSaurabh/GSoC2015Nutch?action=diff&rev1=3&rev2=4

   . {{{
  
  
+ }}}
+ 
+  . Dependency ''hadoop-test-1.2.0.jar'' needs to be removed.
+  . {{{
+ 
  }}}
   .
   . '''1.3) Experimental setup with of Nutch with Hadoop and their result:'''


[Nutch Wiki] Update of "SumanSaurabh/GSoC2015Nutch" by SumanSaurabh

2015-04-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "SumanSaurabh/GSoC2015Nutch" page has been changed by SumanSaurabh:
https://wiki.apache.org/nutch/SumanSaurabh/GSoC2015Nutch?action=diff&rev1=2&rev2=3

   .
   . '''1.2) Workspace Setup:'''
  
-  . Nutch  workspace it built on Ant+Ivy. I have experience with Ant build  
framework, so workspace setup would be relatively easier. I have forked  the 
Nutch codebase to my Git '''[2]''' and after successful completion I will  
provide the patch. Meanwhile I will also try to resolve issues mentioned  in 
Nutch Jira.
+  . Nutch workspace is built on Ant+Ivy. I have experience with the Ant build 
framework, so workspace setup should be relatively easy. I have forked the 
Nutch codebase to my Git '''[2]''' and after successful completion I will 
provide the patch. 
+  Nutch dependency on Hadoop: ''hadoop-core.1.x.jar'' is changed in ''Hadoop 
2.x''
+  . {{{
+ 
+
+
+
+
+
+
+ 
+ }}}
  
+  . Following dependency needs to be added for Hadoop 2.6 support instead of 
above.
+  . {{{
+ 
+ 
+ }}}
   .
   . '''1.3) Experimental setup with of Nutch with Hadoop and their result:'''
  
   . I have been using Hadoop 2.3 for my !MapReduce application, and while 
trying to set up Nutch 1.9 with Hadoop 2.3 I ran into the following error:
- 
+  . {{{
-  . Injector:
+ Injector:
-  . java.lang.!UnsupportedOperationException: Not implemented by the 
!DistributedFileSystem !FileSystemimplementation
+   java.lang.!UnsupportedOperationException: Not implemented by the 
!DistributedFileSystem !FileSystem implementation
-   . at org.apache.hadoop.fs.!FileSystem.getScheme(!FileSystem.java:214)
+   at org.apache.hadoop.fs.!FileSystem.getScheme(!FileSystem.java:214)
- 
-   . at org.apache.hadoop.fs.!FileSystem.loadFileSystems(!FileSystem.java:2365)
+   at org.apache.hadoop.fs.!FileSystem.loadFileSystems(!FileSystem.java:2365)
+   at 
org.apache.hadoop.fs.!FileSystem.getFileSystemClass(!FileSystem.java:2375) 
+   at org.apache.hadoop.fs.!FileSystem.createFileSystem(!FileSystem.java:2392)
- 
-   . at 
org.apache.hadoop.fs.!FileSystem.getFileSystemClass(!FileSystem.java:2375) at 
org.apache.hadoop.fs.!FileSystem.createFileSystem(!FileSystem.java:2392)
- 
-   . at org.apache.hadoop.fs.!FileSystem.access$200(!FileSystem.java:89)
+   at org.apache.hadoop.fs.!FileSystem.access$200(!FileSystem.java:89)
+   at 
org.apache.hadoop.fs.!FileSystem$Cache.getInternal(!FileSystem.java:2431) 
+   at org.apache.hadoop.fs.!FileSystem$Cache.get(!FileSystem.java:2413)
- 
-   . at 
org.apache.hadoop.fs.!FileSystem$Cache.getInternal(!FileSystem.java:2431) at 
org.apache.hadoop.fs.!FileSystem$Cache.get(!FileSystem.java:2413)
- 
-   . at org.apache.hadoop.fs.!FileSystem.get(!FileSystem.java:368)
+   at org.apache.hadoop.fs.!FileSystem.get(!FileSystem.java:368)
- 
-   . at org.apache.hadoop.fs.!FileSystem.get(!FileSystem.java:167)
+   at org.apache.hadoop.fs.!FileSystem.get(!FileSystem.java:167)
- 
-   . at org.apache.nutch.crawl.Injector.inject(Injector.java:297)
+   at org.apache.nutch.crawl.Injector.inject(Injector.java:297)
-   . at org.apache.nutch.crawl.Injector.run(Injector.java:380)
+   at org.apache.nutch.crawl.Injector.run(Injector.java:380)
-   . at org.apache.hadoop.util.!ToolRunner.run(!ToolRunner.java:70)
+   at org.apache.hadoop.util.!ToolRunner.run(!ToolRunner.java:70)
- 
-   . at org.apache.nutch.crawl.Injector.main(Injector.java:370) .
+   at org.apache.nutch.crawl.Injector.main(Injector.java:370) .
- 
+ }}}
   . May be I will start looking at this point onwards?
  
  == Phase 2 (Coding): ==
-  . 2.1) Migrating from Hadoop 1.x to Hadoop 2.x
+  . '''2.1) Migrating from Hadoop 1.x to Hadoop 2.x'''
. '''Binary Compatibility:'''
  
. First, we ensure binary compatibility to the applications that use old 
'''mapred''' APIs. This means that applications which were built against MRv1 
'''mapred''' APIs can run directly on YARN without recompilation, merely by 
pointing them to an Apache Hadoop 2.x cluster via configuration.
@@ -139, +149 @@

  
. '''Source Compatibility:'''
  
-   . One cannot ensure complete binary compatibility with the applications 
that use '''mapreduce''' APIs, as these APIs have evolved a lot since MRv1. 
However, we ensure source compatibility for '''mapreduce''' APIs that break 
binary compatibility. In other words, users should recompile their applications 
that use '''mapreduce''' APIs against MRv2 jars. One notable binary 
incompatibility break is '''Counter''' in
+   . One cannot ensure complete binary compatibility with the applications 
that use '''mapreduce''' APIs, as these APIs have evolved a lot since MRv1. In 
other words, users should recompile their applications that use '''mapreduce''' 
APIs against MRv2 jars. One notable binary incompatibility break is 
'''Counter''' in
  
+   .{{{
+ Package: crawl
-   . <>
- 
-   . Package: '''crawl '''
- 
-   . <>

[jira] [Issue Comment Deleted] (NUTCH-1946) Upgrade to Gora 0.6.1

2015-04-14 Thread Jeroen Vlek (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeroen Vlek updated NUTCH-1946:
---
Comment: was deleted

(was: Sorry, I'm a bit confused: Is any more action required on my part for the 
pull request to be accepted/rejected?)

> Upgrade to Gora 0.6.1
> -
>
> Key: NUTCH-1946
> URL: https://issues.apache.org/jira/browse/NUTCH-1946
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, 
> NUTCH-1946v2.patch, NUTCH-1946v3.patch
>
>
> Apache Gora 0.6.1 was released recently.
> We should upgrade before pushing Nutch 2.3.1, as it will come in very handy 
> for the new Docker containers.


