[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495687#comment-14495687 ]

Asitang Mishra commented on NUTCH-1854:
---------------------------------------

okay done Lewis..

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>         Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
>
> If you run ./bin/crawl with a parsing fetcher, e.g.
>
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.</description>
> </property>
>
> we get a horrible message as follows:
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger
> to the crawl script which would check for crawl_parse for a given segment and
> then skip parsing if this is present.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
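The trigger the issue proposes — skip the parse step when the segment already contains parse output — can be sketched in a few lines of shell. This is a hypothetical illustration, not the actual bin/crawl patch: the `maybe_parse` function name and its messages are invented for the example, and the real script would invoke `bin/nutch parse` where the comment indicates.

```shell
# Hypothetical sketch (not the actual bin/crawl patch): skip the parse
# step when the segment already holds a crawl_parse directory, i.e. a
# parsing fetcher (fetcher.parse=true) already produced parse output.
maybe_parse() {
  segment="$1"
  if [ -d "$segment/crawl_parse" ]; then
    echo "skipping parse: $segment already parsed"
  else
    echo "parsing: $segment"
    # the real script would run: bin/nutch parse "$segment"
  fi
}

# demo against throwaway segment directories
mkdir -p /tmp/demo_seg_parsed/crawl_parse /tmp/demo_seg_fresh
maybe_parse /tmp/demo_seg_parsed   # prints: skipping parse: /tmp/demo_seg_parsed already parsed
maybe_parse /tmp/demo_seg_fresh    # prints: parsing: /tmp/demo_seg_fresh
```

A check like this would make the script idempotent per segment instead of dying with "Segment already parsed!".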
[jira] [Updated] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asitang Mishra updated NUTCH-1854:
----------------------------------

    Attachment: NUTCH-1854ver4.patch

Added NUTCH-1854ver4.patch: formatted the NUTCH-1854ver3.patch

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>         Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
[Nutch Wiki] Update of "FrontPage" by ChrisMattmann
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FrontPage" page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/FrontPage?action=diff&rev1=296&rev2=297

Comment:
whitelist tutorial

   * [[NutchMavenSupport|Using Nutch as a Maven dependency]]
   * GoogleSummerOfCode - An area dedicated to GSoC projects and student/mentor development/documentation sandbox.
   * AdvancedAjaxInteraction - Discussion centered on enabling Nutch to not only fetch, but also interact with JavaScript
+  * WhiteListRobots - User guide for the new host robots.txt whitelist capability

== Nutch 2.x ==
   * Nutch2Crawling - A description of the crawling jobs and field to database mappings.
Re: Review Request 33112: NUTCH-1927: Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33112/
-----------------------------------------------------------

(Updated April 15, 2015, 3:56 a.m.)

Review request for nutch.

Bugs: NUTCH-1927
    https://issues.apache.org/jira/browse/NUTCH-1927

Repository: nutch

Description
-----------

Based on discussion on the dev list, to use Nutch for some valid security research use cases (DDoS; DNS and other testing), I am going to create a patch that allows a whitelist:

<property>
  <name>robot.rules.whitelist</name>
  <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for.</description>
</property>

Diffs (updated)
---------------

  ./trunk/CHANGES.txt 1673623
  ./trunk/conf/nutch-default.xml 1673623
  ./trunk/src/java/org/apache/nutch/protocol/RobotRules.java 1673623
  ./trunk/src/java/org/apache/nutch/protocol/RobotRulesParser.java 1673623
  ./trunk/src/java/org/apache/nutch/protocol/WhiteListRobotRules.java PRE-CREATION
  ./trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java 1673623
  ./trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java 1673623

Diff: https://reviews.apache.org/r/33112/diff/

Testing
-------

Tested using RobotRulesParser in the o.a.n.protocol package against my home server.
Robots.txt looks like:

[chipotle:~/src/nutch] mattmann% more robots.txt
User-agent: *
Disallow: /

urls file:

[chipotle:~/src/nutch] mattmann% more urls
http://baron.pagemewhen.com/~chris/foo1.txt
http://baron.pagemewhen.com/~chris/

[chipotle:~/src/nutch] mattmann% java -cp build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.1.1.jar org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler
Apr 12, 2015 9:22:50 AM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed
INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored
allowed:http://baron.pagemewhen.com/~chris/foo1.txt
Apr 12, 2015 9:22:50 AM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed
INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored
allowed:http://baron.pagemewhen.com/~chris/
[chipotle:~/src/nutch] mattmann%

Thanks,

Chris Mattmann
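The whitelist lookup exercised in the transcript above can be sketched as a plain-shell membership test over the comma-separated `robot.rules.whitelist` value. This is an illustrative sketch only — `is_whitelisted` and its messages are invented here and are not the `WhiteListRobotRules` implementation; the property value is the example from the review description.

```shell
# Illustrative sketch only (not WhiteListRobotRules): exact-match membership
# test over the comma-separated robot.rules.whitelist property value.
WHITELIST="132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov"

is_whitelisted() {
  host="$1"
  # wrap both sides in commas so matches are exact entries, not substrings
  case ",$WHITELIST," in
    *",$host,"*) echo "whitelisted: $host (robots.txt parsing skipped)" ;;
    *)           echo "not whitelisted: $host (robots.txt rules apply)" ;;
  esac
}

is_whitelisted hostname.apache.org   # prints: whitelisted: hostname.apache.org (robots.txt parsing skipped)
is_whitelisted example.com           # prints: not whitelisted: example.com (robots.txt rules apply)
```

The comma-wrapping trick matters: without it, a whitelist entry `apache.org` would also match `notapache.org`.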
[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495620#comment-14495620 ]

Chris A. Mattmann commented on NUTCH-1927:
------------------------------------------

let me know what you guys think. Tested, works fine. Would like to commit in next 24 hours.

> Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1927
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1927
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>              Labels: available, patch
>             Fix For: 1.10
>
>         Attachments: NUTCH-1927.Mattmann.041115.patch.txt, NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt
>
> Based on discussion on the dev list, to use Nutch for some valid security research use cases (DDoS; DNS and other testing), I am going to create a patch that allows a whitelist:
> {code:xml}
> <property>
>   <name>robot.rules.whitelist</name>
>   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
>   <description>Comma separated list of hostnames or IP addresses to ignore
>   robot rules parsing for.</description>
> </property>
> {code}
[jira] [Updated] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-1927:
-------------------------------------

    Attachment: NUTCH-1927.Mattmann.041415.patch.txt

- updated patch addresses comments from Lewis and Seb.

> Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1927
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1927
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>              Labels: available, patch
>             Fix For: 1.10
>
>         Attachments: NUTCH-1927.Mattmann.041115.patch.txt, NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt
[jira] [Updated] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter
[ https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1985:
--------------------------------------------------

    Attachment: NUTCH-1985.patch

> Adding a main() method to the MimeTypeIndexingFilter
> ----------------------------------------------------
>
>                 Key: NUTCH-1985
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1985
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, metadata, plugin
>    Affects Versions: 1.10
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>              Labels: features, patch, test
>             Fix For: 1.10
>
>         Attachments: NUTCH-1985.patch
>
> This makes it very easy to test different rules files and check the expressions used to filter content based on the detected MIME type. Until now the only way to check this was to run test crawls and inspect the stored data in Solr/Elasticsearch.
> This allows calling the class using the {{bin/nutch plugin}} command, something like:
> {{bin/nutch plugin mimetype-filter org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}
> Two options are accepted: {{-h, --help}} for showing the help, and {{-rules}} for specifying a rules file to be used. This makes it easy to play with different rules files until you get the desired behavior.
> After invoking the class, a valid MIME type must be entered on each line, and the output will be the same MIME type prefixed with a {{+}} or {{-}} sign, indicating whether the given MIME type is allowed or denied respectively.
[jira] [Created] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter
Jorge Luis Betancourt Gonzalez created NUTCH-1985:
--------------------------------------------------

             Summary: Adding a main() method to the MimeTypeIndexingFilter
                 Key: NUTCH-1985
                 URL: https://issues.apache.org/jira/browse/NUTCH-1985
             Project: Nutch
          Issue Type: Improvement
          Components: indexer, metadata, plugin
    Affects Versions: 1.10
            Reporter: Jorge Luis Betancourt Gonzalez
            Priority: Minor
             Fix For: 1.10

This makes it very easy to test different rules files and check the expressions used to filter content based on the detected MIME type. Until now the only way to check this was to run test crawls and inspect the stored data in Solr/Elasticsearch.

This allows calling the class using the {{bin/nutch plugin}} command, something like:

{{bin/nutch plugin mimetype-filter org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}

Two options are accepted: {{-h, --help}} for showing the help, and {{-rules}} for specifying a rules file to be used. This makes it easy to play with different rules files until you get the desired behavior.

After invoking the class, a valid MIME type must be entered on each line, and the output will be the same MIME type prefixed with a {{+}} or {{-}} sign, indicating whether the given MIME type is allowed or denied respectively.
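The +/- output contract NUTCH-1985 describes can be made concrete with a few lines of shell. This is an illustration of the output format only, not the plugin itself: the `ALLOWED` list and the `classify_mime` function are invented stand-ins for a real rules file, and the actual check is performed by the plugin via `bin/nutch plugin mimetype-filter org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -rules <rules-file>`.

```shell
# Illustration of the described output format only (not the plugin): echo
# each MIME type back with a leading + (allowed) or - (denied). ALLOWED is
# a made-up stand-in for a real rules file.
ALLOWED="text/html text/plain application/pdf"

classify_mime() {
  mime="$1"
  case " $ALLOWED " in
    *" $mime "*) echo "+$mime" ;;
    *)           echo "-$mime" ;;
  esac
}

classify_mime text/html        # prints: +text/html
classify_mime application/zip  # prints: -application/zip
```

Being able to exercise this interactively is exactly what removes the need for a full test crawl plus a Solr/Elasticsearch inspection per rules-file tweak.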
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494778#comment-14494778 ]

Lewis John McGibbney commented on NUTCH-1854:
---------------------------------------------

[~asitang] can you please use the following template to format your code:
http://svn.apache.org/repos/asf/nutch/branches/2.x/eclipse-codeformat.xml
These patches are grand.

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>         Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, NUTCH-1854ver3.patch
[Nutch Wiki] Update of "SumanSaurabh/GSoC2015Nutch" by SumanSaurabh
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "SumanSaurabh/GSoC2015Nutch" page has been changed by SumanSaurabh:
https://wiki.apache.org/nutch/SumanSaurabh/GSoC2015Nutch?action=diff&rev1=3&rev2=4

   . {{{
+  }}}
+
+  . Dependency ''hadoop-test-1.2.0.jar'' needs to be removed.
+  . {{{
+  }}}
   .
   . '''1.3) Experimental setup of Nutch with Hadoop and their results:'''
[Nutch Wiki] Update of "SumanSaurabh/GSoC2015Nutch" by SumanSaurabh
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "SumanSaurabh/GSoC2015Nutch" page has been changed by SumanSaurabh:
https://wiki.apache.org/nutch/SumanSaurabh/GSoC2015Nutch?action=diff&rev1=2&rev2=3

  . '''1.2) Workspace Setup:'''
  . The Nutch workspace is built on Ant+Ivy. I have experience with the Ant build framework, so workspace setup should be relatively easy. I have forked the Nutch codebase to my Git '''[2]''' and after successful completion I will provide the patch. Meanwhile I will also try to resolve issues mentioned in the Nutch Jira.
  . Nutch's dependency on Hadoop (''hadoop-core.1.x.jar'') is changed in ''Hadoop 2.x'':
  . {{{
  }}}
  . The following dependency needs to be added instead of the above for Hadoop 2.6 support:
  . {{{
  }}}
  .
  . '''1.3) Experimental setup of Nutch with Hadoop and their results:'''
  . I have been using Hadoop 2.3 for my !MapReduce application, and while trying to set up Nutch 1.9 with Hadoop 2.3 I ran into the following error:
  . {{{
  Injector:
  java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
      at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
      at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
      at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
      at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
      at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
      at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
      at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
      at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
      at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
      at org.apache.nutch.crawl.Injector.inject(Injector.java:297)
      at org.apache.nutch.crawl.Injector.run(Injector.java:380)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
      at org.apache.nutch.crawl.Injector.main(Injector.java:370)
  }}}
  . Maybe I will start looking at this point onwards?

== Phase 2 (Coding): ==
  . '''2.1) Migrating from Hadoop 1.x to Hadoop 2.x'''
  . '''Binary Compatibility:'''
  . First, we ensure binary compatibility for applications that use the old '''mapred''' APIs. This means that applications built against MRv1 '''mapred''' APIs can run directly on YARN without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster via configuration.
  . '''Source Compatibility:'''
  . One cannot ensure complete binary compatibility for applications that use the '''mapreduce''' APIs, as these APIs have evolved a lot since MRv1. In other words, users should recompile applications that use '''mapreduce''' APIs against MRv2 jars. One notable binary incompatibility break is '''Counter''' in
  . {{{
  Package: crawl
  }}}
[jira] [Issue Comment Deleted] (NUTCH-1946) Upgrade to Gora 0.6.1
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeroen Vlek updated NUTCH-1946:
-------------------------------

    Comment: was deleted

(was: Sorry, I'm a bit confused: Is any more action required on my part for the pull request to be accepted/rejected?)

> Upgrade to Gora 0.6.1
> ---------------------
>
>                 Key: NUTCH-1946
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1946
>             Project: Nutch
>          Issue Type: Improvement
>          Components: storage
>    Affects Versions: 2.3.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 2.3.1
>
>         Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, NUTCH-1946v2.patch, NUTCH-1946v3.patch
>
> Apache Gora was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy for the new Docker containers.