[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710052#comment-14710052 ]

Lewis John McGibbney commented on NUTCH-2049:
---------------------------------------------

Thanks for committing [~jnioche]

> Upgrade Trunk to Hadoop > 2.4 stable
> ------------------------------------
>
> Key: NUTCH-2049
> URL: https://issues.apache.org/jira/browse/NUTCH-2049
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2049.patch, NUTCH-2049v2.patch, NUTCH-2049v3.patch
>
>
> Convo here - http://www.mail-archive.com/dev%40nutch.apache.org/msg18225.html
> I am +1 for taking trunk (or a branch of trunk) to explicit dependency on > Hadoop 2.6.
> We can run our tests, we can validate, we can fix.
> I will be doing validation on 2.X in parallel as this is what I use on my own projects.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709912#comment-14709912 ]

Hudson commented on NUTCH-2049:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3261 (See [https://builds.apache.org/job/Nutch-trunk/3261/])
NUTCH-2049 Upgrade Trunk to Hadoop > 2.4 stable (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1697466)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/bin/crawl
* /nutch/trunk/src/bin/nutch
* /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/DeduplicationJob.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/plugin/Extension.java
* /nutch/trunk/src/java/org/apache/nutch/plugin/PluginManifestParser.java
* /nutch/trunk/src/java/org/apache/nutch/segment/ContentAsTextInputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
* /nutch/trunk/src/java/org/apache/nutch/service/JobManager.java
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java
* /nutch/trunk/src/java/org/apache/nutch/tools/FileDumper.java
* /nutch/trunk/src/java/org/apache/nutch/util/HadoopFSUtil.java
* /nutch/trunk/src/java/org/apache/nutch/util/LockUtil.java
* /nutch/trunk/src/java/org/apache/nutch/util/domain/DomainStatistics.java
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesClassifier.java
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java
* /nutch/trunk/src/test/crawl-tests.xml
* /nutch/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbFilter.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/TestGenerator.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/TestInjector.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java
* /nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java
* /nutch/trunk/src/test/org/apache/nutch/net/TestURLNormalizers.java
* /nutch/trunk/src/test/org/apache/nutch/segment/TestSegmentMerger.java
* /nutch/trunk/src/test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
* /nutch/trunk/src/test/org/apache/nutch/tools/proxy/SegmentHandler.java
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706884#comment-14706884 ]

Chris A. Mattmann commented on NUTCH-2049:
------------------------------------------

+1 to commit this. Great work team.
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706789#comment-14706789 ]

Sebastian Nagel commented on NUTCH-2049:
----------------------------------------

+1 to commit. As said, looking into the performance of the unit tests can be done later; some details below.

{noformat}
% time ant clean runtime test

(before)
Total time: 5 minutes 34 seconds
real    5m35.133s
user    7m30.968s
sys     0m21.528s

(after patching)
Total time: 6 minutes 39 seconds
real    6m39.794s
user    9m31.444s
sys     0m26.780s
{noformat}

These tests show significant differences, `-' before, `+' after patching:
{noformat}
 [junit] Running org.apache.nutch.crawl.TestGenerator
-[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 32.846 sec
+[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 36.279 sec
...
 [junit] Running org.apache.nutch.fetcher.TestFetcher
-[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.068 sec
+[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.273 sec
...
 [junit] Running org.apache.nutch.parse.TestParserFactory
-[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.783 sec
+[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.038 sec
...
 [junit] Running org.apache.nutch.segment.TestSegmentMerger
-[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 75.408 sec
+[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 91.652 sec
 [junit] Running org.apache.nutch.segment.TestSegmentMergerCrawlDatums
-[junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 69.821 sec
+[junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 84.443 sec
{noformat}
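To put the timings quoted above on a common scale, the relative slowdown per test can be computed directly from the before/after figures. The following is a minimal sketch, not part of the patch: the class `TestTimingDelta` and its methods are hypothetical names chosen for illustration, and the numbers are copied from the comment above. It only does arithmetic on the reported values and says nothing about the cause of the slowdown.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Nutch) that compares the timings quoted in the thread.
public class TestTimingDelta {

    /** Relative slowdown in percent; a positive value means the "after" run was slower. */
    static double slowdownPercent(double beforeSec, double afterSec) {
        if (beforeSec <= 0) {
            throw new IllegalArgumentException("beforeSec must be positive");
        }
        return (afterSec - beforeSec) / beforeSec * 100.0;
    }

    /** Converts a value in the shell `time` format, e.g. "5m35.133s", into seconds. */
    static double parseWallClock(String value) {
        int m = value.indexOf('m');
        double minutes = Double.parseDouble(value.substring(0, m));
        double seconds = Double.parseDouble(value.substring(m + 1, value.length() - 1));
        return minutes * 60.0 + seconds;
    }

    public static void main(String[] args) {
        // {before, after} in seconds, copied from the junit output in the comment.
        Map<String, double[]> tests = new LinkedHashMap<>();
        tests.put("TestGenerator", new double[] {32.846, 36.279});
        tests.put("TestFetcher", new double[] {12.068, 14.273});
        tests.put("TestParserFactory", new double[] {2.783, 4.038});
        tests.put("TestSegmentMerger", new double[] {75.408, 91.652});
        tests.put("TestSegmentMergerCrawlDatums", new double[] {69.821, 84.443});

        for (Map.Entry<String, double[]> e : tests.entrySet()) {
            double[] t = e.getValue();
            System.out.printf(Locale.ROOT, "%-30s %+6.1f%%%n",
                    e.getKey(), slowdownPercent(t[0], t[1]));
        }

        // Overall wall-clock time of `ant clean runtime test` ("real" lines).
        double before = parseWallClock("5m35.133s");
        double after = parseWallClock("6m39.794s");
        System.out.printf(Locale.ROOT, "%-30s %+6.1f%%%n",
                "total (real)", slowdownPercent(before, after));
    }
}
```

Percentages make the short tests comparable with the long ones; the absolute differences remain dominated by the two segment-merger tests.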
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706402#comment-14706402 ]

Julien Nioche commented on NUTCH-2049:
--------------------------------------

Fantastic work [~lewismc]! I think this is one of the most important changes to Nutch in recent years. Well done.

Compilation and tests all fine, crawl in local mode OK.

+1 to commit
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705904#comment-14705904 ]

Lewis John McGibbney commented on NUTCH-2049:
---------------------------------------------

Hi [~wastl-nagel], thanks for the comment.
bq. Interestingly, the unit tests seem to take longer (5 -> 6 min. on my laptop). That's not a blocker, but would be good to know why?

Excellent observation... I need to be honest and say that I didn't even notice. Maybe some [tracing|https://issues.apache.org/jira/browse/NUTCH-2005] would be a good idea ;)
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705895#comment-14705895 ]

Sebastian Nagel commented on NUTCH-2049:
----------------------------------------

Great job, Lewis!

No time to test with real crawls now. Interestingly, the unit tests seem to take longer (5 -> 6 min. on my laptop). That's not a blocker, but would be good to know why?
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701707#comment-14701707 ]

Michael Joyce commented on NUTCH-2049:
--------------------------------------

Great stuff Lewis. Builds and runs cleanly locally for me. I also scoped a test that was run on EMR with 2.4.0 and all looks good.
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701514#comment-14701514 ]

Asitang Mishra commented on NUTCH-2049:
---------------------------------------

Ack!!
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701497#comment-14701497 ]

Chris A. Mattmann commented on NUTCH-2049:
------------------------------------------

Asitang, if you recall, we discussed simply figuring out the Hadoop cluster's server name - there is nothing stopping us from running a Hadoop job inside a Hadoop job. I would suggest you try going down that path to sense the Hadoop TaskTracker host (via Context or other properties) and to pass that down to Mahout. Also, I think a good improvement would be to separate out the training tool too. Can you please work on both?
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701498#comment-14701498 ]

Asitang Mishra commented on NUTCH-2049:
---------------------------------------

Hi Lewis,

I had some issues applying your patch the last time. I will give it another try with the latest one and report whether the plugin works fine with it.

Cheers
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701495#comment-14701495 ]

Asitang Mishra commented on NUTCH-2049:
---------------------------------------

Hi Chris,

The Naive Bayes plugin has a Hadoop job of its own, so it only works in local mode and not in distributed mode: the Parse job, of which this plugin is a part, is also a Hadoop job, so the plugin's job becomes a nested Hadoop job.

The training part of the plugin is the only part that is a Hadoop job (the classification is not). I can therefore make a separate tool for training and keep only the classification part in the plugin, which is not a Hadoop job (and I have tested this in distributed mode).
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701315#comment-14701315 ]

Chris A. Mattmann commented on NUTCH-2049:
------------------------------------------

Great, thanks Lewis. The dependencies were introduced into core ivy/ivy.xml because, for whatever reason, putting them in the plugin ivy/ivy.xml wouldn't work. So anyway, all I'm saying is that we shouldn't trade functionality A for B - I want A & B :-) So Asitang and Lewis, please help make sure A & B stay.
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700895#comment-14700895 ]

Lewis John McGibbney commented on NUTCH-2049:
---------------------------------------------

Hi [~chrismattmann], please see NUTCH-1486. That patch should act as somewhat of a prerequisite to sorting out the dependency soup issue which was introduced in NUTCH-2038 via the introduction of old mahout-core, mahout-cli and transitive lucene-* dependencies, which were defined within core ivy/ivy.xml as opposed to ivy.xml at plugin level. These dependency issues are proposed to be resolved in NUTCH-2056; however, I've resolved them within NUTCH-1486, so if anything I would suggest that the best use of [~asitang]'s time would be to scope the latest patch I've posted on NUTCH-1486.

For clarification, the reason I've removed the plugin in the most recent patch on this issue is that the dependency soup is finally biting us on the back side. It is now time to sort it out with a fix.
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700551#comment-14700551 ]

Chris A. Mattmann commented on NUTCH-2049:
------------------------------------------

Thanks Lewis. [~asitang] please create an issue to upgrade your plugin to work on Hadoop 2.4 in distributed and/or local mode. Lewis, this patch can't take away functionality - it should only add it. Therefore, let's get Asitang's thing upgraded before committing this patch.
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700547#comment-14700547 ]

Lewis John McGibbney commented on NUTCH-2049:
---------------------------------------------

Update, tested on
* Amazon EMR's Hadoop 2.4.0
* Apache Hadoop 2.4.0 running in pseudo-distributed mode and
* Apache Hadoop 2.4.0 running Nutch in local mode.

All tests pass, all jobs are successful and I am able to complete full crawls on Hadoop 2.4.0.
It would be great if we could get further validation of this patch.
[~asitang] please note that this patch CANNOT be run with your parsefilter-naivebayes activated; take a look into the patch to see that it has been deactivated. As I stated above, _hopefully_ this is addressed in NUTCH-1486... if not, then we need to look at making it work seamlessly.
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694210#comment-14694210 ]

Michael Joyce commented on NUTCH-2049:
--------------------------------------

Hey [~lewismc],

Tried your patch here. Seems I have to add the following to the ivy.xml file to get this to work at all
{code}
{code}

Otherwise, I end up getting the following when I try to run a test crawl
{code}
Injector: starting at 2015-08-12 15:04:42
Injector: crawlDb: crawl/crawldb
Injector: urlDir: ../../urls_test
Injector: Converting injected urls to crawl db entries.
Injector: java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:470)
    at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:449)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:832)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
{code}

However, after addressing that concern I end up running into the following on the test crawl
{code}
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.SequenceFile$Writer$KeyClassOption cannot be cast to org.apache.hadoop.io.MapFile$Writer$Option
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.SequenceFile$Writer$KeyClassOption cannot be cast to org.apache.hadoop.io.MapFile$Writer$Option
    at org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:70)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2015-08-12 14:24:39,906 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:496)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:532)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:505)
{code}
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641402#comment-14641402 ]

Lewis John McGibbney commented on NUTCH-2049:
---------------------------------------------

BTW, this is only for 2.4.0, for the same reason as explained at the last issue. This is an upgrade of dependencies and API usage, NOT mapred --> mapreduce API's for each NutchJob. [~markus.jel...@openindex.io] had a great crack at trying to upgrade some... I would also join his ranks and make best efforts to make all jobs use the 2.X mapreduce API if it makes sense. It would be nice to have a Nutch roadmap TBH. Team, how do we feel here?

Tests are broken as follows
{code}
Testsuite: org.apache.nutch.crawl.TestCrawlDbFilter
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.986 sec
------------- Standard Output ---------------
2015-07-25 01:29:50,852 WARN util.NativeCodeLoader (NativeCodeLoader.java:(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-07-25 01:29:51,215 INFO compress.CodecPool (CodecPool.java:getCompressor(151)) - Got brand-new compressor [.deflate]
2015-07-25 01:29:51,231 INFO compress.CodecPool (CodecPool.java:getCompressor(151)) - Got brand-new compressor [.deflate]
2015-07-25 01:29:51,231 INFO crawl.CrawlDBTestUtil (CrawlDBTestUtil.java:createCrawlDb(67)) - adding:http://www.example.com
2015-07-25 01:29:51,232 INFO crawl.CrawlDBTestUtil (CrawlDBTestUtil.java:createCrawlDb(67)) - adding:http://www.example1.com
2015-07-25 01:29:51,235 INFO crawl.CrawlDBTestUtil (CrawlDBTestUtil.java:createCrawlDb(67)) - adding:http://www.example2.com
------------- ---------------- ---------------
------------- Standard Error -----------------
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/trunk_clean/build/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/trunk_clean/build/test/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
------------- ---------------- ---------------

Testcase: testUrl404Purging took 0.969 sec
    Caused an ERROR
Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:470)
    at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:449)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:832)
    at org.apache.nutch.crawl.TestCrawlDbFilter.testUrl404Purging(TestCrawlDbFilter.java:107)
{code}