[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-24 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710052#comment-14710052
 ] 

Lewis John McGibbney commented on NUTCH-2049:
-

Thanks for committing [~jnioche]

> Upgrade Trunk to Hadoop > 2.4 stable
> 
>
> Key: NUTCH-2049
> URL: https://issues.apache.org/jira/browse/NUTCH-2049
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2049.patch, NUTCH-2049v2.patch, NUTCH-2049v3.patch
>
>
> Convo here - http://www.mail-archive.com/dev%40nutch.apache.org/msg18225.html
> I am +1 for taking trunk (or a branch of trunk) to explicit dependency on > 
> Hadoop 2.6.
> We can run our tests, we can validate, we can fix.
> I will be doing validation on 2.X in parallel as this is what I use on my 
> own projects. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709912#comment-14709912
 ] 

Hudson commented on NUTCH-2049:
---

SUCCESS: Integrated in Nutch-trunk #3261 (See 
[https://builds.apache.org/job/Nutch-trunk/3261/])
NUTCH-2049 Upgrade Trunk to Hadoop > 2.4 stable (jnioche: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1697466)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/bin/crawl
* /nutch/trunk/src/bin/nutch
* /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/DeduplicationJob.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/plugin/Extension.java
* /nutch/trunk/src/java/org/apache/nutch/plugin/PluginManifestParser.java
* /nutch/trunk/src/java/org/apache/nutch/segment/ContentAsTextInputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
* /nutch/trunk/src/java/org/apache/nutch/service/JobManager.java
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java
* /nutch/trunk/src/java/org/apache/nutch/tools/FileDumper.java
* /nutch/trunk/src/java/org/apache/nutch/util/HadoopFSUtil.java
* /nutch/trunk/src/java/org/apache/nutch/util/LockUtil.java
* /nutch/trunk/src/java/org/apache/nutch/util/domain/DomainStatistics.java
* 
/nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesClassifier.java
* 
/nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java
* /nutch/trunk/src/test/crawl-tests.xml
* /nutch/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbFilter.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/TestGenerator.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/TestInjector.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java
* /nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java
* /nutch/trunk/src/test/org/apache/nutch/net/TestURLNormalizers.java
* /nutch/trunk/src/test/org/apache/nutch/segment/TestSegmentMerger.java
* 
/nutch/trunk/src/test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
* /nutch/trunk/src/test/org/apache/nutch/tools/proxy/SegmentHandler.java




[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-21 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706884#comment-14706884
 ] 

Chris A. Mattmann commented on NUTCH-2049:
--

+1 to commit this. Great work team.



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-21 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706789#comment-14706789
 ] 

Sebastian Nagel commented on NUTCH-2049:


+1 to commit. As said, looking into the performance of the unit tests can be 
done later; some details below.

{noformat}
% time ant clean runtime test
(before)
Total time: 5 minutes 34 seconds
real    5m35.133s
user    7m30.968s
sys     0m21.528s

(after patching)
Total time: 6 minutes 39 seconds
real    6m39.794s
user    9m31.444s
sys     0m26.780s
{noformat}

These tests show significant differences, `-' before, `+' after patching:
{noformat}
 [junit] Running org.apache.nutch.crawl.TestGenerator
-[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 32.846 sec
+[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 36.279 sec
...
 [junit] Running org.apache.nutch.fetcher.TestFetcher
-[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.068 sec
+[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.273 sec
...
 [junit] Running org.apache.nutch.parse.TestParserFactory
-[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.783 sec
+[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.038 sec
...
 [junit] Running org.apache.nutch.segment.TestSegmentMerger
-[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 75.408 sec
+[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 91.652 sec
 [junit] Running org.apache.nutch.segment.TestSegmentMergerCrawlDatums
-[junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 69.821 sec
+[junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 84.443 sec
{noformat}
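
To put the figures above in relative terms, here is a minimal sketch that computes the percentage slowdown from the timings quoted in this comment. The class and method names (`TestTimingDelta`, `slowdownPercent`) are made up for illustration. The numbers are copied from the output above, with the ant totals converted to seconds (5 min 34 s = 334 s, 6 min 39 s = 399 s).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TestTimingDelta {

    /** Relative slowdown in percent, rounded to one decimal place. */
    public static double slowdownPercent(double beforeSec, double afterSec) {
        double pct = (afterSec - beforeSec) / beforeSec * 100.0;
        return Math.round(pct * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // {before, after} in seconds, copied from the timings quoted above.
        Map<String, double[]> timings = new LinkedHashMap<>();
        timings.put("ant total", new double[] {334.0, 399.0});
        timings.put("TestGenerator", new double[] {32.846, 36.279});
        timings.put("TestFetcher", new double[] {12.068, 14.273});
        timings.put("TestParserFactory", new double[] {2.783, 4.038});
        timings.put("TestSegmentMerger", new double[] {75.408, 91.652});
        timings.put("TestSegmentMergerCrawlDatums", new double[] {69.821, 84.443});

        for (Map.Entry<String, double[]> e : timings.entrySet()) {
            double[] t = e.getValue();
            System.out.printf("%-30s +%.1f%%%n", e.getKey(), slowdownPercent(t[0], t[1]));
        }
    }
}
```

By this arithmetic the overall run is roughly 19% slower, and the five listed tests together account for about 38 of the 65 additional seconds.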



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706402#comment-14706402
 ] 

Julien Nioche commented on NUTCH-2049:
--

Fantastic work [~lewismc]! I think this is one of the most important changes to 
Nutch in recent years. Well done.
Compilation and tests all fine, crawl in local mode OK. 

+1 to commit 



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705904#comment-14705904
 ] 

Lewis John McGibbney commented on NUTCH-2049:
-

Hi [~wastl-nagel], thanks for the comment.
bq.  Interestingly, the unit tests seem to take longer (5 -> 6 min. on my 
laptop). That's not a blocker, but would be good to know why?
Excellent observation... I need to be honest and say that I didn't even notice. 
Maybe some [tracing|https://issues.apache.org/jira/browse/NUTCH-2005] would be 
a good idea ;)



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-20 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705895#comment-14705895
 ] 

Sebastian Nagel commented on NUTCH-2049:


Great job, Lewis! No time to test with real crawls now. Interestingly, the unit 
tests seem to take longer (5 -> 6 min. on my laptop). That's not a blocker, but 
would be good to know why?



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-18 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701707#comment-14701707
 ] 

Michael Joyce commented on NUTCH-2049:
--

Great stuff Lewis. Builds and runs cleanly locally for me. I also scoped a test 
that was run on EMR with 2.4.0 and all looks good.



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-18 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701514#comment-14701514
 ] 

Asitang Mishra commented on NUTCH-2049:
---

Ack!!



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701497#comment-14701497
 ] 

Chris A. Mattmann commented on NUTCH-2049:
--

Asitang, if you recall, we discussed simply figuring out the Hadoop cluster's 
server name - there is nothing stopping us from running a Hadoop job inside a 
Hadoop job. I would suggest you try going down that path to sense the Hadoop 
TaskTracker host (via Context or other properties) and to pass that down to 
Mahout.

Also I think a good improvement would be to separate out the training tool too. 

Can you please work on both?



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-18 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701498#comment-14701498
 ] 

Asitang Mishra commented on NUTCH-2049:
---

Hi Lewis,

I had some issues applying your patch the last time. I will give it another try 
with the latest one and report whether the plugin works with it.
 
Cheers



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-18 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701495#comment-14701495
 ] 

Asitang Mishra commented on NUTCH-2049:
---

Hi Chris,

The Naive Bayes plugin has a Hadoop job of its own, so it only works in local 
mode and not in distributed mode. The parse job, of which this plugin is a 
part, is also a Hadoop job, so the result is a nested Hadoop job. 

Since the training part of the plugin is the only part that is a Hadoop job 
(the classification is not), I can make a separate tool for training and keep 
only the classification part in the plugin. The classification part is not a 
Hadoop job, and I have tested it in distributed mode.

 



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701315#comment-14701315
 ] 

Chris A. Mattmann commented on NUTCH-2049:
--

Great, thanks Lewis. The dependencies went into core ivy/ivy.xml because, for 
whatever reason, putting them in the plugin ivy/ivy.xml wouldn't work. 
So anyway, all I'm saying is that we shouldn't trade functionality A 
for B - I want A & B :-) So Asitang and Lewis, please help make sure A & B stay.



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-18 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700895#comment-14700895
 ] 

Lewis John McGibbney commented on NUTCH-2049:
-

Hi [~chrismattmann], please see NUTCH-1486. That patch should act as somewhat of 
a prerequisite to sorting out the dependency soup issue, which was introduced in 
NUTCH-2038 via the introduction of old mahout-core, mahout-cli and transitive 
lucene-* dependencies that were defined within core ivy/ivy.xml as opposed to 
ivy.xml at plugin level. 
These dependency issues are proposed to be resolved in NUTCH-2056; however, I've 
resolved them within NUTCH-1486, so if anything I would suggest that 
[~asitang] scope the latest patch I've posted on NUTCH-1486, as this would be 
the best use of his time.
For clarification, the reason I've removed the plugin in the most recent patch 
on this issue is that the dependency soup is finally biting us on the back 
side. It is now time to sort it out with a fix.



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-17 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700551#comment-14700551
 ] 

Chris A. Mattmann commented on NUTCH-2049:
--

Thanks Lewis. [~asitang] please create an issue to upgrade your plugin to work 
on Hadoop 2.4 in distributed and/or local mode. Lewis, this patch can't 
take away functionality - it should only add it. Therefore, let's get Asitang's 
thing upgraded before committing this patch.



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700547#comment-14700547
 ] 

Lewis John McGibbney commented on NUTCH-2049:
-

Update, tested on 
* Amazon EMR's Hadoop 2.4.0
* Apache Hadoop 2.4.0 running in pseudo-distributed mode and 
* Apache Hadoop 2.4.0 running Nutch in local mode. 
All tests pass, all jobs are successful and I am able to complete full crawls 
on Hadoop 2.4.0.
Would be great if we could get further validation of this patch.
[~asitang] please note that this patch CANNOT be run with your 
parsefilter-naivebayes activated, take a look into the patch to see that it has 
been deactivated. As I stated above, _hopefully_ this is addressed in 
NUTCH-1486... if not, then we need to look at making it work seamlessly.



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-08-12 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694210#comment-14694210
 ] 

Michael Joyce commented on NUTCH-2049:
--

Hey [~lewismc],

Tried your patch here. Seems I have to add the following to the ivy.xml file to 
get this to work at all

{code}

{code}

Otherwise, I end up getting the following when I try to run a test crawl

{code}
Injector: starting at 2015-08-12 15:04:42
Injector: crawlDb: crawl/crawldb
Injector: urlDir: ../../urls_test
Injector: Converting injected urls to crawl db entries.
Injector: java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:470)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:449)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:832)
at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
{code}
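
For context, Hadoop 2.x raises this `Cannot initialize Cluster` error when none of the `ClientProtocolProvider` implementations it discovers on the classpath accepts the configured `mapreduce.framework.name`. The dependency snippet above did not survive in this archive, so which artifact was added is not known here. As a hedged diagnostic sketch, the check below only tests whether the two provider classes that Hadoop 2.x ships are visible. As far as I am aware, these are `LocalClientProtocolProvider` in hadoop-mapreduce-client-common and `YarnClientProtocolProvider` in hadoop-mapreduce-client-jobclient. The class name `ProtocolProviderCheck` and its helper are hypothetical.

```java
public class ProtocolProviderCheck {

    /** Returns true if the named class is visible on the current classpath (it is not initialised). */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ProtocolProviderCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Provider class names as shipped with Hadoop 2.x (assumption, see the text above).
        String[] providers = {
            "org.apache.hadoop.mapred.LocalClientProtocolProvider",
            "org.apache.hadoop.mapred.YarnClientProtocolProvider"
        };
        for (String provider : providers) {
            System.out.println(provider + " -> "
                + (isOnClasspath(provider) ? "found" : "MISSING"));
        }
    }
}
```

Run with the same classpath as the failing job: if the local provider is reported missing while `mapreduce.framework.name` is `local`, a missing MapReduce client jar is a plausible cause.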

However, after addressing that concern I end up running into the following on 
the test crawl

{code}
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.SequenceFile$Writer$KeyClassOption cannot be cast to org.apache.hadoop.io.MapFile$Writer$Option
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.SequenceFile$Writer$KeyClassOption cannot be cast to org.apache.hadoop.io.MapFile$Writer$Option
at org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:70)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-08-12 14:24:39,906 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:496)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:532)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:505)
{code}
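
The `ClassCastException` above suggests that an option object created for `SequenceFile.Writer` reached a cast to `MapFile.Writer.Option`. In Hadoop 2.x, `MapFile.Writer.Option` is a sub-interface of `SequenceFile.Writer.Option`, so the downcast only succeeds for options created through `MapFile.Writer` factory methods such as `MapFile.Writer.keyClass(...)`. The sketch below reproduces that type relationship with stand-in types only. It does not use Hadoop, and every name in it is hypothetical. Whether this is exactly what line 70 of the patched `FetcherOutputFormat` did is an assumption.

```java
public class OptionCastDemo {

    /** Stand-in for SequenceFile.Writer.Option (not the Hadoop type). */
    public interface SeqOption {}

    /** Stand-in for MapFile.Writer.Option, which extends the SequenceFile one. */
    public interface MapOption extends SeqOption {}

    /** Stand-in for the key-class option produced by the SequenceFile factory. */
    static final class SeqKeyClassOption implements SeqOption {
        final Class<?> keyClass;
        SeqKeyClassOption(Class<?> keyClass) { this.keyClass = keyClass; }
    }

    /** Stand-in for the key-class option produced by the MapFile factory. */
    static final class MapKeyClassOption implements MapOption {
        final Class<?> keyClass;
        MapKeyClassOption(Class<?> keyClass) { this.keyClass = keyClass; }
    }

    public static SeqOption seqKeyClass(Class<?> c) { return new SeqKeyClassOption(c); }

    public static MapOption mapKeyClass(Class<?> c) { return new MapKeyClassOption(c); }

    /** Returns true if the option can be downcast to MapOption, false on ClassCastException. */
    public static boolean castsToMapOption(SeqOption option) {
        try {
            MapOption ignored = (MapOption) option;
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The SequenceFile-style option fails the downcast; the MapFile-style one passes.
        System.out.println("seqKeyClass casts to MapOption: "
            + castsToMapOption(seqKeyClass(String.class)));
        System.out.println("mapKeyClass casts to MapOption: "
            + castsToMapOption(mapKeyClass(String.class)));
    }
}
```

If the same pattern applies to the real code, building the key-class option with the `MapFile.Writer` factory rather than the `SequenceFile.Writer` one would avoid the cast failure.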



[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop > 2.4 stable

2015-07-24 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641402#comment-14641402
 ] 

Lewis John McGibbney commented on NUTCH-2049:
-

BTW, this is only for 2.4.0, for the same reason as explained on the last issue. 
This is an upgrade of dependencies and API usage, NOT a mapred --> mapreduce 
API migration for each NutchJob.
[~markus.jel...@openindex.io] had a great crack at trying to upgrade some... I 
would also join his ranks and make best efforts to move all jobs to the 2.X 
mapreduce API if it makes sense. It would be nice to have a Nutch roadmap TBH.
Team, how do we feel here?
Tests are broken as follows
{code}
Testsuite: org.apache.nutch.crawl.TestCrawlDbFilter
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.986 sec
------------- Standard Output ---------------
2015-07-25 01:29:50,852 WARN  util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-07-25 01:29:51,215 INFO  compress.CodecPool (CodecPool.java:getCompressor(151)) - Got brand-new compressor [.deflate]
2015-07-25 01:29:51,231 INFO  compress.CodecPool (CodecPool.java:getCompressor(151)) - Got brand-new compressor [.deflate]
2015-07-25 01:29:51,231 INFO  crawl.CrawlDBTestUtil (CrawlDBTestUtil.java:createCrawlDb(67)) - adding:http://www.example.com
2015-07-25 01:29:51,232 INFO  crawl.CrawlDBTestUtil (CrawlDBTestUtil.java:createCrawlDb(67)) - adding:http://www.example1.com
2015-07-25 01:29:51,235 INFO  crawl.CrawlDBTestUtil (CrawlDBTestUtil.java:createCrawlDb(67)) - adding:http://www.example2.com
------------- ---------------- ---------------
------------- Standard Error -----------------
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/trunk_clean/build/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/trunk_clean/build/test/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
------------- ---------------- ---------------

Testcase: testUrl404Purging took 0.969 sec
Caused an ERROR
Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:470)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:449)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:832)
at org.apache.nutch.crawl.TestCrawlDbFilter.testUrl404Purging(TestCrawlDbFilter.java:107)
{code} 
