[ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604667#action_12604667 ]
Andrzej Bialecki commented on NUTCH-634:
-----------------------------------------

The attached diff is not a valid patch created with 'svn diff'. Please create a patch using 'svn diff', from the top of the source tree of Nutch trunk/.

I'm not sure whether the FileOnlySequenceFileOutputFormat is the right answer to the problem of _logs directories ... I think the existence of these directories is caused by a setting in the Hadoop configuration, hadoop.job.history.user.location, which defaults to the output directory (which sounds like an awfully strange default to me!). Further investigation is needed before we mess up things on our side. ;)

The code formatting in these two new files and in some other places doesn't conform to the Nutch formatting, which is basically the Sun style with 2-space indents. Please note also that you use a different curly brace placement than the Sun style advises.

Generics on the CrawlDbReducer are too general; instead of

bq. implements Reducer<WritableComparable,Writable,WritableComparable,Writable>

it should be

bq. implements Reducer<Text, CrawlDatum, Text, CrawlDatum>

Similar tightening should be done in other places where you added generics.

The CrawlDatum.shallowCopy() method is dangerous IMHO - newly created copies still contain references to the same metaData instance, which may be modified at any time by the framework as you iterate through the input items. We should do a deep clone using WritableUtils.clone().

IndexDoc.copyConstructor() should be replaced by a deep clone().

> Patch - Nutch - Hadoop 0.17.0
> -----------------------------
>
>                 Key: NUTCH-634
>                 URL: https://issues.apache.org/jira/browse/NUTCH-634
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Michael Gottesman
>            Assignee: Andrzej Bialecki
>             Fix For: 0.9.0
>
>         Attachments: diff, hadoop-0.17.patch
>
>
> This is a patch so that Nutch can be used with Hadoop 0.17.0.
> The patch is located at http://pastie.org/212001
> The patch compiles and passes all current Nutch unit tests.
> I have tested that the crawler side of Nutch (i.e. inject, generate, fetch, parse, merge w/crawldb) definitely works, but have not tested the Lucene indexing part. It might work, but it might not.
> *NOTE* - the two main bugs that had to be overcome were not noticed by any of the unit tests. The bugs only came up during actual testing. The bugs were:
> 1. Changes to the Hadoop Iterator
> 2. Addition of Serialization to MapReduce Framework

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
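
To illustrate the shallowCopy() concern raised in the comment, here is a minimal sketch in plain Java. It does not use Nutch or Hadoop classes: Datum, reusingIterator and collect are hypothetical stand-ins, and the sketch assumes that the framework hands out one reused value instance on every call to next(). Under that assumption, copies that share the metadata map all end up showing the last item, while copies that duplicate the map keep their own values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class CopySketch {

  /** Hypothetical stand-in for CrawlDatum: a score plus mutable metadata. */
  static class Datum {
    float score;
    Map<String, String> metaData = new HashMap<>();

    /** Copies the score but shares the metaData reference with the original. */
    Datum shallowCopy() {
      Datum d = new Datum();
      d.score = score;
      d.metaData = metaData;
      return d;
    }

    /** Copies the score and the contents of the metaData map. */
    Datum deepCopy() {
      Datum d = new Datum();
      d.score = score;
      d.metaData = new HashMap<>(metaData);
      return d;
    }
  }

  /** Assumption: mimics an iterator that reuses one value object per next(). */
  static Iterator<Datum> reusingIterator(String... keys) {
    final Datum shared = new Datum();
    return new Iterator<Datum>() {
      int i = 0;

      public boolean hasNext() {
        return i < keys.length;
      }

      public Datum next() {
        // Overwrite the shared instance in place instead of allocating a new one.
        shared.score = i;
        shared.metaData.clear();
        shared.metaData.put("key", keys[i]);
        i++;
        return shared;
      }
    };
  }

  /** Keeps a copy of every item, then reads back each copy's "key" entry. */
  static List<String> collect(boolean deep, String... keys) {
    List<Datum> kept = new ArrayList<>();
    Iterator<Datum> it = reusingIterator(keys);
    while (it.hasNext()) {
      Datum d = it.next();
      kept.add(deep ? d.deepCopy() : d.shallowCopy());
    }
    List<String> out = new ArrayList<>();
    for (Datum d : kept) {
      out.add(d.metaData.get("key"));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println("shallow: " + collect(false, "a", "b", "c")); // [c, c, c]
    System.out.println("deep:    " + collect(true, "a", "b", "c"));  // [a, b, c]
  }
}
```

In real Nutch code the deep copy would presumably come from WritableUtils.clone(), as suggested in the comment, rather than from a hand-written method like deepCopy() here.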