Which patch are you referring to?  The patch I just added *only*
addressed the index/segments confusion and was created by executing
'svn diff' from the trunk root.

-lincoln

--
lincolnritter.com



On Thu, Jun 12, 2008 at 3:32 PM, Andrzej Bialecki  (JIRA)
<[EMAIL PROTECTED]> wrote:
>
>    [ 
> https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604667#action_12604667
>  ]
>
> Andrzej Bialecki  commented on NUTCH-634:
> -----------------------------------------
>
> The attached diff is not a valid patch created with 'svn diff'. Please create 
> a patch using 'svn diff', from the top of the source tree of Nutch trunk/.
>
> I'm not sure whether the FileOnlySequenceFileOutputFormat is the right answer 
> to the problem of _logs directories ... I think the existence of these 
> directories is caused by a setting in Hadoop configuration, 
> hadoop.job.history.user.location, which defaults to the output directory 
> (which sounds awfully strange to me to use this as a default!). Further 
> investigation is needed before we mess up things on our side. ;)
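If the _logs directories do turn out to come from that setting, a hadoop-site.xml override along these lines might be the simpler fix. This is only a sketch: the property name is taken from the comment above, and the "none" value is how the 0.17-era Hadoop documentation describes disabling the per-job history copy, so it should be verified before relying on it.

```
<!-- Sketch, unverified against 0.17: stop the per-job history files
     from being written into the job output directory as _logs/ -->
<property>
  <name>hadoop.job.history.user.location</name>
  <value>none</value>
</property>
```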
>
> The code formatting on these two new files and in some other places doesn't 
> conform to the Nutch formatting, which is basically the Sun style with 2 
> space indents. Please note also that you use different curly brace placement 
> than the Sun style advises.
>
> Generics on the CrawlDbReducer are too general, instead of
>
> bq. implements 
> Reducer<WritableComparable,Writable,WritableComparable,Writable>
>
> it should be
>
> bq. implements Reducer<Text, CrawlDatum, Text, CrawlDatum>
>
> Similar tightening should be done in other places where you added generics.
>
> The CrawlDatum.shallowCopy() method is dangerous IMHO - newly created copies 
> still contain references to the same metaData instance, which may be modified 
> any time by the framework as you iterate through the input items. We should 
> do a deep clone using WritableUtils.clone().
>
> IndexDoc.copyConstructor() should be replaced by a deep clone().
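The shallow-copy hazard is easy to reproduce outside Hadoop. Below is a minimal, self-contained sketch using a hypothetical Datum stand-in (not the real CrawlDatum, which lives in org.apache.nutch.crawl); it shows why a copy that shares the metaData reference is unsafe when the framework reuses and mutates the original object, and how a deep copy isolates the data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum, for illustration only.
class Datum {
    Map<String, String> metaData = new HashMap<>();

    // Shallow copy: the new object shares the same metaData instance.
    Datum shallowCopy() {
        Datum d = new Datum();
        d.metaData = this.metaData; // shared reference -- dangerous
        return d;
    }

    // Deep copy: metaData is cloned, so later mutation of the original
    // (e.g. by the framework reusing the object between records) is isolated.
    Datum deepCopy() {
        Datum d = new Datum();
        d.metaData = new HashMap<>(this.metaData);
        return d;
    }
}

public class CopyDemo {
    public static void main(String[] args) {
        Datum original = new Datum();
        original.metaData.put("score", "1.0");

        Datum shallow = original.shallowCopy();
        Datum deep = original.deepCopy();

        // Simulate the framework mutating the reused object.
        original.metaData.put("score", "overwritten");

        System.out.println(shallow.metaData.get("score")); // prints "overwritten"
        System.out.println(deep.metaData.get("score"));    // prints "1.0"
    }
}
```

In real Nutch code the equivalent of deepCopy() would be a clone via WritableUtils.clone(), as suggested above, rather than a hand-rolled map copy.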
>
>
>> Patch - Nutch - Hadoop 0.17.0
>> -----------------------------
>>
>>                 Key: NUTCH-634
>>                 URL: https://issues.apache.org/jira/browse/NUTCH-634
>>             Project: Nutch
>>          Issue Type: Improvement
>>    Affects Versions: 0.9.0
>>            Reporter: Michael Gottesman
>>            Assignee: Andrzej Bialecki
>>             Fix For: 0.9.0
>>
>>         Attachments: diff, hadoop-0.17.patch
>>
>>
>> This is a patch so that Nutch can be used with Hadoop 0.17.0. The patch is 
>> located at http://pastie.org/212001
>> The patch compiles and passes all current Nutch unit tests.
>> I have tested that the crawler side of Nutch (i.e. inject, generate, fetch, 
>> parse, merge w/crawldb) definitely works, but have not tested the Lucene 
>> indexing part. It might work, but it might not.
>> *NOTE* - the two main bugs that had to be overcome were not noticed by any 
>> of the unit tests. The bugs only came up during actual testing. The bugs 
>> were:
>> 1. Changes to the Hadoop Iterator
>> 2. Addition of Serialization to MapReduce Framework
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
