I read HADOOP-3413 <https://issues.apache.org/jira/browse/HADOOP-3413> a bit
more closely - it updates SequenceFile.Reader, not SequenceFileInputFormat,
which is what the MapReduce framework uses... looks like you have to write
your own input format, or have your mappers/reducers take raw bytes and
deserialize within...
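
If you go the raw-bytes route, a sketch like the one below might do it. This
is hypothetical code against the old org.apache.hadoop.mapred API (the
0.18-era one used in this thread); MyObject stands in for your Serializable
class, and keys/values are assumed to be stored as Text/BytesWritable:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: the mapper receives raw bytes and deserializes with plain
// java.io, sidestepping the io.serializations machinery entirely.
// MyObject is a placeholder for your own java.io.Serializable class.
public class RawBytesMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, BytesWritable> {

  public void map(Text key, BytesWritable value,
                  OutputCollector<Text, BytesWritable> output,
                  Reporter reporter) throws IOException {
    // BytesWritable's backing array may be padded, so respect getSize().
    ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(value.get(), 0, value.getSize()));
    try {
      MyObject obj = (MyObject) in.readObject();
      // ... process obj here; re-serialize before collecting output ...
    } catch (ClassNotFoundException e) {
      throw new IOException("can't deserialize value: " + e.getMessage());
    } finally {
      in.close();
    }
  }
}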


On Wed, Sep 17, 2008 at 9:04 AM, Jason Grey <[EMAIL PROTECTED]> wrote:

> I just found this one this morning - looks like a fix should be in 0.18.0
> according to the bug tracker:
>
> https://issues.apache.org/jira/browse/HADOOP-3413
>
> I'm going to go double check all my code, as I'm pretty sure I am on 0.18.0
> already
>
> -jg-
>
>
>
> On Tue, Sep 16, 2008 at 9:10 PM, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
>> Unfortunately I don't know of a solution to your problem, but I've been
>> experiencing the exact same issues while trying to implement a Protocol
>> Buffer serialization.  Take a look:
>>
>> <https://issues.apache.org/jira/browse/HADOOP-3788>
>>
>> I hope this helps others to diagnose your problem.
>>
>> Alex
>>
>> On Wed, Sep 17, 2008 at 12:47 AM, Jason Grey <[EMAIL PROTECTED]> wrote:
>>
>> > *HeadlineDocument* in the code below is equivalent to *MyObject* - I
>> > forgot to obfuscate that one... oops...
>> >
>> > On Tue, Sep 16, 2008 at 11:46 AM, Jason Grey <[EMAIL PROTECTED]> wrote:
>> >
>> > > I'm trying to use JavaSerialization for a series of MapReduce jobs,
>> > > and when it comes to reading a SequenceFile using
>> > > SequenceFileInputFormat with JavaSerialized objects, something breaks
>> > > down.
>> > >
>> > > I've added "org.apache.hadoop.io.serializer.JavaSerialization" to the
>> > > io.serializations property in my config, and I'm using native Java
>> > > types in my mapper and reducer implementations, like so:
>> > >
>> > > MyMapper implements Mapper<String,MyObject,String,MyObject>
>> > > MyReducer implements Reducer<String,MyObject,String,MyObject>
>> > >
>> > > In my job configuration, I'm doing this:
>> > >
>> > > conf.setInputFormat(SequenceFileInputFormat.class);
>> > > FileInputFormat.setInputPaths(conf, path1, path2);
>> > > conf.setOutputFormat(SequenceFileOutputFormat.class);
>> > > FileOutputFormat.setOutputPath(conf, path3);
>> > > conf.setOutputKeyClass(String.class);
>> > > conf.setOutputKeyComparatorClass(JavaSerializationComparator.class);
>> > > conf.setOutputValueClass(MyObject.class);
>> > > conf.setMapperClass(MyMapper.class);
>> > > conf.setReducerClass(MyReducer.class);
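
For reference, the io.serializations registration mentioned above might look
something like this on the same conf - a sketch, assuming nothing else in
the job overwrites the property; WritableSerialization has to stay on the
list, since Hadoop's own Writable types are deserialized through it too:

conf.set("io.serializations",
    "org.apache.hadoop.io.serializer.WritableSerialization,"
        + "org.apache.hadoop.io.serializer.JavaSerialization");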
>> > >
>> > > When I run the job and output the keys & values from the mapper to
>> > > System.out, it doesn't seem like the key & value are getting populated
>> > > correctly - the key is NULL, and the value is a new, empty instance of
>> > > MyObject.
>> > >
>> > > The files this job is reading were output by another job that used a
>> > > custom InputFormat, so it didn't have the same problem, and I have
>> > > validated using a SequenceFile.Reader that the data is actually there
>> > > and non-null. One strange thing I had to do to get the reader to work
>> > > is this (see the *BOLD* text - I had to add those reassignments in
>> > > order for the values to show up; the deserializer apparently returns
>> > > a new object instead of filling in the one you pass, which I think
>> > > may have something to do with why SequenceFileInputFormat is having
>> > > trouble as well...)
>> > >
>> > > String key = new String();
>> > > while (*(key = (String) *r.next(key)) != null) {
>> > >      HeadlineDocument value = new HeadlineDocument();
>> > >      *value = (HeadlineDocument) *r.getCurrentValue(value);
>> > >      System.out.println("Key: " + key.toString());
>> > >      System.out.println("Value: " + value.toString());
>> > > }
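
The asterisks above are just bold markers, so for clarity here is the same
loop as plain compilable code, with the key point commented - a sketch
against the 0.18-era SequenceFile.Reader, whose next(Object) and
getCurrentValue(Object) return the deserialized object:

String key = "";
// next() hands back the deserialized key (null at EOF); the object passed
// in is only a candidate for reuse, so keep the returned reference.
while ((key = (String) r.next(key)) != null) {
    HeadlineDocument value = new HeadlineDocument();
    // Likewise for the value: use what getCurrentValue() returns.
    value = (HeadlineDocument) r.getCurrentValue(value);
    System.out.println("Key: " + key);
    System.out.println("Value: " + value);
}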
>> > >
>> > > Anyone got any hints as to how one uses JavaSerialization properly in
>> > > the INPUT phase of a MapReduce job?
>> > >
>> > > Thanks for any help
>> > >
>> > > -jg-
>> > >
>> >
>>
>
>
