Thanks for the ideas!

It turns out it was my mistake after all. A silly one: I was feeding the 
comparator the full map output schema, not just the key part of the pair 
schema, so it was expecting more than just the key when comparing. The error 
also surfaces for more than just blank keys, though not consistently, which 
is why I suspected empty Utf8's were the problem.
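
For reference, this is roughly the shape of the fix in my comparator. The 
class mirrors my own AvroKeyComparator from the stack trace below, but the 
"avro.map.output.schema" config key and the field names are just what I use 
here, so treat this as a sketch rather than the exact patch:

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryData;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.RawComparator;

public class AvroKeyComparator<T> implements RawComparator<T>, Configurable {

  private Configuration conf;
  private Schema keySchema;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    Schema pairSchema = Schema.parse(conf.get("avro.map.output.schema"));
    // The bug: passing pairSchema itself to BinaryData.compare() below.
    // The comparator then also tries to skip the value part, runs past the
    // end of the serialized key bytes and throws the EOFException.
    keySchema = pairSchema.getField("key").schema();
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // Compare the raw key bytes using only the key schema.
    return BinaryData.compare(b1, s1, l1, b2, s2, l2, keySchema);
  }

  @Override
  public int compare(T o1, T o2) {
    // Only raw byte comparison is used during the sort phase.
    throw new UnsupportedOperationException();
  }
}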

Sorry for thinking it could be a bug outside of my own code…


Thanks,
Friso




On 3 sep. 2011, at 02:38, Scott Carey wrote:

> Some ideas:
> 
> A String is encoded as a Long length, followed by that number of bytes of
> UTF-8.
> An empty string is therefore encoded as the number 0L -- which is a single
> byte, 0x00.
> It appears that it is trying to skip a String or Long, but it is at the end
> of the byte[].
> 
> So it is expecting a Long or String to skip, and there is nothing there.
> Perhaps the empty String was not encoded as an empty string, but skipped
> entirely. Or perhaps a Long count or other number is missing (what is the
> Schema being compared?)
> 
> WordCount is often key = word, val = count, so it would need to read the
> String word and skip the Long count. If either of these is left out and
> not written, I would expect the sort of error below.
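> 
> For illustration, a quick way to check the empty-string encoding (an
> untested sketch against the 1.5-era EncoderFactory API; the class name is
> made up):
> 
> import java.io.ByteArrayOutputStream;
> import org.apache.avro.io.BinaryEncoder;
> import org.apache.avro.io.EncoderFactory;
> import org.apache.avro.util.Utf8;
> 
> public class EmptyStringCheck {
>   public static void main(String[] args) throws Exception {
>     ByteArrayOutputStream out = new ByteArrayOutputStream();
>     BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
>     enc.writeString(new Utf8(""));  // zig-zag varint length 0, no bytes
>     enc.flush();
>     byte[] bytes = out.toByteArray();
>     // Should print "1 byte(s): 0x00" -- just the encoded length 0L.
>     System.out.printf("%d byte(s): 0x%02x%n", bytes.length, bytes[0]);
>   }
> }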
> 
> I hope that helps,
> 
> -Scott
> 
> On 9/1/11 5:42 AM, "Friso van Vollenhoven" <fvanvollenho...@xebia.com>
> wrote:
> 
>> Hi All,
>> 
>> I am working on a modified version of the Avro MapReduce support to make
>> it play nice with the new Hadoop API (0.20.2). Most of the code is
>> borrowed from the Avro mapred package, but I decided not to fully
>> abstract away the Mapper and Reducer classes (as Avro currently does with
>> the HadoopMapper and HadoopReducer classes). Everything else is much the
>> same as the mapred implementation.
>> 
>> When testing, I ran into an issue when emitting empty strings (empty
>> Utf8's) from the mapper as keys. I get the following:
>> org.apache.avro.AvroRuntimeException: java.io.EOFException
>>      at org.apache.avro.io.BinaryData.compare(BinaryData.java:74)
>>      at org.apache.avro.io.BinaryData.compare(BinaryData.java:60)
>>      at org.apache.avro.mapreduce.AvroKeyComparator.compare(AvroKeyComparator.java:45)   <== this is my own code
>>      at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:120)
>>      at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>      at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>>      at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
>>      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
>>      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:256)
>> Caused by: java.io.EOFException
>>      at org.apache.avro.io.BinaryDecoder.readLong(BinaryDecoder.java:182)
>>      at org.apache.avro.generic.GenericDatumReader.skip(GenericDatumReader.java:389)
>>      at org.apache.avro.io.BinaryData.compare(BinaryData.java:86)
>>      at org.apache.avro.io.BinaryData.compare(BinaryData.java:72)
>>      ... 8 more
>> 
>> 
>> The root cause stack trace is as follows (taken from debugger, breakpoint
>> on the throw new EOFException(); line):
>> Thread [Thread-11] (Suspended (breakpoint at line 182 in BinaryDecoder))
>>      BinaryDecoder.readLong() line: 182
>>      GenericDatumReader<D>.skip(Schema, Decoder) line: 389
>>      BinaryData.compare(BinaryData$Decoders, Schema) line: 86
>>      BinaryData.compare(byte[], int, int, byte[], int, int, Schema) line: 72
>>      BinaryData.compare(byte[], int, byte[], int, Schema) line: 60
>>      AvroKeyComparator<T>.compare(byte[], int, int, byte[], int, int) line: 45
>>      Reducer$Context(ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).nextKeyValue() line: 120
>>      Reducer$Context(ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).nextKey() line: 92
>>      AvroMapReduceTest$WordCountingAvroReducer(Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).run(Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>.Context) line: 175
>>      ReduceTask.runNewReducer(JobConf, TaskUmbilicalProtocol, TaskReporter, RawKeyValueIterator, RawComparator<INKEY>, Class<INKEY>, Class<INVALUE>) line: 572
>>      ReduceTask.run(JobConf, TaskUmbilicalProtocol) line: 414
>>      LocalJobRunner$Job.run() line: 256
>> 
>> I went through the decoding code to see where this comes from, but I
>> can't immediately spot where it goes wrong. My guess is that the actual
>> problem occurs earlier during execution, where pos is possibly advanced
>> too often.
>> 
>> Has anyone experienced this? I can live without emitting empty keys from
>> MR jobs, but I ran into this while implementing a word count job on a
>> text file with empty lines (counting those could be a valid use case). I
>> am using Avro 1.5.2.
>> 
>> Thanks for any clues.
>> 
>> 
>> Cheers,
>> Friso
>> 
> 
> 
