Hello,

We are starting a project that uses MapReduce to produce Avro files. In short,
our job produces Avro records that can contain very large arrays, and we can't
practically predict how large some of them will get.

When we hit one of these "very large" records, the BufferedBinaryEncoder seems 
to blow out the heap when calling 
org.apache.avro.mapred.AvroMultipleOutputs$1.collect() from a reducer (see 
stack trace below).

Browsing through the Avro code and the JIRA issues, it seems that AVRO-105
could be part of the solution here. I believe we would want to use the
BlockingBinaryEncoder (or perhaps even the DirectBinaryEncoder?) to write
these large arrays in a memory-efficient manner.

Am I on the right track here? If so, it also seems that we would need an
additional feature to configure/enable this from mapred via the JobConf.

Since I'm not yet that familiar with the internals of Avro, I would
appreciate it if anyone could give me a sanity check and/or suggest other
ways to work around this problem.

Thanks in advance for your help,
-Mike


Error running child : java.lang.OutOfMemoryError: Java heap space
         at java.util.Arrays.copyOf(Arrays.java:2786)
         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
         at org.apache.avro.io.BufferedBinaryEncoder$OutputStreamSink.innerWrite(BufferedBinaryEncoder.java:216)
         at org.apache.avro.io.BufferedBinaryEncoder.flushBuffer(BufferedBinaryEncoder.java:93)
         at org.apache.avro.io.BufferedBinaryEncoder.ensureBounds(BufferedBinaryEncoder.java:108)
         at org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:153)
         at org.apache.avro.io.Encoder.writeFixed(Encoder.java:174)
         at org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:164)
         at org.apache.avro.io.BinaryEncoder.writeBytes(BinaryEncoder.java:65)
         at org.apache.avro.generic.GenericDatumWriter.writeBytes(GenericDatumWriter.java:212)
         at org.apache.avro.reflect.ReflectDatumWriter.writeBytes(ReflectDatumWriter.java:93)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:77)
         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
         at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:106)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
         at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:131)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68)
         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
         at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:106)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
         at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:257)
         at org.apache.avro.mapred.AvroOutputFormat$1.write(AvroOutputFormat.java:160)
         at org.apache.avro.mapred.AvroOutputFormat$1.write(AvroOutputFormat.java:157)
         at org.apache.avro.mapred.AvroMultipleOutputs$RecordWriterWithCounter.write(AvroMultipleOutputs.java:436)
         at org.apache.avro.mapred.AvroMultipleOutputs$1.collect(AvroMultipleOutputs.java:499)
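
The top frames of the trace (Arrays.copyOf under ByteArrayOutputStream.write,
called from the encoder's OutputStreamSink) suggest that the encoder is
flushing into an in-memory byte array rather than straight into the output
file. Below is a minimal sketch of that growth pattern using only java.io, with
no Avro involved. The names BufferGrowthDemo, InspectableBuffer, and
CountingSink are hypothetical and exist only for this illustration; the sizes
are kept small so it runs anywhere.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class BufferGrowthDemo {

    /** ByteArrayOutputStream that exposes the length of its backing array. */
    static final class InspectableBuffer extends ByteArrayOutputStream {
        int capacity() {
            return buf.length; // 'buf' is a protected field of ByteArrayOutputStream
        }
    }

    /** Sink that retains nothing and only counts bytes (stands in for a file stream). */
    static final class CountingSink extends OutputStream {
        long written;

        @Override
        public void write(int b) {
            written++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            written += len;
        }
    }

    /** Writes 'chunks' chunks of 'chunkSize' bytes to 'out', reusing one small array. */
    static void writeChunks(OutputStream out, int chunks, int chunkSize) {
        byte[] chunk = new byte[chunkSize];
        try {
            for (int i = 0; i < chunks; i++) {
                out.write(chunk, 0, chunk.length);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int chunks = 1024;
        int chunkSize = 4096; // 4 MiB in total

        // Every byte written stays on the heap; the backing array is
        // reallocated and copied (Arrays.copyOf) each time it fills up.
        InspectableBuffer inMemory = new InspectableBuffer();
        writeChunks(inMemory, chunks, chunkSize);
        System.out.println("in-memory bytes held: " + inMemory.size()
                + ", backing array length: " + inMemory.capacity());

        // A streaming sink sees the same bytes but keeps none of them.
        CountingSink streamed = new CountingSink();
        writeChunks(streamed, chunks, chunkSize);
        System.out.println("streamed bytes: " + streamed.written);
    }
}
```

If this is indeed what happens on the DataFileWriter.append path, then swapping
the encoder alone might not bound memory, since the encoded bytes would still
accumulate in the byte array until it is flushed to the file.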
