For each piece of data that we process, we need to convert it to {Avro
data + schema}, except that instead of writing the result to a file, we
need each piece of data to end up in a byte[].

Logically, the equivalent would be writing each piece of data to a file
and then reading it back in as a byte array for further processing - but
that's obviously inefficient.
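
For concreteness, that file round-trip would be something like the sketch
below - purely illustrative, assuming a generated User class and a populated
instance named user, as in the snippet that follows:

import java.io.File;
import java.nio.file.Files;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

// The wasteful baseline: write a one-record Avro container file, read it back.
File tmp = File.createTempFile("record", ".avro");
DatumWriter<User> datumWriter = new SpecificDatumWriter<>(User.getClassSchema());
try (DataFileWriter<User> fileWriter = new DataFileWriter<>(datumWriter)) {
    fileWriter.create(User.getClassSchema(), tmp);  // container header embeds the schema
    fileWriter.append(user);
}
byte[] serialized = Files.readAllBytes(tmp.toPath());  // schema + data, via the filesystem
tmp.delete();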

I see examples of writing to a byte array without using DataFileWriter (see
below), but the result doesn't include the schema.

ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
DatumWriter<User> writer = new SpecificDatumWriter<>(User.getClassSchema());

writer.write(user, encoder);   // writes the raw binary record body only
encoder.flush();
out.close();                   // a no-op for ByteArrayOutputStream, but harmless
byte[] serializedBytes = out.toByteArray();  // no schema, no container header
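
One alternative that makes each payload self-describing, to a degree, is
Avro's single-object encoding - an assumption on my part that it fits here,
and it requires Avro 1.8+. BinaryMessageEncoder prefixes each record with a
two-byte marker and the CRC-64-AVRO fingerprint of the schema, rather than
the full schema text:

import java.nio.ByteBuffer;
import org.apache.avro.message.BinaryMessageEncoder;
import org.apache.avro.specific.SpecificData;

// Single-object encoding: marker + schema fingerprint + binary record body.
BinaryMessageEncoder<User> messageEncoder =
        new BinaryMessageEncoder<>(SpecificData.get(), User.getClassSchema());
ByteBuffer buffer = messageEncoder.encode(user);  // the encoder is reusable across records
byte[] selfDescribingBytes = new byte[buffer.remaining()];
buffer.get(selfDescribingBytes);

The reading side (BinaryMessageDecoder) resolves the fingerprint back to a
schema, so both ends still need access to the schema text somewhere; if the
full schema really has to be embedded in every byte[], the DataFileWriter
container format discussed below is the standard way to get that.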



On Wed, May 17, 2017 at 10:33 PM, Nishanth S <nishanth.2...@gmail.com>
wrote:

> Why do you need multiple instances of DataFileWriter? Are you writing to
> different files or paths? If not, you just need to instantiate the file
> writer once, keep appending, and do a close.
>
>
> On May 18, 2017 4:12 AM, "Svetlana Shnitser" <svetashnit...@gmail.com>
> wrote:
>
> Hello,
>
> We are attempting to use DataFileWriter to generate Avro content, write it
> to a byte[], and subsequently process it. While each chunk of Avro data is
> small, we are generating about 5 million of them. Here's the code we are
> using:
>
> DatumWriter<HfReadData> writer =
>     new SpecificDatumWriter<>(HfReadData.getClassSchema());
> DataFileWriter<HfReadData> dataFileWriter = new DataFileWriter<HfReadData>(writer);
> dataFileWriter.setCodec(CodecFactory.deflateCodec(9));
> dataFileWriter.create(HfReadData.getClassSchema(), byteStream);
> dataFileWriter.append(hfReadData);
> dataFileWriter.close();
>
>
> byte[] messageBytes = byteStream.toByteArray();
> byteStream.close();
>
> // further processing of messageBytes
>
> ...
>
>
>
> Unfortunately, when run with 5 million data points, we noticed a big spike
> in heap usage, and the profiler points at numerous instances of
> DataFileWriter.buffer from the line below:
>
> this.buffer = new DataFileWriter.NonCopyingByteArrayOutputStream(
>     Math.min((int)((double)this.syncInterval * 1.25D), 1073741822));
>
> This output stream doesn't seem to be closed on DataFileWriter.close().
>
> Are we using DataFileWriter in a way that it was not intended to be used?
> Is there an assumption that there won't be numerous instances of
> DataFileWriter created, but instead one can be used (with appropriate
> syncInterval and flush() calls) to generate multiple chunks of Avro data?
> Please advise!
>
>
> Thanks!
>
> -- Svetlana Shnitser
>
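
Following up on Nishanth's point above: if one combined container is
acceptable (an assumption - the requirement stated at the top is one byte[]
per piece of data), the intended pattern is a single DataFileWriter, created
once and appended to repeatedly. A minimal sketch, reusing the names from
the original post (records is hypothetical and stands for the ~5M data
points):

DatumWriter<HfReadData> writer = new SpecificDatumWriter<>(HfReadData.getClassSchema());
try (DataFileWriter<HfReadData> dataFileWriter = new DataFileWriter<>(writer)) {
    dataFileWriter.setCodec(CodecFactory.deflateCodec(9));
    dataFileWriter.create(HfReadData.getClassSchema(), byteStream);
    for (HfReadData record : records) {   // hypothetical collection of the ~5M records
        dataFileWriter.append(record);    // blocks are flushed as syncInterval is reached
    }
}                                         // close() writes the final block; one buffer total
byte[] allBytes = byteStream.toByteArray();  // one container, schema written once

This avoids allocating a fresh internal buffer per record, since only one
writer (and therefore one buffer) ever exists.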
