For each piece of data we process, we need to convert it to {avro data + schema}, except that instead of writing the results to a file, we need each piece of data to end up in a byte[].
Logically, the equivalent would be writing each piece of data to a file and then reading it back in as a byte array for further processing - but that's obviously inefficient. I see examples of writing to a byte array without using DataFileWriter (see below), but the result doesn't include the schema.

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    DatumWriter<User> writer = new SpecificDatumWriter<User>(User.getClassSchema());
    writer.write(user, encoder);
    encoder.flush();
    out.close();
    byte[] serializedBytes = out.toByteArray();

On Wed, May 17, 2017 at 10:33 PM, Nishanth S <nishanth.2...@gmail.com> wrote:
> Why do you need multiple instances of DataFileWriter? Are you writing to
> different files or paths? Otherwise you just need to instantiate the file
> writer once, keep appending, and do a close.
>
> On May 18, 2017 4:12 AM, "Svetlana Shnitser" <svetashnit...@gmail.com>
> wrote:
>
> Hello,
>
> We are attempting to use DataFileWriter to generate avro content, write it
> to a byte[] and subsequently process it. While each chunk of avro data is
> small, we are generating about 5M of those. Here's the code we are using:
>
>     DatumWriter<HfReadData> writer =
>         new SpecificDatumWriter<>(HfReadData.getClassSchema());
>     DataFileWriter<HfReadData> dataFileWriter =
>         new DataFileWriter<HfReadData>(writer);
>     dataFileWriter.setCodec(CodecFactory.deflateCodec(9));
>     dataFileWriter.create(HfReadData.getClassSchema(), byteStream);
>     dataFileWriter.append(hfReadData);
>     dataFileWriter.close();
>
>     byte[] messageBytes = byteStream.toByteArray();
>     byteStream.close();
>
>     // further processing of messageBytes
>
> ...
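If the goal is a compact per-record byte[] that still identifies its schema, Avro's single-object encoding (the org.apache.avro.message package, available since Avro 1.8) may be closer to what you want than the container-file format: each encoded payload is prefixed with a marker and an 8-byte schema fingerprint rather than the full schema, and the encoder instance is reusable across records. A minimal sketch with a generic record - the schema name and fields here are made up for illustration, not taken from your HfReadData class:

```java
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.message.BinaryMessageDecoder;
import org.apache.avro.message.BinaryMessageEncoder;

public class SingleObjectDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema standing in for HfReadData.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"HfReadData\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"value\",\"type\":\"double\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", 42L);
        rec.put("value", 3.14);

        // The encoder is reusable; each encode() yields a self-describing
        // payload: 2-byte marker + 8-byte schema fingerprint + binary datum.
        BinaryMessageEncoder<GenericRecord> encoder =
            new BinaryMessageEncoder<>(GenericData.get(), schema);
        ByteBuffer buf = encoder.encode(rec);
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);

        // Decoding needs the schema (or a SchemaStore keyed by fingerprint).
        BinaryMessageDecoder<GenericRecord> decoder =
            new BinaryMessageDecoder<>(GenericData.get(), schema);
        GenericRecord back = decoder.decode(ByteBuffer.wrap(payload));
        System.out.println(back.get("id") + " " + back.get("value"));
    }
}
```

The trade-off versus the container file is that the reader must resolve the fingerprint back to a schema (e.g. via a SchemaStore), whereas the DataFileWriter output carries the full schema in its header.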
>
> Unfortunately, when run with 5M data points, we noticed a big spike in
> heap usage, and the profiler points at numerous instances of
> DataFileWriter.buffer from the line below:
>
>     this.buffer = new DataFileWriter.NonCopyingByteArrayOutputStream(
>         Math.min((int)((double)this.syncInterval * 1.25D), 1073741822));
>
> This output stream doesn't seem to be closed on DataFileWriter.close().
>
> Are we using DataFileWriter in a way that it was not intended to be used?
> Is there an assumption that there won't be numerous instances of
> DataFileWriter created, but instead one can be used (with appropriate
> syncInterval and flush() calls) to generate multiple chunks of avro data?
> Please advise!
>
> Thanks!
>
> -- Svetlana Shnitser
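On Nishanth's point upthread: if the 5M records can land in one container stream rather than 5M separate byte[]s, a single DataFileWriter amortizes both the per-writer buffer and the per-file schema header, since the schema is written once in the header and appends are buffered and flushed every syncInterval. A sketch of that pattern, again with a made-up schema standing in for HfReadData:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SingleWriterDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema standing in for HfReadData.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"HfReadData\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");

        ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.deflateCodec(9));
            writer.create(schema, byteStream);   // schema written once, in the header

            for (long i = 0; i < 5; i++) {       // ~5M appends in the real case
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("id", i);
                writer.append(rec);              // buffered; a block is flushed
            }                                    // whenever syncInterval is reached
        }                                        // close() flushes the final block

        byte[] messageBytes = byteStream.toByteArray();
        System.out.println(messageBytes.length);
    }
}
```

This only fits if one container holding all records is acceptable downstream; if each record truly must travel as its own self-contained byte[], the single-object encoding above avoids paying the DataFileWriter buffer per record.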