Hi Yibing,

Thank you for your reply.
I'm starting to think Avro may not be right for our needs. It looks like
Avro isn't designed to handle one very large Datum in a file, so I should
probably break the data into many smaller pieces. But that seems to limit
the diversity of data an Avro file can hold; I guess that means several
Avro files together could represent a composite data set. Also, Avro isn't
designed for random access, and I had hoped to have a parallel caching
system that caches pieces as they are requested. I know I could probably
store indices in another file for use with seek(), but that doesn't seem
great.

Thank you,

Terry

On Mon, Nov 21, 2016 at 9:33 PM, Yibing Shi <y...@cloudera.com> wrote:
>
> On Sun, Nov 20, 2016 at 3:11 PM, Terry Casstevens <tcasstev...@gmail.com>
> wrote:
>
>> I tried a combination of the two techniques above, which wrote the Schema
>> with DataFileWriter and then used BlockingBinaryEncoder to write the
>> Datums. But upon reading the file I get "Invalid Sync!"
>>
>
> DataFileWriter maintains a buffer internally and syncs it to the file
> periodically. This might be why you got the "Invalid Sync!" error. Have
> you tried calling DataFileWriter.sync() before switching to writing with
> BlockingBinaryEncoder?
>
>> Seems like what I need is a way to pass DataFileWriter a
>> BlockingBinaryEncoder for it to use. Because it is automatically
>> using a BinaryEncoder. And the API has no way to pass it a different
>> one.
>
> I agree that the current API doesn't allow this. Could you create a JIRA
> to track this requirement?
>
>
> *Yibing Shi*
> *Customer Operations Engineer*
> <http://www.cloudera.com>
>
> On Sun, Nov 20, 2016 at 3:11 PM, Terry Casstevens <tcasstev...@gmail.com>
> wrote:
>
>> Dear Avro Community,
>>
>> I'm having problems writing large Datums to an Avro file. Can someone
>> please advise?
>>
>> Normally what is done:
>> - Create a DatumWriter
>> - Create a DataFileWriter(DatumWriter)
>> - Open the file with DataFileWriter.create(Schema, File)
>> - When the file is opened, the Schema is written to the file.
>> - Then you can call DataFileWriter.append(Datum) many times.
>>
>> The problem is that DataFileWriter.append() doesn't handle very large
>> Datums.
>>
>> Apparently the solution is to use a BlockingBinaryEncoder, which does
>> solve the OutOfMemoryError:
>> - Create a DatumWriter
>> - Create an OutputStream -> File
>> - Create EncoderFactory.blockingBinaryEncoder(OutputStream)
>> - DatumWriter.write(Datum, BlockingBinaryEncoder)
>>
>> But the BlockingBinaryEncoder solution doesn't write the Schema to the
>> beginning of the file:
>> - This makes it not work with DataFileReader.
>> - Plus these Schemas differ from file to file, so the Schema needs to
>>   be there.
>>
>> I tried a combination of the two techniques above, which wrote the Schema
>> with DataFileWriter and then used BlockingBinaryEncoder to write the
>> Datums. But upon reading the file I get "Invalid Sync!"
>>
>> Seems like what I need is a way to pass DataFileWriter a
>> BlockingBinaryEncoder for it to use. Because it is automatically
>> using a BinaryEncoder. And the API has no way to pass it a different
>> one.
>>
>>
>> Thank you,
>>
>> Terry
>>
>
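[Editor's note: the normal DataFileWriter flow quoted above can be sketched as follows. This is a minimal illustration, not the poster's actual code; it assumes Apache Avro is on the classpath, and the "Example" schema, the "name" field, and the file name are all made up for the sketch.]

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class NormalWrite {
    public static void main(String[] args) throws Exception {
        // Illustrative schema: a record with a single string field.
        Schema schema = SchemaBuilder.record("Example").fields()
                .requiredString("name")
                .endRecord();

        GenericDatumWriter<GenericRecord> datumWriter =
                new GenericDatumWriter<>(schema);

        try (DataFileWriter<GenericRecord> fileWriter =
                     new DataFileWriter<>(datumWriter)) {
            // create() writes the container-file header, including the
            // Schema, before any datums are appended.
            fileWriter.create(schema, new File("example.avro"));

            GenericRecord record = new GenericData.Record(schema);
            record.put("name", "hello");
            // append() serializes the whole datum through an internal
            // buffer, which is what runs out of memory for very large
            // datums, as described in the thread.
            fileWriter.append(record);
        }
    }
}
```

Because the schema and sync markers are in the file, the result can be read back with a plain DataFileReader.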
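[Editor's note: the BlockingBinaryEncoder path quoted above can be sketched like this. Again a minimal illustration under the same assumptions (Avro on the classpath; schema, field, and file name invented for the sketch). It shows the trade-off the thread is about: the encoder avoids buffering a whole datum, but the output has no container header.]

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class BlockingWrite {
    public static void main(String[] args) throws Exception {
        // Same illustrative schema as in the normal flow.
        Schema schema = SchemaBuilder.record("Example").fields()
                .requiredString("name")
                .endRecord();

        GenericDatumWriter<GenericRecord> datumWriter =
                new GenericDatumWriter<>(schema);

        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "hello");

        try (OutputStream out = new FileOutputStream("blocked.bin")) {
            // The blocking encoder writes large arrays and maps in blocks
            // rather than buffering an entire datum in memory, which is
            // how it avoids the OutOfMemoryError.
            BinaryEncoder encoder = EncoderFactory.get()
                    .blockingBinaryEncoder(out, null);
            datumWriter.write(record, encoder);
            encoder.flush();
            // Note: "blocked.bin" contains raw binary-encoded data only --
            // no schema, no sync markers -- so DataFileReader cannot open
            // it. A reader must already know the schema, which is exactly
            // the limitation raised in the thread.
        }
    }
}
```

Reading it back therefore requires supplying the writer's schema out of band and using a plain binary decoder instead of DataFileReader.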