Hi Yibing,

Thank you for your reply.

I'm starting to think Avro may not be right for our needs.

Looks like Avro isn't designed to handle one very large Datum in a file.
So I should probably break the data into many smaller pieces, as in the
sketch below.  But that seems to limit the diversity of data that an Avro
file can hold.  I guess that means several Avro files together could
represent a composite data set.
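
For what it's worth, here is roughly what I mean by smaller pieces (an
untested sketch; the Chunk schema, numChunks, and nextChunkBytes() are
placeholders of mine):

    // imports: org.apache.avro.Schema, org.apache.avro.file.DataFileWriter,
    //          org.apache.avro.generic.*, java.io.File, java.nio.ByteBuffer
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Chunk\",\"fields\":["
        + "{\"name\":\"index\",\"type\":\"long\"},"
        + "{\"name\":\"data\",\"type\":\"bytes\"}]}");
    DataFileWriter<GenericRecord> fileWriter =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    fileWriter.create(schema, new File("chunks.avro"));
    for (long i = 0; i < numChunks; i++) {
        GenericRecord chunk = new GenericData.Record(schema);
        chunk.put("index", i);
        chunk.put("data", ByteBuffer.wrap(nextChunkBytes())); // placeholder
        fileWriter.append(chunk);   // one small Datum per chunk
    }
    fileWriter.close();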

Also, Avro isn't designed for random access, and I'd hoped to have a
parallel caching system that caches pieces as they are requested.  I know I
could probably store sync positions in a separate index file for use with
seek(), but that doesn't seem great.
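
If I did go the index route, I imagine it would look something like this
(untested sketch; pieceId is hypothetical, and fileWriter is the writer
from the sketch above):

    // Writing: remember the sync position that precedes each piece
    Map<String, Long> index = new HashMap<>(); // persisted separately
    long pos = fileWriter.sync();  // position usable with DataFileReader.seek()
    index.put(pieceId, pos);
    fileWriter.append(piece);

    // Reading: jump straight to a requested piece
    DataFileReader<GenericRecord> reader = new DataFileReader<>(
        new File("chunks.avro"), new GenericDatumReader<GenericRecord>());
    reader.seek(index.get(pieceId));
    GenericRecord requested = reader.next();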

Thank you,

Terry


On Mon, Nov 21, 2016 at 9:33 PM, Yibing Shi <y...@cloudera.com> wrote:

>
> On Sun, Nov 20, 2016 at 3:11 PM, Terry Casstevens <tcasstev...@gmail.com>
> wrote:
>
>> I tried a combination of the two techniques above, which wrote the Schema
>> with DataFileWriter and then used BlockingBinaryEncoder to write the
>> Datums.  But upon reading the file I get "Invalid Sync!"
>>
>
> DataFileWriter maintains a buffer inside and syncs it to the file
> periodically.  This might be why you got the "Invalid Sync!" error.  Have
> you tried calling DataFileWriter.sync() before switching to writing with
> the BlockingBinaryEncoder?
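>
> Something like this, perhaps (an untested sketch just to illustrate the
> idea; datumWriter, schema and largeDatum are assumed from your code):
>
>     OutputStream out = new FileOutputStream("data.avro");
>     DataFileWriter<GenericRecord> fileWriter =
>         new DataFileWriter<>(datumWriter);
>     fileWriter.create(schema, out); // writes the header and Schema
>     fileWriter.sync();              // end the current block, emit a sync marker
>     fileWriter.flush();             // push buffered bytes to the stream
>     BinaryEncoder encoder =
>         EncoderFactory.get().blockingBinaryEncoder(out, null);
>     datumWriter.write(largeDatum, encoder);
>     encoder.flush();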
>
>> Seems like what I need is a way to pass DataFileWriter a
>> BlockingBinaryEncoder for it to use, because it automatically uses a
>> BinaryEncoder, and the API has no way to pass it a different one.
>
> I agree that the current API doesn't allow this.  Could you create a JIRA
> to track this requirement?
>
>
> *Yibing Shi*
> *Customer Operations Engineer*
> <http://www.cloudera.com>
>
> On Sun, Nov 20, 2016 at 3:11 PM, Terry Casstevens <tcasstev...@gmail.com>
> wrote:
>
>> Dear Avro Community,
>>
>> I'm having problems writing large Datums to an Avro file.  Can someone
>> please advise?
>>
>> Normally, this is what is done (see the sketch after this list):
>> - Create a DatumWriter
>> - Create a DataFileWriter(DatumWriter)
>> - Open the file with DataFileWriter.create(Schema, File)
>> - When the file is opened, the Schema is written to the file.
>> - Then you can call DataFileWriter.append(Datum) many times.
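>>
>> For example (a minimal sketch with illustrative names):
>>
>>     GenericDatumWriter<GenericRecord> datumWriter =
>>         new GenericDatumWriter<>(schema);
>>     DataFileWriter<GenericRecord> fileWriter =
>>         new DataFileWriter<>(datumWriter);
>>     fileWriter.create(schema, new File("data.avro")); // writes the Schema
>>     fileWriter.append(record);  // repeat for each Datum
>>     fileWriter.close();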
>>
>> The problem is that DataFileWriter.append() doesn't handle a very large
>> Datum.
>>
>> Apparently the solution is to use a BlockingBinaryEncoder, which does
>> solve the OutOfMemoryError (see the sketch after this list):
>> - Create a DatumWriter
>> - Create an OutputStream -> File
>> - Create EncoderFactory.blockingBinaryEncoder(OutputStream)
>> - DatumWriter.write(Datum, BlockingBinaryEncoder)
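>>
>> Roughly like this (untested sketch; largeDatum is illustrative):
>>
>>     GenericDatumWriter<GenericRecord> datumWriter =
>>         new GenericDatumWriter<>(schema);
>>     OutputStream out = new FileOutputStream("data.bin");
>>     BinaryEncoder encoder =
>>         EncoderFactory.get().blockingBinaryEncoder(out, null);
>>     datumWriter.write(largeDatum, encoder); // spills blocks as it goes
>>     encoder.flush();
>>     out.close();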
>>
>> But that BlockingBinaryEncoder solution doesn't write the Schema to
>> the beginning of the file:
>> - This makes it not work with DataFileReader.
>> - Plus these Schemas differ from file to file, so the Schema needs to
>> be there.
>>
>> I tried a combination of the two techniques above, which wrote the Schema
>> with DataFileWriter and then used BlockingBinaryEncoder to write the
>> Datums.  But upon reading the file I get "Invalid Sync!"
>>
>> Seems like what I need is a way to pass DataFileWriter a
>> BlockingBinaryEncoder for it to use, because it automatically uses a
>> BinaryEncoder, and the API has no way to pass it a different one.
>>
>>
>> Thank you,
>>
>> Terry
>>
>
>
