[ https://issues.apache.org/jira/browse/AVRO-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976904#action_12976904 ]
Matt Massie commented on AVRO-724:
----------------------------------
Jeremy-
I'm familiar with the code for the C implementation since I wrote most of it.
If you look around line 277 of datafile.c, you'll find the file_write_block()
function. This function writes a "block" of datums to a datafile. From the
spec:
{quote}
A file data block consists of:
* A long indicating the count of objects in this block.
* A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied.
* The serialized objects. If a codec is specified, this is compressed by that codec.
* The file's 16-byte sync marker.
{quote}
Unfortunately, the longs in the block header are zig-zag encoded and variable
length (sigh), which makes it impossible to simply write the objects to disk
immediately and then seek back to the block count/size offsets and update them
later: you don't know how many bytes those two longs will occupy until you
know their values. To work around the file format, I just kept everything in
memory and then flushed the block to disk all at once.
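To make the problem concrete, here is a minimal sketch of that encoding (my own illustration of the Avro binary encoding of a long, not the library's internal helper):
{code}
#include <stdint.h>
#include <stdio.h>

/* Zig-zag encode a signed 64-bit value, then emit it as a
 * variable-length quantity: 7 bits per byte, high bit set on
 * all but the last byte. */
static int encode_long(uint8_t *buf, int64_t v)
{
	uint64_t n = ((uint64_t) v << 1) ^ (uint64_t) (v >> 63);
	int len = 0;
	while (n & ~0x7FULL) {
		buf[len++] = (uint8_t) ((n & 0x7F) | 0x80);
		n >>= 7;
	}
	buf[len++] = (uint8_t) n;
	return len;
}

int main(void)
{
	uint8_t buf[10];
	/* A block count of 1 encodes to 1 byte, but a block size of
	 * 100000 encodes to 3 bytes -- so there is no fixed offset
	 * to seek back to. */
	printf("len(1)      = %d\n", encode_long(buf, 1));
	printf("len(100000) = %d\n", encode_long(buf, 100000));
	return 0;
}
{code}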
The fastest workaround to the problem (assuming your objects are small enough
to fit in memory) would be to simply increase the size of the memory buffer.
The avro_file_writer_t in datafile.c looks like the following:
{code}
struct avro_file_writer_t_ {
	avro_schema_t writers_schema;
	avro_writer_t writer;
	char sync[16];
	int block_count;
	avro_writer_t datum_writer;
	char datum_buffer[16 * 1024];
};
{code}
This is where the 16K limitation comes from. Instead of having this static
buffer, you could redefine it as a 'char *' and allocate the memory from the
heap using malloc() in avro_file_writer_create().
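Something like the following (a sketch only; the block_size parameter is my own invention, the error handling is abbreviated, and you would also need to free the buffer when the writer is closed):
{code}
struct avro_file_writer_t_ {
	avro_schema_t writers_schema;
	avro_writer_t writer;
	char sync[16];
	int block_count;
	avro_writer_t datum_writer;
	char *datum_buffer;	/* was: char datum_buffer[16 * 1024]; */
	size_t datum_buffer_size;
};

/* ...then in avro_file_writer_create(), with a hypothetical
 * block_size argument (0 = keep the current 16K default)... */
w->datum_buffer_size = block_size ? block_size : 16 * 1024;
w->datum_buffer = malloc(w->datum_buffer_size);
if (!w->datum_buffer) {
	return ENOMEM;
}
w->datum_writer = avro_writer_memory(w->datum_buffer, w->datum_buffer_size);
if (!w->datum_writer) {
	free(w->datum_buffer);
	return ENOMEM;
}
{code}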
If your objects are too big for memory, an alternative (and better) solution
would be to use a temporary file-backed datum_writer in the avro_file_writer_t
instead of a memory-backed one. When you were later ready to incorporate the
objects into the datafile, you would (1) write the block count/size, (2) do a
byte copy of the temporary file to the datafile, (3) write the sync marker,
and (4) truncate the temporary file for use in the next block (see the sketch
below).
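Roughly, and only as a sketch: this is meant to live in datafile.c, assumes the datum_writer has been replaced by a temporary file from tmpfile(), and reuses the encode_long() helper from the sketch above; error handling is abbreviated, and ftruncate()/fileno() are POSIX.
{code}
#include <errno.h>
#include <stdio.h>
#include <unistd.h>	/* ftruncate(), fileno() */

static int flush_block_from_tmpfile(avro_file_writer_t w, FILE *tmp)
{
	uint8_t header[10];
	char buf[4096];
	size_t n;
	int len;
	long block_size = ftell(tmp);

	/* (1) write the variable-length block count and byte size */
	len = encode_long(header, w->block_count);
	avro_write(w->writer, header, len);
	len = encode_long(header, block_size);
	avro_write(w->writer, header, len);

	/* (2) byte-copy the temporary file into the datafile */
	rewind(tmp);
	while ((n = fread(buf, 1, sizeof(buf), tmp)) > 0) {
		avro_write(w->writer, buf, n);
	}

	/* (3) write the file's 16-byte sync marker */
	avro_write(w->writer, w->sync, sizeof(w->sync));

	/* (4) truncate the temporary file for use in the next block */
	rewind(tmp);
	if (ftruncate(fileno(tmp), 0) != 0) {
		return errno;
	}
	w->block_count = 0;
	return 0;
}
{code}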
The ideal solution would be to have fixed-length block header fields, but that
would require a change to the spec. I may open a discussion on the developers
mailing list soon regarding this limitation, but please feel free to do so
yourself if you like.
Good luck with writing a fix. I look forward to seeing it. It will be a nice
contribution to the implementation.
> C implementation does not write datum values that are larger than the memory
> write buffer (currently 16K)
> ---------------------------------------------------------------------------------------------------------
>
> Key: AVRO-724
> URL: https://issues.apache.org/jira/browse/AVRO-724
> Project: Avro
> Issue Type: Bug
> Components: c
> Affects Versions: 1.4.1
> Reporter: Jeremy Hinegardner
>
> The current C implementation does not allow for datum values greater than 16K.
> The {{avro_file_writer_append}} function flushes blocks to disk over time, but
> does not deal with the case of a single datum being larger than
> {{avro_file_writer_t.datum_buffer}}. This is noted in the source code:
> {code:title=datafile.c:294-313}
> int avro_file_writer_append(avro_file_writer_t w, avro_datum_t datum)
> {
> 	int rval;
> 	if (!w || !datum) {
> 		return EINVAL;
> 	}
> 	rval = avro_write_data(w->datum_writer, w->writers_schema, datum);
> 	if (rval) {
> 		check(rval, file_write_block(w));
> 		rval = avro_write_data(w->datum_writer, w->writers_schema, datum);
> 		if (rval) {
> 			/* TODO: if the datum encoder larger than our buffer,
> 			   just write a single large datum */
> 			return rval;
> 		}
> 	}
> 	w->block_count++;
> 	return 0;
> }
> {code}