[ 
https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804653#action_12804653
 ] 

Scott Carey commented on AVRO-160:
----------------------------------

bq. No, the spec currently says each block is prefixed by "a long indicating 
the count of objects in this block". This is as it was before, without a byte 
count. Byte counts are left to codec implementations on an as-needed basis.

While working through some use cases, I think each block should carry both the 
record count and the encoded size of the block in bytes.

Use cases:

* Concatenate two Avro files with the same schema (and codec).  To do this 
efficiently, one would want to simply copy the bytes in each block, and not 
decode any records at all.
* Convert the codec in a file (read file A with codec X and output file B with 
codec Y -- for example, to compress a file).  In this use case one wants access 
to the raw bytes in a block, but again decoding and re-encoding the records is 
a waste of time.

Several other use cases can take advantage of knowing the block size and avoid 
decoding and encoding records.

Without the size, one could scan for the sync marker to find the end of the 
block, but this is both much slower and unsafe.  A sync marker collision (as 
rare as that may be) can only be detected by validating the record count, which 
requires decoding the records.  With the size of the block in the format, use 
cases where the raw binary block is copied around are simpler and safer.
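
To make the concatenation case concrete, here is a rough sketch in Python.  It 
is not the spec: the block layout (record count, then byte size, then payload, 
then a 16-byte sync marker) is the layout proposed above, and the names 
write_block and copy_blocks are made up for illustration.  File headers 
(schema, metadata) are left out entirely.  Longs use Avro's zig-zag 
variable-length encoding.

```python
import io

SYNC_SIZE = 16  # assumed sync marker length for this sketch


def write_long(out, n):
    """Write n as a zig-zag, variable-length long."""
    n = (n << 1) ^ (n >> 63)
    while n & ~0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def read_long(inp, allow_eof=False):
    """Read a zig-zag long; return None on clean EOF if allow_eof is set."""
    shift = 0
    result = 0
    first = True
    while True:
        b = inp.read(1)
        if not b:
            if first and allow_eof:
                return None
            raise EOFError("truncated long")
        first = False
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def write_block(out, count, payload, sync):
    """Hypothetical block layout: count, byte size, payload, sync marker."""
    write_long(out, count)
    write_long(out, len(payload))
    out.write(payload)
    out.write(sync)


def copy_blocks(src, dst, src_sync, dst_sync):
    """Copy all blocks from src to dst without decoding any record.

    The payload is read by its declared size, so sync-marker bytes that
    happen to occur inside the payload are harmless.  Returns the total
    number of records copied.
    """
    total = 0
    while True:
        count = read_long(src, allow_eof=True)
        if count is None:
            return total
        size = read_long(src)
        payload = src.read(size)
        if len(payload) != size or src.read(SYNC_SIZE) != src_sync:
            raise ValueError("corrupt block")
        write_block(dst, count, payload, dst_sync)
        total += count
```

Concatenating two files is then two calls to copy_blocks against the same 
destination, rewriting each source's sync marker to the destination's.  The 
record bytes are never interpreted.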

Furthermore, having the length of the block might allow the Codec interface to 
take just the (byte[], offset, length) of the block rather than an 
Input/Output stream, which would improve performance.
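
As a sketch of what a buffer-based codec interface could look like, and how it 
would serve the codec conversion use case, consider the Python below.  The 
class and function names (NullCodec, DeflateCodec, transcode_block) are 
invented for this example, and zlib merely stands in as an example compressor; 
none of this is an existing Avro API.

```python
import zlib


class NullCodec:
    """Hypothetical pass-through codec operating on a buffer slice."""

    def compress(self, buf, offset, length):
        return bytes(memoryview(buf)[offset:offset + length])

    def decompress(self, buf, offset, length):
        return bytes(memoryview(buf)[offset:offset + length])


class DeflateCodec:
    """Hypothetical compressing codec; zlib is only a stand-in here."""

    def compress(self, buf, offset, length):
        return zlib.compress(memoryview(buf)[offset:offset + length])

    def decompress(self, buf, offset, length):
        return zlib.decompress(memoryview(buf)[offset:offset + length])


def transcode_block(buf, offset, length, src_codec, dst_codec):
    """Re-encode one block's bytes from src_codec to dst_codec.

    The records are never decoded, so the block's record count can be
    carried over unchanged; only the byte size needs to be rewritten.
    """
    raw = src_codec.decompress(buf, offset, length)
    return dst_codec.compress(raw, 0, len(raw))
```

Because the caller knows the block's byte size up front, it can hand the codec 
a slice of a buffer it already holds instead of wrapping the data in a stream.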

A byte count of the uncompressed size should be left to the codec.

Thoughts?

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-160-python.patch, AVRO-160.patch, AVRO-160.patch, 
> AVRO-160.patch, AVRO-160.patch
>
>
> It should be possible to stream through an Avro data file without seeking to 
> the end.
> Currently the interpretation is that schemas written to the file apply to all 
> entries before them.  If this were changed so that they instead apply to all 
> entries that follow, and the initial schema is written at the start of the 
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, 
> if it is a union, to add new branches at the end of that union.  If it is not 
> a union, no changes may be made.  So it is still the case that the final 
> schema in a file can read every entry in the file and thus may be used to 
> randomly access the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
