[ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769519#action_12769519 ]
Doug Cutting commented on AVRO-160:
-----------------------------------

> At what size does the metadata block overhead represent larger overhead than
> storing extra information per tuple like Thrift/Protobuf?

That's hard to calculate, but I think we're a ways from that, especially if we
write the schema in binary.

> At what size does compression become less effective?

I think we've found that beyond ~64k the compression ratio typically does not
improve significantly.

> Are larger blocks better for streaming read/write performance?

In, e.g., mapreduce, we stream through a series of blocks, so we're still
sequentially accessing ~64MB chunks, regardless of the compression-block size.

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to
> the end.
> Currently the interpretation is that schemas written to the file apply to all
> entries before them. If this were changed so that they instead applied to all
> entries that follow, and the initial schema were written at the start of the
> file, then streaming could be supported.
> Note that the only change permitted to a schema while a file is being written
> is, if the schema is a union, to add new branches at the end of that union.
> If it is not a union, no changes may be made. So it is still the case that
> the final schema in a file can read every entry in the file and thus may be
> used to randomly access the file.
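[Editor's note] To make the streaming proposal concrete, here is a minimal
sketch using the Avro Java API as it exists in later releases, which adopted
this header-first layout: DataFileStream reads the writer's schema from the
front of the stream and then iterates forward, never seeking. The file path
argument is illustrative; this is not code from the original thread.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class StreamRecords {
      public static void main(String[] args) throws Exception {
        // A forward-only stream: no seek to the end of the file is needed,
        // because the writer schema is stored in the file header.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
             DataFileStream<GenericRecord> stream =
                 new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
          System.out.println("writer schema: " + stream.getSchema());
          for (GenericRecord record : stream) {
            System.out.println(record);
          }
        }
      }
    }

Because the schema precedes the data, the same loop works over a pipe or
socket, not just a seekable file.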
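The append-only union rule can be illustrated with plain Avro schema JSON; the
sketch below (assuming the Schema.Parser API from later Avro releases, purely
for illustration) shows the one legal mid-file evolution. Union branches are
encoded by index, so appending a branch leaves all existing indices stable,
which is why the final schema can still read every earlier entry.

    import org.apache.avro.Schema;

    public class UnionEvolution {
      public static void main(String[] args) {
        // Schema in effect at the start of the file: a union of two branches.
        Schema initial = new Schema.Parser().parse(
            "[\"string\", \"int\"]");

        // The only permitted mid-file change: a new branch appended at the
        // end. Branch indices 0 and 1 are unchanged, so entries written
        // under the initial schema still decode under the extended one.
        Schema extended = new Schema.Parser().parse(
            "[\"string\", \"int\", \"double\"]");

        System.out.println(initial);
        System.out.println(extended);
      }
    }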
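The ~64k observation is easy to spot-check outside Avro. The sketch below is a
self-contained microbenchmark, not Avro code; the input generator and the size
range are invented for illustration. It deflates the same buffer at increasing
block sizes; since deflate uses a 32 KB window, the ratio on inputs like this
flattens out well before 1 MB, though the exact crossover depends on the codec
and the data.

    import java.util.zip.Deflater;

    public class BlockSizeRatio {
      public static void main(String[] args) {
        byte[] data = sampleData(1 << 22); // 4 MB of mildly repetitive input
        for (int block = 1 << 12; block <= 1 << 20; block <<= 1) {
          long compressed = 0;
          // Compress the buffer in independent blocks of the given size,
          // mimicking per-block compression in a container file.
          for (int off = 0; off < data.length; off += block) {
            int len = Math.min(block, data.length - off);
            Deflater d = new Deflater();
            d.setInput(data, off, len);
            d.finish();
            byte[] out = new byte[len + 64];
            while (!d.finished()) {
              compressed += d.deflate(out);
            }
            d.end();
          }
          System.out.printf("block=%7d  ratio=%.3f%n",
              block, (double) compressed / data.length);
        }
      }

      // Hypothetical test input: a repeated short phrase, so it compresses.
      private static byte[] sampleData(int n) {
        byte[] b = new byte[n];
        for (int i = 0; i < n; i++) {
          b[i] = (byte) ("avro block size test ".charAt(i % 21));
        }
        return b;
      }
    }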