[ 
https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805252#action_12805252
 ] 

Philip Zeyliger commented on AVRO-160:
--------------------------------------

Hi Scott,

I don't have strong opinions either way on storing another length in there.  To 
be clear, I think you mean "size in bytes (after codec compression) of the 
block".  "encoded" might mean Avro-encoded, which isn't what you mean, I think.

For the re-compression use case, codecs need to know when the stream ends 
anyway, so I'm not sure having the length is a big win.  Though most codecs 
will operate on (byte[], offset, length), I would like to leave the door open 
for codecs operating at the encoder/decoder level (instead of the byte[] 
level), because they might be able to do more clever things (like columnar 
storage).
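
To make the (byte[], offset, length) shape concrete, here is a minimal 
sketch in Python.  The class and method names are made up for illustration 
and are not Avro's codec API; zlib stands in for whatever codec is chosen.

```python
import zlib


class DeflateCodec:
    """Hypothetical byte-level codec: works on a slice of a caller-owned buffer."""

    def compress(self, buf, offset, length):
        # memoryview avoids copying the slice before handing it to zlib.
        return zlib.compress(memoryview(buf)[offset:offset + length])

    def decompress(self, buf, offset, length):
        return zlib.decompress(memoryview(buf)[offset:offset + length])
```

A codec working at the encoder/decoder level would need a different 
interface, since it would see individual values rather than one opaque 
byte range.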

Another use case for having the block length is being able to do parallel 
de-compression at the framework level rather than the codec level.  You can 
read several blocks into memory and then start threads to decompress them.  
That is hard to do if you rely on the codec to tell you where the block 
boundaries are.
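
A rough sketch of what that could look like, in Python.  The framing here 
(a 4-byte big-endian compressed size before each block) and all function 
names are assumptions for illustration only, not the Avro file format; zlib 
is a stand-in codec.

```python
import io
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def write_blocks(payloads):
    """Frame each payload as: 4-byte compressed size, then the compressed bytes."""
    out = io.BytesIO()
    for payload in payloads:
        compressed = zlib.compress(payload)
        out.write(struct.pack(">I", len(compressed)))
        out.write(compressed)
    return out.getvalue()


def read_raw_blocks(stream):
    """Find block boundaries from the size prefix alone, without the codec."""
    blocks = []
    while True:
        header = stream.read(4)
        if not header:
            break  # clean end of stream
        if len(header) != 4:
            raise ValueError("truncated block header")
        (size,) = struct.unpack(">I", header)
        block = stream.read(size)
        if len(block) != size:
            raise ValueError("truncated block")
        blocks.append(block)
    return blocks


def decompress_parallel(blocks, workers=4):
    """Decompress already-split blocks on a thread pool, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.decompress, blocks))
```

The reader never asks the codec where a block ends, so the split and the 
decompression can be scheduled independently.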

-- Philip

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-160-python.patch, AVRO-160.patch, AVRO-160.patch, 
> AVRO-160.patch, AVRO-160.patch
>
>
> It should be possible to stream through an Avro data file without seeking to 
> the end.
> Currently the interpretation is that schemas written to the file apply to all 
> entries before them.  If this were changed so that they instead apply to all 
> entries that follow, and the initial schema is written at the start of the 
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, 
> if it is a union, to add new branches at the end of that union.  If it is not 
> a union, no changes may be made.  So it is still the case that the final 
> schema in a file can read every entry in the file and thus may be used to 
> randomly access the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
