[ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805252#action_12805252 ]
Philip Zeyliger commented on AVRO-160:
--------------------------------------
Hi Scott,
I don't have strong opinions either way on storing another length in there. To
be clear, I think you mean "size in bytes (after codec compression) of the
block". "encoded" might mean Avro-encoded, which isn't what you mean, I think.
For the re-compression use case, codecs need to know when the stream ends
anyway, so I'm not sure there's a big win in having the length. Though most
codecs will operate on (byte[], offset, length), I would like to leave the door
open for codecs operating at the encoder/decoder level (instead of the byte[]
level), because they might be able to do more clever things (like columnar
storage).
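
To illustrate the point that a byte-level codec can find the end of its own
stream, here is a minimal Python sketch using zlib, chosen only because its
streams are self-delimiting. The helper name split_concatenated_streams is
hypothetical and this is not Avro's codec interface; it only shows
back-to-back compressed blocks being separated with no stored length.

```python
import zlib


def split_concatenated_streams(data: bytes) -> list[bytes]:
    """Decompress back-to-back zlib streams that carry no length prefix.

    Hypothetical helper for illustration only; not part of Avro.
    """
    blocks = []
    while data:
        d = zlib.decompressobj()
        # The decompressor stops at the end of the first stream it sees.
        blocks.append(d.decompress(data) + d.flush())
        if not d.eof:
            raise ValueError("truncated zlib stream")
        # Bytes after the end of that stream belong to the next block.
        data = d.unused_data
    return blocks


payload = zlib.compress(b"block one") + zlib.compress(b"block two")
print(split_concatenated_streams(payload))  # [b'block one', b'block two']
```

Note that the boundaries are only discovered by decompressing each block in
turn, so the codec, not the framework, decides where a block ends.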
Another use case for having the block length is being able to do parallel
decompression at the framework, rather than codec, level. You can read
several blocks into memory, and then start threads to decompress or what have
you. That is hard to do if you rely on the codec to tell you where the
boundaries are.
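
A rough Python sketch of that framework-level approach follows. It assumes a
made-up framing in which each block is preceded by a 4-byte big-endian count
of its compressed bytes; this is not the actual Avro file format, and the
function names write_blocks, read_compressed_blocks and decompress_parallel
are hypothetical. zlib stands in for whatever codec is configured.

```python
import io
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def write_blocks(blocks):
    """Write each block as: 4-byte compressed size, then compressed bytes."""
    out = io.BytesIO()
    for raw in blocks:
        compressed = zlib.compress(raw)
        out.write(struct.pack(">I", len(compressed)))  # assumed framing
        out.write(compressed)
    return out.getvalue()


def read_compressed_blocks(stream):
    """Yield still-compressed blocks using only the stored sizes."""
    while True:
        header = stream.read(4)
        if not header:
            return
        if len(header) < 4:
            raise ValueError("truncated block header")
        (size,) = struct.unpack(">I", header)
        body = stream.read(size)
        if len(body) != size:
            raise ValueError("truncated block")
        yield body


def decompress_parallel(data, workers=4):
    """Split on stored sizes first, then decompress blocks in a thread pool."""
    compressed = list(read_compressed_blocks(io.BytesIO(data)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the original block order.
        return list(pool.map(zlib.decompress, compressed))
```

The reader never calls the codec to find a boundary, so the splitting and
the decompression can be scheduled independently.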
-- Philip
> file format should be friendly to streaming
> -------------------------------------------
>
> Key: AVRO-160
> URL: https://issues.apache.org/jira/browse/AVRO-160
> Project: Avro
> Issue Type: Improvement
> Components: spec
> Reporter: Doug Cutting
> Assignee: Doug Cutting
> Fix For: 1.3.0
>
> Attachments: AVRO-160-python.patch, AVRO-160.patch, AVRO-160.patch,
> AVRO-160.patch, AVRO-160.patch
>
>
> It should be possible to stream through an Avro data file without seeking to
> the end.
> Currently the interpretation is that schemas written to the file apply to all
> entries before them. If this were changed so that they instead apply to all
> entries that follow, and the initial schema is written at the start of the
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is, if
> it is a union, to add new branches at the end of that union. If it is not
> a union, no changes may be made. So it is still the case that the final
> schema in a file can read every entry in the file and thus may be used to
> randomly access the file.
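
The schema rule in the description above can be sketched as a small Python
check. It assumes schemas are available as parsed JSON, where a union is a
JSON array, and it compares branches structurally, which may be stricter
than real schema equality. The function name is_allowed_schema_change is
hypothetical and not part of any Avro API.

```python
import json


def is_allowed_schema_change(old, new):
    """Return True if `new` may replace `old` partway through a file.

    Assumption: `old` and `new` are parsed JSON schemas, so a union
    shows up as a Python list.
    """
    if isinstance(old, list):
        # A union may only gain branches at the end: the old branches
        # must remain, unchanged and in order, as a prefix of the new.
        return isinstance(new, list) and new[:len(old)] == old
    # A non-union schema may not change at all.
    return new == old


old = json.loads('["null", "string"]')
print(is_allowed_schema_change(old, json.loads('["null", "string", "long"]')))  # True
print(is_allowed_schema_change(old, json.loads('["string", "null", "long"]')))  # False
```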