[
https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805287#action_12805287
]
Scott Carey commented on AVRO-160:
----------------------------------
bq. To be clear, I think you mean "size in bytes (after codec compression) of
the block".
Yes.
bq. I would like to leave the door open for codecs operating on the
encoder/decoder level (instead of the byte[] level), because they might be able
to do more clever things (like columnar storage).
Isn't that more of a different encoder/decoder implementation than a codec?
Where do we draw that line? It seems like a fundamentally different layer. If
you wanted to do a columnar storage optimization, would you want to have:
codec: gzip
codec: fastlz
codec: columnar
codec: columnar-gzip
codec: columnar-fastlz?
I feel that the layer that does blind lossless compression or other work
(CRCs, etc.) on the binary data should have one API, and anything that is some
sort of schema-aware transform of the data should have another.
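A minimal sketch of what I mean by the byte-level layer, in Python. The names
ByteCodec, CODECS and with_crc32 are made up for illustration and are not part
of any Avro API. The point is only that every codec at this layer maps bytes to
bytes and never sees the schema, so wrappers like a CRC check compose with any
of them without needing names like "columnar-gzip".

```python
import zlib
from typing import Callable, NamedTuple


class ByteCodec(NamedTuple):
    """Hypothetical byte-level codec: a pair of bytes -> bytes functions."""
    compress: Callable[[bytes], bytes]
    decompress: Callable[[bytes], bytes]


# Schema-blind codecs, looked up by the name stored in the file metadata.
CODECS = {
    "null": ByteCodec(lambda b: b, lambda b: b),
    "deflate": ByteCodec(zlib.compress, zlib.decompress),
}


def with_crc32(inner: ByteCodec) -> ByteCodec:
    """Wrap any byte-level codec with a CRC-32 of the uncompressed data."""

    def compress(data: bytes) -> bytes:
        # Prefix the compressed payload with a 4-byte big-endian checksum.
        return zlib.crc32(data).to_bytes(4, "big") + inner.compress(data)

    def decompress(blob: bytes) -> bytes:
        data = inner.decompress(blob[4:])
        if zlib.crc32(data).to_bytes(4, "big") != blob[:4]:
            raise ValueError("CRC mismatch")
        return data

    return ByteCodec(compress, decompress)
```

A schema-aware transform such as columnar storage would sit above this, at the
encoder/decoder level, and could hand its output to any of these codecs.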
Not all codecs are stream-based, and not all of them naturally define where
their stream ends the way gzip does. If we depend on the codec to define the
boundary of the block, we force every codec to implement that feature. The file
format already defines the block boundary markers, so why not also define the
block boundaries more explicitly? The drawbacks are a couple of extra bytes per
block (usually 2 or 3) and the requirement of knowing the size of the block
before writing, which is similar to the existing requirement of knowing the
record count before writing.
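Roughly, the framing I have in mind could look like the Python sketch below.
The function names, the 16-byte sync marker, and the exact field order are
assumptions for illustration, not what the spec says. Each block is written as
a record count, the size in bytes after codec compression, the compressed
bytes, and then the sync marker, with both integers in Avro's zig-zag
variable-length long encoding.

```python
import io
import zlib


def encode_long(n: int) -> bytes:
    """Zig-zag, variable-length encoding of a 64-bit long."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_long(inp) -> int:
    """Read one zig-zag variable-length long from a binary stream."""
    shift = 0
    result = 0
    while True:
        b = inp.read(1)
        if not b:
            raise EOFError("truncated long")
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def write_block(out, records, sync, compress=zlib.compress):
    """Write count, compressed size, compressed bytes, then the sync marker."""
    payload = compress(b"".join(records))
    out.write(encode_long(len(records)))
    out.write(encode_long(len(payload)))  # size is known before writing
    out.write(payload)
    out.write(sync)


def read_block(inp, sync, decompress=zlib.decompress):
    """Read one block; the codec is handed exactly `size` bytes."""
    count = decode_long(inp)
    size = decode_long(inp)
    payload = inp.read(size)
    if len(payload) != size:
        raise EOFError("truncated block")
    if inp.read(len(sync)) != sync:
        raise ValueError("sync marker mismatch")
    return count, decompress(payload)


def skip_block(inp, sync):
    """Skip a block without invoking the codec; returns its record count."""
    count = decode_long(inp)
    size = decode_long(inp)
    inp.seek(size + len(sync), io.SEEK_CUR)
    return count
```

With the size written up front, the codec never has to detect its own end of
stream, and a reader can skip a block without decompressing it. A 64 KB block
costs 3 bytes for the size field in this encoding.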
> file format should be friendly to streaming
> -------------------------------------------
>
> Key: AVRO-160
> URL: https://issues.apache.org/jira/browse/AVRO-160
> Project: Avro
> Issue Type: Improvement
> Components: spec
> Reporter: Doug Cutting
> Assignee: Doug Cutting
> Fix For: 1.3.0
>
> Attachments: AVRO-160-python.patch, AVRO-160.patch, AVRO-160.patch,
> AVRO-160.patch, AVRO-160.patch
>
>
> It should be possible to stream through an Avro data file without seeking to
> the end.
> Currently the interpretation is that schemas written to the file apply to all
> entries before them. If this were changed so that they instead apply to all
> entries that follow, and the initial schema is written at the start of the
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is,
> if it is a union, to add new branches at the end of that union. If it is not
> a union, no changes may be made. So it is still the case that the final
> schema in a file can read every entry in the file and thus may be used to
> randomly access the file.