[
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thiruvalluvan M. G. updated AVRO-806:
-------------------------------------
Attachment: avro-file-columnar.pdf
AVRO-806-v2.patch
Here is a patch which implements the same idea, more formally. Please see the
attached file, written by Raymie Stata, describing the approach we are
proposing. Some comments on the patch:
* The implementation basically works, though further optimizations are
possible. One piece of pending work is the optimization that avoids reading
columns that are not needed. At present traditional schema resolution works,
but if the reader's and writer's schemas are such that certain columns could be
skipped entirely, they are not skipped.
* The data file writer uses a trick whereby, in case of an exception, it
flushes everything up to the previous record. That trick won't work with
columnar storage, because data can be flushed only once per block. (I've
commented out the test case for that.)
* Raymie and I have designed the solution in layers. One layer is capable of
writing and reading the columnar store. Column-to-data-item assignment is a
separate layer: we provide a default that assigns each item to a separate
column up to a given depth, but users can supply their own custom column
assignments. Yet another layer is the ability to put the columns in a block. If
someone wants to use file-based columnar storage, they can do so easily on top
of the first two layers.
* For now, unions have a single column, irrespective of the number and type
of branches.
* I don't think extending the codec mechanism to support columnar storage
will work. Columnar storage is orthogonal to codecs: codecs are about storing
blocks with compression, while decoders decide how the contents should be
interpreted. I think the way to support columnar storage is to replace the
binary encoder/decoder with a columnar encoder/decoder, as I've demonstrated in
the data file reader and writer. In order to support both the binary and
columnar encoders/decoders, I pushed bytesBuffered() up to Encoder from
BinaryEncoder and isEnd() up to Decoder from BinaryDecoder. I don't think these
changes will break anything.
* The implementation is not complete: we have to let the user of the data
file writer choose columnar storage instead of binary storage. I've not
implemented that yet.
* One test in FileSpanStorage is failing, possibly because it assumes
something about the way the data file stores data using the binary encoder. I'm
not sure.
* If we make the data file writer/reader handle different encoders/decoders,
then the changes for columnar storage are reasonably well isolated.
* There are a couple of unrelated changes (in tools) that were required to
make the tests pass on my machine. Please ignore them.
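To illustrate the default column-assignment layer described above (each item
gets its own column up to a depth limit, deeper subtrees collapse into one
column), here is a rough, self-contained sketch. The Node type and method names
are hypothetical, for illustration only; they are not the patch's actual API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of depth-limited column assignment. A schema is
// modeled as a tree: a leaf has no children, a record has named children.
public class ColumnAssignmentSketch {
    public static class Node {
        final String name;
        final List<Node> children;
        public Node(String name, Node... kids) {
            this.name = name;
            this.children = Arrays.asList(kids);
        }
    }

    // Descend until maxDepth is exhausted; at a leaf or at the depth
    // limit, the whole subtree is assigned a single column.
    public static List<String> assignColumns(Node node, String prefix, int maxDepth) {
        List<String> cols = new ArrayList<>();
        String path = prefix.isEmpty() ? node.name : prefix + "." + node.name;
        if (node.children.isEmpty() || maxDepth == 0) {
            cols.add(path);  // one column for this leaf or collapsed subtree
        } else {
            for (Node child : node.children) {
                cols.addAll(assignColumns(child, path, maxDepth - 1));
            }
        }
        return cols;
    }
}
```

With a record `rec { id, addr { street, city } }`, depth 2 yields one column
per leaf (`rec.id`, `rec.addr.street`, `rec.addr.city`), while depth 1
collapses the nested record into a single `rec.addr` column, matching the
"separate column up to a depth" default described above.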
> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
> Assignee: Doug Cutting
> Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf
>
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira