[
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thiruvalluvan M. G. updated AVRO-806:
-------------------------------------
Attachment: avro-file-columnar.pdf
AVRO-806-v2.patch
Here is a patch which implements the same idea, more formally. Please see the
attached file, written by Raymie Stata, describing the approach we are
proposing. Some comments on the patch:
* The implementation basically works, though further optimizations are
possible. One piece of pending work is the optimization that avoids reading
columns that are not needed. At present traditional schema resolution works,
but if the reader's and writer's schemas are such that certain columns could be
skipped entirely, they are not skipped.
* The data file writer uses a trick whereby, in case of an exception, it
flushes everything up to the previous record. That trick won't work with
columnar storage, because data can be flushed only once per block. (I've
commented out the test case for that.)
* Raymie and I have designed the solution in layers. One layer is capable of
writing and reading the columnar store. Column-to-data-item assignment is a
separate layer: we provide a default that assigns each item to a separate
column up to a given depth, but users can supply their own custom column
assignments. Yet another layer is the ability to put the columns in a block. If
someone wants to use file-based columnar storage, they can do so easily on top
of the first two layers.
* For now, unions have a single column, irrespective of the number and type
of branches.
* I don't think extending the codec mechanism to support columnar storage
will work. Columnar storage is orthogonal to codecs: codecs are about storing
blocks with compression, while decoders decide how the contents should be
interpreted. I think the way to support columnar storage is to replace the
binary encoder/decoder with a columnar encoder/decoder, as I've demonstrated in
the data file reader and writer. In order to support both the binary and
columnar encoders/decoders, I pushed bytesBuffered() up to Encoder from
BinaryEncoder and isEnd() up to Decoder from BinaryDecoder. I don't think these
changes will break anything.
* The implementation is not complete: we have to let the user of the data
file writer choose columnar storage instead of binary storage. I've not
implemented that yet.
* One test in FileSpanStorage is failing, possibly because it assumes
something about the way the data file stores data using the binary encoder. I'm
not sure.
* If we make the data file writer/reader handle different encoders/decoders,
then the changes for columnar storage are reasonably well isolated.
* There are a couple of unrelated changes (in tools) that were required to
make the tests pass on my machine. Please ignore them.
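To illustrate the default column-assignment layer described above (each item
gets its own column up to a depth limit, deeper subtrees collapse into one
column), here is a rough, self-contained sketch. The Node type and method names
are hypothetical, for illustration only; they are not the patch's actual API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of depth-limited column assignment. A schema is
// modeled as a tree: a leaf has no children, a record has named children.
public class ColumnAssignmentSketch {
    public static class Node {
        final String name;
        final List<Node> children;
        public Node(String name, Node... kids) {
            this.name = name;
            this.children = Arrays.asList(kids);
        }
    }

    // Descend until maxDepth is exhausted; at a leaf or at the depth
    // limit, the whole subtree is assigned a single column.
    public static List<String> assignColumns(Node node, String prefix, int maxDepth) {
        List<String> cols = new ArrayList<>();
        String path = prefix.isEmpty() ? node.name : prefix + "." + node.name;
        if (node.children.isEmpty() || maxDepth == 0) {
            cols.add(path);  // one column for this leaf or collapsed subtree
        } else {
            for (Node child : node.children) {
                cols.addAll(assignColumns(child, path, maxDepth - 1));
            }
        }
        return cols;
    }
}
```

With a record `rec { id, addr { street, city } }`, depth 2 yields one column
per leaf (`rec.id`, `rec.addr.street`, `rec.addr.city`), while depth 1
collapses the nested record into a single `rec.addr` column, matching the
"separate column up to a depth" default described above.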
> add a column-major codec for data files
> ---------------------------------------
>
> Key: AVRO-806
> URL: https://issues.apache.org/jira/browse/AVRO-806
> Project: Avro
> Issue Type: New Feature
> Components: java, spec
> Reporter: Doug Cutting
> Assignee: Doug Cutting
> Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf
>
>
> Define a codec that, when a data file's schema is a record schema, writes
> blocks within the file in column-major order. This would permit better
> compression and also permit efficient skipping of fields that are not of
> interest.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira