[ 
https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069792#comment-13069792
 ] 

Doug Cutting commented on AVRO-806:
-----------------------------------

Yes, the CIF file format looks promising.  It's great to see all the benchmarks!

I wonder if the advantages of CIF could be had without a custom HDFS block 
placement strategy?  For example, one might pack the files of a split directory 
into a single file whose block size was set to the size of the file, forcing it 
into a single block.  This would guarantee locality for the columns of a split.

In other words, instead of groups of column-major records within a file ("block 
columnar" in Raymie's document) on one hand or a file-per-column on the other 
("file columnar"), we would have a single group per file.  Since splits might 
often be bigger than RAM, creating such a file would probably require two 
steps: writing a set of temporary local files, one per column, then appending 
these into the final output.  The file would have an index indicating where 
each column lies, and each column within the file would permit efficient 
skipping, in the style of CIF.
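
To make the two-step idea concrete, here is a rough Python sketch.  It is only 
an illustration, not the format in the attached patch: the function names 
(write_split, read_column), the length-prefixed string encoding of values, and 
the trailing index layout are all assumptions made for the example.  The size 
of the finished file is what one would then use as the HDFS block size when 
copying it into place.

```python
import os
import shutil
import struct
import tempfile


def write_split(records, columns, out_path):
    """Write dict records column-major into one file (hypothetical layout)."""
    tmp_dir = tempfile.mkdtemp()
    tmp_paths = [os.path.join(tmp_dir, "%d.col" % i) for i in range(len(columns))]

    # Step 1: stream every record once, appending each field to its own
    # temporary local file, so the whole split never has to fit in RAM.
    files = [open(p, "wb") for p in tmp_paths]
    try:
        for rec in records:
            for col, f in zip(columns, files):
                value = str(rec[col]).encode("utf-8")
                # The length prefix lets a reader skip a value without decoding it.
                f.write(struct.pack(">I", len(value)) + value)
    finally:
        for f in files:
            f.close()

    # Step 2: append the column files into the final output, noting where
    # each column starts and how long it is.
    index = []
    with open(out_path, "wb") as out:
        for col, path in zip(columns, tmp_paths):
            start = out.tell()
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
            index.append((col, start, out.tell() - start))
            os.remove(path)
        # Trailing index: (name, offset, length) per column, then a fixed
        # 12-byte footer giving the index offset and the column count.
        index_start = out.tell()
        for name, start, length in index:
            encoded = name.encode("utf-8")
            out.write(struct.pack(">H", len(encoded)) + encoded)
            out.write(struct.pack(">QQ", start, length))
        out.write(struct.pack(">QI", index_start, len(index)))
    os.rmdir(tmp_dir)
    return index


def read_column(path, column):
    """Read one column, seeking past all the others via the trailing index."""
    with open(path, "rb") as f:
        f.seek(-12, os.SEEK_END)
        index_start, count = struct.unpack(">QI", f.read(12))
        f.seek(index_start)
        for _ in range(count):
            (name_len,) = struct.unpack(">H", f.read(2))
            name = f.read(name_len).decode("utf-8")
            start, length = struct.unpack(">QQ", f.read(16))
            if name == column:
                break
        else:
            raise KeyError(column)
        f.seek(start)
        end = start + length
        values = []
        while f.tell() < end:
            (n,) = struct.unpack(">I", f.read(4))
            values.append(f.read(n).decode("utf-8"))
        return values
```

A reader interested in one field only touches the footer, the index, and that 
column's byte range; the other columns are never read.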

> add a column-major codec for data files
> ---------------------------------------
>
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-806-v2.patch, AVRO-806.patch, avro-file-columnar.pdf
>
>
> Define a codec that, when a data file's schema is a record schema, writes 
> blocks within the file in column-major order.  This would permit better 
> compression and also permit efficient skipping of fields that are not of 
> interest.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
