[ https://issues.apache.org/jira/browse/AVRO-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449265#comment-13449265 ]

Doug Cutting commented on AVRO-806:
-----------------------------------

Jakob, I think the more common case will be that fields with small values 
produce small columns, where seek time becomes significant.  Once seek time 
dominates, greater parallelism yields diminishing returns unless replication 
is also increased, which is unlikely.
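
To make the seek argument concrete, here is a rough back-of-envelope sketch; 
the 10 ms average seek and 100 MB/s sequential transfer rate are assumed, 
typical-disk figures rather than measurements:

/**
 * Illustrative sketch: reads of small column blocks are dominated by seek
 * time rather than transfer time.  Seek latency and transfer rate below are
 * assumed values, not measured ones.
 */
public class SeekCostSketch {
    static final double SEEK_MS = 10.0;            // assumed average disk seek
    static final double TRANSFER_MB_PER_S = 100.0; // assumed sequential transfer rate

    static double readTimeMs(double columnBytes) {
        double transferMs = columnBytes / (TRANSFER_MB_PER_S * 1024 * 1024) * 1000;
        return SEEK_MS + transferMs;
    }

    public static void main(String[] args) {
        // A 64 KB column block: transfer is ~0.6 ms, so the 10 ms seek dominates.
        System.out.printf("64 KB column: %.1f ms (seek %.0f%% of the cost)%n",
            readTimeMs(64 * 1024), 100 * SEEK_MS / readTimeMs(64 * 1024));
        // A 16 MB column block: transfer is ~160 ms, so the seek is negligible.
        System.out.printf("16 MB column: %.1f ms (seek %.0f%% of the cost)%n",
            readTimeMs(16 * 1024 * 1024), 100 * SEEK_MS / readTimeMs(16 * 1024 * 1024));
    }
}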

With multiple row groups per file you have to choose a size for the row 
groups.  Would you ever choose a size smaller than 64 MB, the typical HDFS 
block size?  Column files are only an advantage when there are multiple 
columns, so the amount read will typically be a fraction of the row group size.
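
As an illustration of how little of a row group a projecting reader actually 
touches (the row group size, column count, and projection below are assumed 
for the example, not taken from any benchmark):

/**
 * Illustrative sketch: a reader that projects a few of many roughly
 * equal-sized columns reads only a small fraction of each row group, so
 * shrinking row groups below an HDFS block shrinks the useful read per seek
 * even further.
 */
public class ProjectionReadSketch {
    public static void main(String[] args) {
        long rowGroupBytes = 64L * 1024 * 1024; // assume a 64 MB row group (one HDFS block)
        int totalColumns = 100;                 // assumed schema width
        int projectedColumns = 3;               // assumed columns of interest

        // With roughly equal column sizes, bytes actually read per row group:
        long bytesRead = rowGroupBytes * projectedColumns / totalColumns;
        System.out.printf("read %.1f MB of a %d MB row group (%.0f%%)%n",
            bytesRead / (1024.0 * 1024.0), rowGroupBytes / (1024 * 1024),
            100.0 * projectedColumns / totalColumns);
    }
}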

What cases do you imagine where having a row group size less than a file is 
useful?
                
> add a column-major codec for data files
> ---------------------------------------
>
>                 Key: AVRO-806
>                 URL: https://issues.apache.org/jira/browse/AVRO-806
>             Project: Avro
>          Issue Type: New Feature
>          Components: java, spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-806.patch, AVRO-806.patch, AVRO-806-v2.patch, 
> avro-file-columnar.pdf
>
>
> Define a codec that, when a data file's schema is a record schema, writes 
> blocks within the file in column-major order.  This would permit better 
> compression and also permit efficient skipping of fields that are not of 
> interest.
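
For illustration only (this is not the actual codec, and the record shape is 
made up), the reordering within a single file block amounts to something like 
the following:

import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the column-major idea: within one block, write all values of the
 * first field, then all values of the second, instead of interleaving fields
 * per record.  Similar values end up adjacent (better compression), and a
 * reader interested in only one field can skip the other columns entirely.
 */
public class ColumnMajorBlockSketch {
    public static void main(String[] args) {
        // Hypothetical block of three records with schema {id: long, name: string}.
        long[]   ids   = {1, 2, 3};
        String[] names = {"a", "b", "c"};

        // Row-major (current data file layout): fields interleaved per record.
        List<Object> rowMajor = new ArrayList<>();
        for (int i = 0; i < ids.length; i++) {
            rowMajor.add(ids[i]);
            rowMajor.add(names[i]);
        }

        // Column-major (the proposed codec): all ids, then all names.
        List<Object> columnMajor = new ArrayList<>();
        for (long id : ids) columnMajor.add(id);
        for (String name : names) columnMajor.add(name);

        System.out.println("row-major:    " + rowMajor);
        System.out.println("column-major: " + columnMajor);
    }
}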

