[jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files

Krishna Kumar (JIRA) Wed, 06 Apr 2011 10:29:46 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016451#comment-13016451
 ]


Krishna Kumar commented on HIVE-2097:
-------------------------------------

Comment hijacked from HIVE-2065:

He Yongqiang added a comment - 31/Mar/11 23:13

We examined column groups, sorting the data internally on one column within 
each column group. (But we did not try different compression codecs across 
column groups.) We tried this with 3-4 tables and saw ~20% storage savings on 
one table compared to the previous RCFile. The main problem with this approach 
is that it is hard to find the correct/most efficient column group definitions.
For example, if table tbl_1 has 20 columns, a user can define:

col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;

This puts col_1, col_2, col_11, and col_13 into one column group and reorders 
that group by sorting on col_1 (0 is the index of the first column in the 
group). It puts col_3, col_4, col_15, and col_16 into another column group and 
reorders that group by sorting on col_4 (index 1). Finally, all other columns 
go into the default column group in their original order.
It should also be easy to allow a different compression codec for each column 
group.
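
To make the definition syntax concrete, here is a minimal Python sketch of how 
such a string could be parsed. The names ColumnGroup and parse_column_groups 
are hypothetical and are not taken from the Hive code base; the sketch only 
assumes the "columns:index;" layout shown in the example above.

```python
from typing import List, NamedTuple


class ColumnGroup(NamedTuple):
    columns: List[str]  # column names in the group, in declared order
    sort_index: int     # 0-based position of the sort column within the group

    @property
    def sort_column(self) -> str:
        return self.columns[self.sort_index]


def parse_column_groups(spec: str) -> List[ColumnGroup]:
    """Parse 'c1,c2:0;c3,c4:1;' into ColumnGroup entries (hypothetical helper)."""
    groups = []
    for part in spec.split(";"):
        part = part.strip()
        if not part:  # tolerate the trailing ';'
            continue
        cols_text, _, index_text = part.rpartition(":")
        columns = [c.strip() for c in cols_text.split(",") if c.strip()]
        sort_index = int(index_text)
        if not columns or not 0 <= sort_index < len(columns):
            raise ValueError(f"bad column group: {part!r}")
        groups.append(ColumnGroup(columns, sort_index))
    return groups


groups = parse_column_groups(
    "col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;"
)
for group in groups:
    print(group.columns, "sorted by", group.sort_column)
```

Columns not named in any group would fall into the default column group, which 
this sketch does not model.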

The main blocking issue for this approach is that it needs a full set of 
utilities to find the best column group definition.
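
One way such a utility might score a candidate definition is to compress a 
sample of rows under it and compare sizes. The sketch below is an assumption 
for illustration only: it uses zlib as a stand-in codec, synthetic data, and a 
hypothetical compressed_size helper, and it ignores whatever bookkeeping RCFile 
would need to restore the original row order.

```python
import random
import zlib
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, str]
# (column names, 0-based index of the sort column, or None to keep row order)
Group = Tuple[Sequence[str], Optional[int]]


def compressed_size(rows: List[Row], groups: List[Group]) -> int:
    """Total zlib-compressed bytes when each group is stored column by column."""
    total = 0
    for columns, sort_index in groups:
        ordered = rows
        if sort_index is not None:
            key = columns[sort_index]
            ordered = sorted(rows, key=lambda r: r[key])
        for col in columns:
            data = "\n".join(r[col] for r in ordered).encode("utf-8")
            total += len(zlib.compress(data))
    return total


# Synthetic sample: 'lang' is fully determined by 'country'; 'user_id' is random.
rng = random.Random(0)
lang_of = {"us": "en", "in": "hi", "br": "pt", "de": "de"}
rows = []
for _ in range(5000):
    country = rng.choice(sorted(lang_of))
    rows.append({
        "country": country,
        "lang": lang_of[country],
        "user_id": str(rng.randrange(10**6)),
    })

baseline = [(["country", "lang", "user_id"], None)]
candidate = [(["country", "lang"], 0), (["user_id"], None)]

baseline_size = compressed_size(rows, baseline)
candidate_size = compressed_size(rows, candidate)
print("baseline bytes:", baseline_size)
print("candidate bytes:", candidate_size)
```

A real tool would have to search over many candidate groupings and sort 
columns, which is the hard part described above; this only shows the scoring 
step for one candidate.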



> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
>                 Key: HIVE-2097
>                 URL: https://issues.apache.org/jira/browse/HIVE-2097
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas:
>
> 1. More efficient serialization/deserialization based on type-specific and 
> storage-specific knowledge.
>    For instance, storing sorted numeric values efficiently using some delta 
> coding techniques.
> 2. More efficient compression based on type-specific and storage-specific 
> knowledge.
>    Enable compression codecs to be specified based on types or individual 
> columns.
> 3. Reordering the on-disk storage for better compression efficiency.
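
Idea 1 in the quoted description mentions delta coding for sorted numeric 
values. As a minimal sketch of that general technique (not of any RCFile 
implementation; delta_encode and delta_decode are hypothetical names), each 
value after the first is stored as its difference from the previous one, so a 
sorted column becomes a sequence of small non-negative numbers:

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value minus its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


timestamps = [1301600000, 1301600003, 1301600004, 1301600010]
encoded = delta_encode(timestamps)
print(encoded)  # [1301600000, 3, 1, 6]
assert delta_decode(encoded) == timestamps
```

The small deltas can then be written with a variable-length integer encoding 
or passed to a general-purpose codec.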

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
