[
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682740#action_12682740
]
he yongqiang commented on HIVE-352:
-----------------------------------
Thanks, Joydeep Sen Sarma. Your feedback is really important.
1. Store schema: block-wise column store or one file per column.
Our current implementation stores each column in its own file. And the most
annoying part for us, just as you said, is that currently, and even in the near
future, HDFS does not support colocating the file segments of different columns
of the same table. So some operations need to fetch data from a new file (like a
map-side hash join, or a join with CompositeInputFormat) or need an extra map-reduce
job to merge the data back together. Other operations work well with this layout.
I think block-wise column storage is a good point, and I will try to implement it soon.
With different columns colocated in a single block, some operations do not
need a reduce phase (which is really time-consuming).
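To make the block-wise idea concrete, here is a minimal sketch (not Hive code; the class and method names are hypothetical) of how a row group could colocate chunks of every column in one block, so that a row can be reassembled locally without fetching a second file or running a merge job:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one "row group" holds a chunk of every column,
// so all values of a row live in the same block and no extra
// map-reduce job is needed to stitch columns back together.
public class RowGroup {
    private final List<List<String>> columns = new ArrayList<>();

    public RowGroup(int numColumns) {
        for (int i = 0; i < numColumns; i++) {
            columns.add(new ArrayList<String>());
        }
    }

    // Values are appended column by column (columnar layout),
    // but every row stays entirely within this group.
    public void addRow(String... values) {
        for (int i = 0; i < values.length; i++) {
            columns.get(i).add(values[i]);
        }
    }

    // Reassemble row r by picking element r from each column chunk.
    public List<String> getRow(int r) {
        List<String> row = new ArrayList<>();
        for (List<String> col : columns) {
            row.add(col.get(r));
        }
        return row;
    }

    // A column scan still touches only the one chunk it needs.
    public List<String> getColumn(int c) {
        return columns.get(c);
    }
}
```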
2. Compression
With different columns in different files, light-weight compressions such
as RLE, dictionary, and bit-vector encoding can be used. One benefit of these
light-weight compression algorithms is that some operations do not need to
decompress the data.
If we implement block-wise column storage, should we also let the user specify
the light-weight compression algorithm for each column, or should we choose one (like
RLE) internally when the data clusters well? Since dictionary and bit-vector
encoding should also be supported, should columns using those compression algorithms
also be placed in the block-wise columnar file? I think placing these
columns in separate files would be easier to handle, but I do not know whether
that fits into Hive. I am new to Hive.
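As a toy illustration of why light-weight compression helps (a sketch, not Hive's actual implementation; the class name is made up): with RLE, an operation like counting matches of a value can run directly on the (value, runLength) pairs, without ever decompressing the column:

```java
import java.util.ArrayList;
import java.util.List;

// Toy run-length encoding sketch: each run is a {value, length} pair.
public class Rle {
    public static List<int[]> encode(int[] column) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j] == column[i]) {
                j++;
            }
            runs.add(new int[] { column[i], j - i });
            i = j;
        }
        return runs;
    }

    // COUNT(col = target) computed on the compressed runs directly --
    // no decompression step, which is the benefit described above.
    public static int countEquals(List<int[]> runs, int target) {
        int count = 0;
        for (int[] run : runs) {
            if (run[0] == target) {
                count += run[1];
            }
        }
        return count;
    }
}
```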
{quote}
having a number of open codecs can hurt in memory usage
{quote}
Currently I cannot think of a way to avoid this with column-per-file storage.
3. File format
Yeah, I think we need to add new file formats and their corresponding
InputFormats. Currently we have implemented VFile (Value File; we do not
need to store a key part) and BitMapFile. We have not implemented a
DictionaryFile; instead, we use a header file for VFile to store the dictionary
entries. The header file for VFile is unnecessary for some columns, and for
others it is a must.
I think refactoring the file formats should be the starting point for this issue.
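The dictionary-in-a-header idea can be sketched like this (hypothetical names; in the real layout the header is a separate file next to the VFile, while here both halves are just in-memory structures): the header stores each distinct value once, and the column body stores only small integer codes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of dictionary encoding: the "header" holds the distinct
// values once; the column body holds only integer codes into it.
public class DictColumn {
    private final Map<String, Integer> codeOf = new HashMap<>();
    private final List<String> header = new ArrayList<>(); // dictionary entries
    private final List<Integer> body = new ArrayList<>();  // encoded column

    public void add(String value) {
        Integer code = codeOf.get(value);
        if (code == null) {
            code = header.size();
            codeOf.put(value, code);
            header.add(value);
        }
        body.add(code);
    }

    // Decoding a value is just a lookup into the header.
    public String get(int row) {
        return header.get(body.get(row));
    }

    public int dictionarySize() {
        return header.size();
    }
}
```

For a low-cardinality column the body shrinks dramatically, but for a nearly unique column the header is pure overhead, which is why it is a must for some columns and unnecessary for others.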
Thanks again.
> Make Hive support column based storage
> --------------------------------------
>
> Key: HIVE-352
> URL: https://issues.apache.org/jira/browse/HIVE-352
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: he yongqiang
>
> Column-based storage has been proven a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will
> enhance Hive to support column-based storage.
> Actually we have done some work on column-based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.