[
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682740#action_12682740
]
he yongqiang commented on HIVE-352:
-----------------------------------
Thanks, Joydeep Sen Sarma. Your feedback is really important.
1. Store schema: block-wise column store or one file per column.
Our current implementation stores each column in its own file. And the most
annoying part for us, just as you said, is that currently, and even in the near
future, HDFS does not support colocating the file segments of different columns
of the same table. So some operations need to fetch data from a new file (like a
map-side hash join, or a join with CompositeInputFormat) or need an extra map-reduce
job to merge the data back together. Other operations work well with this layout.
I think block-wise column storage is a good point, and I will try to implement it soon.
With different columns colocated in a single block, some operations do not
need a reduce phase (which is really time-consuming).
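To make the block-wise idea concrete, here is a minimal sketch (not Hive code; the class and method names are hypothetical) of how a row group could colocate chunks of every column in one block, so that a row can be reassembled locally without fetching a second file or running a merge job:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one "row group" holds a chunk of every column,
// so all values of a row live in the same block and no extra
// map-reduce job is needed to stitch columns back together.
public class RowGroup {
    private final List<List<String>> columns = new ArrayList<>();

    public RowGroup(int numColumns) {
        for (int i = 0; i < numColumns; i++) {
            columns.add(new ArrayList<String>());
        }
    }

    // Values are appended column by column (columnar layout),
    // but every row stays entirely within this group.
    public void addRow(String... values) {
        for (int i = 0; i < values.length; i++) {
            columns.get(i).add(values[i]);
        }
    }

    // Reassemble row r by picking element r from each column chunk.
    public List<String> getRow(int r) {
        List<String> row = new ArrayList<>();
        for (List<String> col : columns) {
            row.add(col.get(r));
        }
        return row;
    }

    // A column scan still touches only the one chunk it needs.
    public List<String> getColumn(int c) {
        return columns.get(c);
    }
}
```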
2. Compression
With different columns in different files, light-weight compressions such
as RLE, dictionary, and bit-vector encoding can be used. One benefit of these
light-weight compression algorithms is that some operations do not need to
decompress the data.
If we implement block-wise column storage, should we also let the user specify
the light-weight compression algorithm for each column, or should we choose one (like
RLE) internally when the data clusters well? Since dictionary and bit-vector
encoding should also be supported, should columns using those compression algorithms
also be placed in the block-wise columnar file? I think placing these
columns in separate files would be easier to handle, but I do not know whether
that fits into Hive. I am new to Hive.
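As a toy illustration of why light-weight compression helps (a sketch, not Hive's actual implementation; the class name is made up): with RLE, an operation like counting matches of a value can run directly on the (value, runLength) pairs, without ever decompressing the column:

```java
import java.util.ArrayList;
import java.util.List;

// Toy run-length encoding sketch: each run is a {value, length} pair.
public class Rle {
    public static List<int[]> encode(int[] column) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j] == column[i]) {
                j++;
            }
            runs.add(new int[] { column[i], j - i });
            i = j;
        }
        return runs;
    }

    // COUNT(col = target) computed on the compressed runs directly --
    // no decompression step, which is the benefit described above.
    public static int countEquals(List<int[]> runs, int target) {
        int count = 0;
        for (int[] run : runs) {
            if (run[0] == target) {
                count += run[1];
            }
        }
        return count;
    }
}
```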
{quote}
having a number of open codecs can hurt in memory usage
{quote}
Currently I cannot think of a way to avoid this with column-per-file storage.
3. File format
Yeah, I think we need to add new file formats and their corresponding
InputFormats. Currently we have implemented VFile (Value File; we do not
need to store a key part) and BitMapFile. We have not implemented a
DictionaryFile; instead, we use a header file for VFile to store the dictionary
entries. The header file for VFile is unnecessary for some columns, and for
others it is a must.
I think refactoring the file formats should be the starting point for this issue.
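The dictionary-in-a-header idea can be sketched like this (hypothetical names; in the real layout the header is a separate file next to the VFile, while here both halves are just in-memory structures): the header stores each distinct value once, and the column body stores only small integer codes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of dictionary encoding: the "header" holds the distinct
// values once; the column body holds only integer codes into it.
public class DictColumn {
    private final Map<String, Integer> codeOf = new HashMap<>();
    private final List<String> header = new ArrayList<>(); // dictionary entries
    private final List<Integer> body = new ArrayList<>();  // encoded column

    public void add(String value) {
        Integer code = codeOf.get(value);
        if (code == null) {
            code = header.size();
            codeOf.put(value, code);
            header.add(value);
        }
        body.add(code);
    }

    // Decoding a value is just a lookup into the header.
    public String get(int row) {
        return header.get(body.get(row));
    }

    public int dictionarySize() {
        return header.size();
    }
}
```

For a low-cardinality column the body shrinks dramatically, but for a nearly unique column the header is pure overhead, which is why it is a must for some columns and unnecessary for others.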
Thanks again.
> Make Hive support column based storage
> --------------------------------------
>
> Key: HIVE-352
> URL: https://issues.apache.org/jira/browse/HIVE-352
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: he yongqiang
>
> Column-based storage has been proven a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will
> enhance Hive to support column-based storage.
> Actually we have done some work on column-based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.