[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682740#action_12682740 ]
he yongqiang commented on HIVE-352:
-----------------------------------

Thanks, Joydeep Sen Sarma. Your feedback is really important.

1. Storage schema: block-wise column store or one file per column.
Our current implementation stores each column in its own file. The most annoying part for us, just as you said, is that HDFS does not currently support colocating the file segments of different columns of the same table, and will not in the near future. So some operations need to fetch data from another file (like a map-side hash join, or a join with CompositeInputFormat), or need an additional map-reduce job to merge the data together. Other operations work well with this layout.
I think the block-wise column store is a good point. I will try to implement it soon. With different columns colocated in a single block, some operations do not need a reduce phase (which is really time-consuming).

2. Compression
With different columns in different files, lightweight compression schemes such as RLE, dictionary, and bit-vector encoding can be used. One benefit of these lightweight algorithms is that some operations do not need to decompress the data.
If we implement the block-wise column storage, should we also let the user specify the lightweight compression algorithm for each column, or should we choose one (like RLE) internally when the data is well clustered? Since dictionary and bit-vector encoding should also be supported, should columns using them also be placed in the block-wise columnar file? I think placing these columns in separate files would be easier to handle, but I do not know whether that fits into Hive. I am new to Hive.
{quote}
having a number of open codecs can hurt in memory usage
{quote}
Currently I cannot think of a solution that avoids this for the column-per-file store.

3. File format
Yes, I think we need to add new file formats and their corresponding InputFormats.
Currently, we have implemented VFile (Value File; we do not need to store a key part) and BitMapFile. We have not implemented a DictionaryFile; instead we use a header file for the VFile to store dictionary entries. This header file is not needed for some columns, but for others it is required.
I think refactoring the file formats should be the starting point for this issue.

Thanks again.

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: he yongqiang
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will
> enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?
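
To illustrate the remark in item 2 that some operations can run on lightweight-compressed columns without decompressing them, here is a minimal sketch of run-length encoding (RLE) in Java. It is not taken from the implementation discussed in this comment or from Hive: the class `RleColumnSketch`, its nested `Run` type, and the methods `encode`, `countEquals`, and `decode` are hypothetical names chosen for illustration, and the column is assumed to be a small in-memory list of strings rather than an HDFS file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration only; not part of Hive or of the patch discussed above.
final class RleColumnSketch {

    /** One run: `length` consecutive rows that all hold `value`. */
    static final class Run {
        final String value;
        final int length;

        Run(String value, int length) {
            this.value = value;
            this.length = length;
        }
    }

    /** Run-length encodes a column; this pays off only when equal values are clustered. */
    static List<Run> encode(List<String> column) {
        List<Run> runs = new ArrayList<>();
        String current = null;
        int length = 0;
        for (String v : column) {
            if (length > 0 && Objects.equals(v, current)) {
                length++;                      // extend the current run
            } else {
                if (length > 0) {
                    runs.add(new Run(current, length));
                }
                current = v;                   // start a new run
                length = 1;
            }
        }
        if (length > 0) {
            runs.add(new Run(current, length));
        }
        return runs;
    }

    /** Counts rows equal to `target` by summing run lengths; no row is expanded. */
    static long countEquals(List<Run> runs, String target) {
        long count = 0;
        for (Run r : runs) {
            if (Objects.equals(r.value, target)) {
                count += r.length;
            }
        }
        return count;
    }

    /** Expands the runs back into rows, for operations that do need the raw values. */
    static List<String> decode(List<Run> runs) {
        List<String> rows = new ArrayList<>();
        for (Run r : runs) {
            for (int i = 0; i < r.length; i++) {
                rows.add(r.value);
            }
        }
        return rows;
    }
}
```

An equality count such as `countEquals(encode(column), "us")` touches one entry per run instead of one per row, which is the kind of saving the comment refers to. Whether a real column benefits depends on how well clustered its values are.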