[
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689041#action_12689041
]
He Yongqiang commented on HIVE-352:
-----------------------------------
Thanks, Joydeep and Prasad.
First, I would like to give an update on the recent work:
I had implemented an initial RCFile that was just a wrapper around SequenceFile,
and it relied on Hadoop-5553. Since it seems Hadoop-5553 will not be resolved,
I have implemented another RCFile, which copies much code from SequenceFile
(especially the Writer code) and provides the same on-disk data layout as
SequenceFile.
Here is a draft description of the new RCFile:
1) Only record compression, or no compression at all.
In B2.2 we store a bunch of raw rows in one record in a columnar way, so
there is no need for block compression: block compression would decompress
all of the data at once.
2) In-record compression.
If the writer is created with the compress flag, then the value part of a
record is compressed, but in a column-wise style. The layout looks like
this:
Record length
Key length
{the below is the Key part}
number_of_rows_in_this_record(vint)
column_1_ondisk_length(vint),column_1_row_1_value_plain_length,
column_1_row_2_value_plain_length,....
column_2_ondisk_length(vint),column_2_row_1_value_plain_length,
column_2_row_2_value_plain_length,....
..........
{the end of the key part}
{the begin of the value part}
Compressed data or plain data of [column_1_row_1_value,
column_1_row_2_value,....]
Compressed data or plain data of [column_2_row_1_value,
column_2_row_2_value,....]
{the end of the value part}
The key part: KeyBuffer
The value part: ValueBuffer
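The key-part layout above can be sketched in code as follows. This is only an illustration, not the actual Writer code: the vint encoder below handles just small non-negative lengths (a stand-in for Hadoop's general variable-length encoding), and the class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of the key-part layout described above.
public class RCFileKeySketch {

    // Toy stand-in for a vint encoder: values 0..127 fit in one byte.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        if (v < 0 || v > 127)
            throw new IllegalArgumentException("sketch only handles 0..127");
        out.write(v);
    }

    // Encode the key part: number_of_rows_in_this_record, then, for each
    // column, its on-disk length followed by each row value's plain length.
    static byte[] encodeKeyPart(int numRows, int[] onDiskLengths,
                                int[][] plainLengths) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, numRows);                        // number_of_rows_in_this_record
        for (int c = 0; c < onDiskLengths.length; c++) {
            writeVInt(out, onDiskLengths[c]);           // column_c_ondisk_length
            for (int r = 0; r < numRows; r++)
                writeVInt(out, plainLengths[c][r]);     // column_c_row_r_value_plain_length
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 2 rows, 2 columns: each vint fits in one byte here, so the
        // key part is 1 + (1 + 2) + (1 + 2) = 7 bytes.
        byte[] key = encodeKeyPart(2, new int[]{11, 7},
                                   new int[][]{{5, 6}, {3, 4}});
        System.out.println(key.length);
    }
}
```

On disk this key part would be preceded by the record length and key length, and followed by the per-column value buffers (compressed or plain).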
3) The reader.
It currently provides only 2 APIs:
next(LongWritable rowID): returns the next row id. I think this should be
refined, because the row id may not be a real row id; it is only the number
of rows already passed since the beginning of the reader.
List<Bytes> getCurrentRow(): returns the raw bytes of all columns of one row.
Because the reader lets the user specify the column ids that should be
skipped, the returned List<Bytes> contains only the bytes of the unskipped
columns. Maybe it would be better to store a NullBytes in the returned list
to represent a skipped column.
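To illustrate the two options for skipped columns, here is a toy sketch (the class and method names are invented, and Bytes is replaced by plain byte[] for simplicity): one variant drops skipped columns entirely, the other keeps a null placeholder in the skipped position, as the NullBytes idea suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy illustration of the two skip-handling options discussed above.
public class ColumnSkipSketch {

    // Return the columns of one row, leaving out the skipped column ids.
    static List<byte[]> dropSkipped(List<byte[]> row, Set<Integer> skipped) {
        List<byte[]> out = new ArrayList<>();
        for (int c = 0; c < row.size(); c++)
            if (!skipped.contains(c))
                out.add(row.get(c));
        return out;
    }

    // Same, but keep a null placeholder (standing in for NullBytes) so
    // the list index still matches the original column id.
    static List<byte[]> keepPlaceholders(List<byte[]> row, Set<Integer> skipped) {
        List<byte[]> out = new ArrayList<>();
        for (int c = 0; c < row.size(); c++)
            out.add(skipped.contains(c) ? null : row.get(c));
        return out;
    }
}
```

The placeholder variant costs one extra entry per skipped column, but lets callers index the result directly by column id without an extra mapping.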
> Make Hive support column based storage
> --------------------------------------
>
> Key: HIVE-352
> URL: https://issues.apache.org/jira/browse/HIVE-352
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: He Yongqiang
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will
> enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.