[
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689041#action_12689041
]
He Yongqiang commented on HIVE-352:
-----------------------------------
Thanks, Joydeep and Prasad.
First, I would like to give an update on the recent work:
I had implemented an initial RCFile that was just a wrapper around SequenceFile,
and it relied on Hadoop-5553. Since it seems Hadoop-5553 will not be resolved,
I have implemented another RCFile, which copies much code from SequenceFile
(especially the Writer code) and provides the same on-disk data layout as
SequenceFile.
Here is a draft description of the new RCFile:
1) Only record compression, or no compression at all.
In B2.2 we store a bunch of raw rows in one record in a columnar way, so
there is no need for block compression: block compression would decompress
all of the data at once.
2) In-record compression.
If the writer is created with the compress flag, then the value part of a
record is compressed, but in a column-wise style. The layout looks like
this:
Record length
Key length
{the below is the Key part}
number_of_rows_in_this_record(vint)
column_1_ondisk_length(vint),column_1_row_1_value_plain_length,
column_1_row_2_value_plain_length,....
column_2_ondisk_length(vint),column_2_row_1_value_plain_length,
column_2_row_2_value_plain_length,....
..........
{the end of the key part}
{the begin of the value part}
Compressed data or plain data of [column_1_row_1_value,
column_1_row_2_value,....]
Compressed data or plain data of [column_2_row_1_value,
column_2_row_2_value,....]
{the end of the value part}
The key part: KeyBuffer
The value part: ValueBuffer
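The key-part layout above can be sketched in code as follows. This is only an illustration, not the actual Writer code: the vint encoder below handles just small non-negative lengths (a stand-in for Hadoop's general variable-length encoding), and the class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of the key-part layout described above.
public class RCFileKeySketch {

    // Toy stand-in for a vint encoder: values 0..127 fit in one byte.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        if (v < 0 || v > 127)
            throw new IllegalArgumentException("sketch only handles 0..127");
        out.write(v);
    }

    // Encode the key part: number_of_rows_in_this_record, then, for each
    // column, its on-disk length followed by each row value's plain length.
    static byte[] encodeKeyPart(int numRows, int[] onDiskLengths,
                                int[][] plainLengths) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, numRows);                        // number_of_rows_in_this_record
        for (int c = 0; c < onDiskLengths.length; c++) {
            writeVInt(out, onDiskLengths[c]);           // column_c_ondisk_length
            for (int r = 0; r < numRows; r++)
                writeVInt(out, plainLengths[c][r]);     // column_c_row_r_value_plain_length
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 2 rows, 2 columns: each vint fits in one byte here, so the
        // key part is 1 + (1 + 2) + (1 + 2) = 7 bytes.
        byte[] key = encodeKeyPart(2, new int[]{11, 7},
                                   new int[][]{{5, 6}, {3, 4}});
        System.out.println(key.length);
    }
}
```

On disk this key part would be preceded by the record length and key length, and followed by the per-column value buffers (compressed or plain).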
3) The reader.
It currently provides only 2 APIs:
next(LongWritable rowID): returns the next row id. I think this should be
refined, because the row id may not be a real row id; it is only the number
of rows already passed since the beginning of the reader.
List<Bytes> getCurrentRow(): returns the raw bytes of all columns of one row.
Because the reader lets the user specify the column ids that should be
skipped, the returned List<Bytes> contains only the bytes of the unskipped
columns. Maybe it would be better to store a NullBytes in the returned list
to represent a skipped column.
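To illustrate the two options for skipped columns, here is a toy sketch (the class and method names are invented, and Bytes is replaced by plain byte[] for simplicity): one variant drops skipped columns entirely, the other keeps a null placeholder in the skipped position, as the NullBytes idea suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy illustration of the two skip-handling options discussed above.
public class ColumnSkipSketch {

    // Return the columns of one row, leaving out the skipped column ids.
    static List<byte[]> dropSkipped(List<byte[]> row, Set<Integer> skipped) {
        List<byte[]> out = new ArrayList<>();
        for (int c = 0; c < row.size(); c++)
            if (!skipped.contains(c))
                out.add(row.get(c));
        return out;
    }

    // Same, but keep a null placeholder (standing in for NullBytes) so
    // the list index still matches the original column id.
    static List<byte[]> keepPlaceholders(List<byte[]> row, Set<Integer> skipped) {
        List<byte[]> out = new ArrayList<>();
        for (int c = 0; c < row.size(); c++)
            out.add(skipped.contains(c) ? null : row.get(c));
        return out;
    }
}
```

The placeholder variant costs one extra entry per skipped column, but lets callers index the result directly by column id without an extra mapping.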
> Make Hive support column based storage
> --------------------------------------
>
> Key: HIVE-352
> URL: https://issues.apache.org/jira/browse/HIVE-352
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: He Yongqiang
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will
> enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.