[
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683293#action_12683293
]
Zheng Shao commented on HIVE-352:
---------------------------------
Hi Yongqiang,
Sorry for jumping on this issue late.
Let me summarize the choices that we have to make:
A. Put different columns in different files (we can still have column sets: a
bunch of columns in the same file).
B. Put different columns in the same file, but organize it in a block-based
way. Within a single block, the first column of all rows comes first, then the
second column, etc.
  B1: Write a new FileFormat.
  B2: Continue to use SequenceFileFormat.
    B2.1: Store a block in multiple records, one record for each column. Use the
    key to label the beginning of a block (or the column id).
    B2.2: Store a block in a single record.
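To make the block-based layout in B concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not Hive or Hadoop code: the function name `to_column_blocks`, the `rows_per_block` parameter and the sample rows are all made up for this example. Each block holds a fixed number of rows, and inside a block the values are stored column by column.

```python
from typing import List, Sequence

Row = Sequence[str]


def to_column_blocks(rows: List[Row], rows_per_block: int) -> List[List[List[str]]]:
    """Group rows into blocks and store each block column by column.

    Hypothetical helper for illustration only; it is not a Hive or Hadoop API.
    """
    blocks = []
    for start in range(0, len(rows), rows_per_block):
        chunk = rows[start:start + rows_per_block]
        # zip(*chunk) transposes the chunk: one list per column.
        blocks.append([list(column) for column in zip(*chunk)])
    return blocks


# Made-up sample data: three rows with three columns each.
rows = [("1", "alice", "US"), ("2", "bob", "CN"), ("3", "carol", "DE")]
blocks = to_column_blocks(rows, rows_per_block=2)
```

With two rows per block, the first block holds the first column of both rows, then the second column, then the third. In option B2.2 each such block would be serialized into the value of a single record.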
Comparing A and B:
1. B is much easier to implement than A. Hadoop jobs take files as input. If
the data is stored in a single file, it's much easier to either read or write
the file.
2. B may have the advantage of locality.
3. B may require a slightly larger memory buffer for writing.
4. B may not be as efficient as A for reading, since all data needs to be read
(unless the FileFormat supports "skip", but that might create more random seeks
depending on block size).
Comparing B1 and B2:
1. B1 is much more flexible, since we can do whatever we want (especially
skip-reading etc).
2. B2 is much easier to do, and we naturally get all the benefits of
SequenceFile: it is splittable and has a customizable compression codec.
Comparing B2.1 and B2.2:
1. B2.2 is easier to implement, because we don't have the problem of the
columns of one block being split across multiple mappers.
2. B2.1 is potentially more efficient if we allow SequenceFile to skip records
and ask Hive to tell us which of the columns can be skipped.
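The column skipping that B2.1 would allow can be sketched as follows. This is plain Python over in-memory lists, not the SequenceFile API: the record shape `((block_id, column_id), values)` and the helper names `write_b21` and `read_b21` are assumptions made for this example. One record is written per column of each block, and the reader decides from the key alone whether to skip a record.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# Hypothetical record shape: key = (block_id, column_id), value = column values.
Record = Tuple[Tuple[int, int], List[str]]


def write_b21(blocks: List[List[List[str]]]) -> List[Record]:
    """Emit one record per column of each block, labelled by its key."""
    records = []
    for block_id, columns in enumerate(blocks):
        for column_id, values in enumerate(columns):
            records.append(((block_id, column_id), values))
    return records


def read_b21(records: Iterable[Record], wanted: Set[int]) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield only the wanted columns; other records are skipped based on the key."""
    for (block_id, column_id), values in records:
        if column_id not in wanted:
            continue  # a real reader would seek past the value here
        yield block_id, column_id, values


# Made-up sample: two column-major blocks with three columns each.
blocks = [[["1", "2"], ["alice", "bob"], ["US", "CN"]], [["3"], ["carol"], ["DE"]]]
records = write_b21(blocks)
names_only = list(read_b21(records, wanted={1}))
```

In B2.2 the whole block sits in one record value, so there is no per-column key to skip on and the reader has to go through the entire value even when the query only needs one column.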
As a result, I would suggest trying B2.2 as the first exercise, then B2.1,
then B1, then A.
The amount of work will probably grow by a factor of 3-5 from each level to
the next (B2.2, B2.1, B1, A). So starting from B2.2 does not cost much, and the
first steps will be good learning steps for the next ones.
Thoughts?
> Make Hive support column based storage
> --------------------------------------
>
> Key: HIVE-352
> URL: https://issues.apache.org/jira/browse/HIVE-352
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: He Yongqiang
>
> Column based storage has been proven a better storage layout for OLAP.
> Hive does a great job on raw row oriented storage. In this issue, we will
> enhance Hive to support column based storage.
> Actually we have done some work on column based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?