[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701829#action_12701829 ]

Zheng Shao commented on HIVE-352:
---------------------------------

The numbers look much more reasonable than before. 1.7s to read and decompress 46MB 
of data is plausible, but the SequenceFile speed, 25s to read and decompress 51MB of 
data, looks a bit too slow.

0. Did you try that with Hadoop 0.17.0 ("ant -Dhadoop.version=0.17.0 test", etc.)?

1. Can you add your tests to ant, or post the test scripts, so that everybody 
can easily reproduce the results you got?

2. For DistributedFileSystem, how big is the cluster? Is the file local to the 
reader (the file is small, so it is clearly a single block)?

3. It seems SequenceFile's compression is not as good as RCFile's, even though the 
data is the same and equally random. What is the exact record format in the 
SequenceFile? Did you use delimiters, or did you store the length of each String? 
(See the sketch right after these questions.)

4. 40MB to 50MB is too small for testing. Let's double it to ~100MB, but keep it 
under 128MB so it still simulates a single file system block.
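
To make question 3 concrete, here is a minimal sketch of the two record layouts being 
asked about, delimiter-separated versus length-prefixed. It is not taken from the patch; 
the class name and the use of ^A as separator are just placeholders for illustration.

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RecordLayouts {

  // Delimiter-separated: fields joined by a separator byte (here ^A, Hive's
  // default field delimiter); field lengths are implicit.
  static byte[] delimited(String[] fields) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) sb.append('\001');
      sb.append(fields[i]);
    }
    return sb.toString().getBytes();
  }

  // Length-prefixed: each field is preceded by its 4-byte length; no delimiter
  // bytes appear in the stream.
  static byte[] lengthPrefixed(String[] fields) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(bos);
    for (String f : fields) {
      byte[] b = f.getBytes();
      dos.writeInt(b.length);
      dos.write(b);
    }
    dos.flush();
    return bos.toByteArray();
  }
}
{code}

The 4-byte length prefixes change the byte stream quite a bit, so the two layouts can 
compress differently even on identical field values.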

I think we should compare the following 2 approaches (a rough sketch of both follows 
below):
BULK: When creating the file, buffer uncompressed data in memory; when the limit is 
reached, compress the buffer and write it out. When reading the file, do bulk 
decompression. This won't run out of memory because the decompressed size is bounded 
by the limit used at file creation.
NONBULK: When creating the file, compress data as it arrives and buffer the compressed 
bytes in memory; when the (compressed-size) limit is reached, write the block out. When 
reading the file, decompress in small chunks to make sure we don't run out of memory.
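
To make the difference concrete, here is a minimal sketch of the write side of the two 
approaches. It is not the actual patch code, and it uses java.util.zip instead of 
Hadoop's compression codec API purely for brevity; the class names, limits, and flush 
strategy are assumptions.

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

public class BufferingSketch {

  // BULK: buffer *uncompressed* column bytes; once the uncompressed size
  // reaches the limit, compress the whole buffer and write it out in one shot.
  static class BulkWriter {
    private final ByteArrayOutputStream raw = new ByteArrayOutputStream();
    private final OutputStream out;
    private final int uncompressedLimit;

    BulkWriter(OutputStream out, int uncompressedLimit) {
      this.out = out;
      this.uncompressedLimit = uncompressedLimit;
    }

    void append(byte[] value) throws IOException {
      raw.write(value);
      if (raw.size() >= uncompressedLimit) {
        flushBlock();
      }
    }

    void flushBlock() throws IOException {
      DeflaterOutputStream dos = new DeflaterOutputStream(out);
      raw.writeTo(dos);        // one compression pass over the whole block
      dos.finish();
      raw.reset();
    }
  }

  // NONBULK: compress values as they arrive and buffer the *compressed* bytes;
  // once the compressed size reaches the limit, write the block out.
  static class NonBulkWriter {
    private final ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    private final OutputStream out;
    private final int compressedLimit;
    private DeflaterOutputStream deflater = new DeflaterOutputStream(compressed);

    NonBulkWriter(OutputStream out, int compressedLimit) {
      this.out = out;
      this.compressedLimit = compressedLimit;
    }

    void append(byte[] value) throws IOException {
      deflater.write(value);
      // compressed.size() slightly undercounts: the Deflater still buffers
      // some input internally, but it is close enough to trigger the flush.
      if (compressed.size() >= compressedLimit) {
        flushBlock();
      }
    }

    void flushBlock() throws IOException {
      deflater.close();        // finishes the stream and releases the Deflater
      compressed.writeTo(out);
      compressed.reset();
      deflater = new DeflaterOutputStream(compressed);
    }
  }
}
{code}

The read side mirrors this: BULK can decompress a block in one call because the 
decompressed size is bounded by the creation-time limit, while NONBULK has to stream 
the decompression in small chunks.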

The third option, storing compressed data at creation time and doing bulk 
decompression at read time, is not practical because it can very easily run out of 
memory.

We've done BULK, and it showed great performance (1.6s to read and decompress a 40MB 
local file), but I suspect its compression ratio will be lower than NONBULK's.
Can you compare the compression ratios of BULK and NONBULK for different buffer sizes 
and column counts? A rough measurement harness is sketched below.
Also, with NONBULK we might be able to get bigger compressed blocks, so each skip 
could jump over more data than with BULK, but I think that is a minor issue.
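
Here is a rough harness for that comparison. It is a sketch under stated assumptions, 
not the real benchmark: it uses java.util.zip.Deflater, it compresses each chunk 
independently (the real NONBULK writer could keep compressor state across records, 
which would narrow the gap), and the low-entropy random data is only a stand-in for 
real column values.

{code}
import java.util.Random;
import java.util.zip.Deflater;

public class RatioCompare {

  public static void main(String[] args) {
    // ~100MB of synthetic column data, per question 4.
    byte[] data = syntheticColumn(100 * 1024 * 1024);

    long bulk = compressedSize(data, data.length);   // whole block in one pass
    System.out.printf("bulk      ratio=%.3f%n", (double) bulk / data.length);

    for (int chunk : new int[] {64 * 1024, 1024 * 1024, 16 * 1024 * 1024}) {
      long chunked = compressedSize(data, chunk);    // smaller independent blocks
      System.out.printf("chunk=%-9d ratio=%.3f%n", chunk,
          (double) chunked / data.length);
    }
  }

  // Compress data in independent chunks of the given size and sum the output bytes.
  static long compressedSize(byte[] data, int chunkSize) {
    long total = 0;
    byte[] buf = new byte[64 * 1024];
    for (int off = 0; off < data.length; off += chunkSize) {
      int len = Math.min(chunkSize, data.length - off);
      Deflater d = new Deflater();
      d.setInput(data, off, len);
      d.finish();
      while (!d.finished()) {
        total += d.deflate(buf);
      }
      d.end();
    }
    return total;
  }

  // Low-entropy random bytes so the data is compressible, like repetitive columns.
  static byte[] syntheticColumn(int size) {
    Random r = new Random(0);
    byte[] b = new byte[size];
    for (int i = 0; i < size; i++) {
      b[i] = (byte) ('a' + r.nextInt(8));
    }
    return b;
  }
}
{code}

Running it over different buffer sizes (and per column rather than over one big stream) 
would give a first-order answer to how much compression ratio BULK gives up.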

If the compression ratios don't turn out to be too different, we may just go with the 
BULK approach.

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 
> progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, 
> hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, 
> hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, 
> HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven to be a better storage layout for OLAP. 
> Hive does a great job with raw row-oriented storage. In this issue, we will 
> enhance Hive to support column-based storage. 
> Actually, we have already done some work on column-based storage on top of 
> HDFS; I think it will need some review and refactoring to port it to Hive. 
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
