[
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
He Yongqiang updated HIVE-352:
------------------------------
Attachment: 4-22 performace2.txt
hive-352-2009-4-22-2.patch
Following Zheng's suggestions, hive-352-2009-4-22-2.patch makes several
improvements over hive-352-2009-4-22.patch:
1) each row of data randomly produced in the test is now string bytes (the
previous version produced binary bytes)
2) a correctness parameter was added to the performance test (in
PerformTestRCFileAndSeqFile) to verify that what we read is what we wrote
4-22 performace2.txt adds more detailed test results:
1. local, using bulk decompression in RCFile->ValueBuffer->readFields(), like:
{noformat}
bufferRef.write(valueIn, columnBlockPlainDataLength);
{noformat}
2. local, not using bulk decompression in RCFile->ValueBuffer->readFields(),
like:
{noformat}
while (deflateFilter.available() > 0)
  bufferRef.write(valueIn, 1);
{noformat}
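The difference between the two read paths above can be sketched with plain java.util.zip streams (a minimal illustration only; RCFile's actual ValueBuffer, compression codecs, and buffer classes are not used here, and the method and class names below are hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class BulkVsByteRead {

    // Bulk path: drain the decompression stream in large chunks,
    // analogous to writing columnBlockPlainDataLength bytes at once.
    static byte[] readBulk(InputStream in, int sizeHint) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream(sizeHint);
        byte[] chunk = new byte[Math.max(sizeHint, 4096)];
        int n;
        while ((n = in.read(chunk)) > 0) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // Per-byte path: one read() call per byte, analogous to the
    // non-bulk loop that writes 1 byte per iteration.
    static byte[] readByteAtATime(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) >= 0) {
            buf.write(b);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "some column block plain data".getBytes("UTF-8");

        // Compress once so both read paths see identical input.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        DeflaterOutputStream def = new DeflaterOutputStream(compressed);
        def.write(plain);
        def.close();

        byte[] bulk = readBulk(new InflaterInputStream(
                new ByteArrayInputStream(compressed.toByteArray())), plain.length);
        byte[] perByte = readByteAtATime(new InflaterInputStream(
                new ByteArrayInputStream(compressed.toByteArray())));

        // Both paths recover the same plaintext; only the call count differs.
        System.out.println(java.util.Arrays.equals(bulk, plain)
                && java.util.Arrays.equals(perByte, plain));
    }
}
```

Both variants produce identical bytes; the per-byte loop simply pays one stream-read call per byte, which is consistent with the much larger "read all columns" times in result set 2 below.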
3. using DistributedFileSystem and bulk decompression; the tests are still run
on my local machine
Here are the brief results (for more detail, please see the attached 4-22
performace2.txt):
1.
(LocalFileSystem) Uses bulk decompression in RCFile->ValueBuffer->readFields,
and adds some noise between two RCFile reads, and after writing, to avoid disk
caching.
{noformat}
column number | RCFile size | RCFile read 1 column | RCFile read 2 columns | RCFile read all columns | SequenceFile size | SequenceFile read all
           10 |    11501112 |                  259 |                   181 |                     498 |          13046020 |                  7002
           25 |    28725817 |                  233 |                   269 |                    1082 |          32246409 |                 16539
           40 |    45940679 |                  261 |                   301 |                    1698 |          51436799 |                 25415
{noformat}
2.
(LocalFileSystem) Does not use bulk decompression in
RCFile->ValueBuffer->readFields; the test adds some noise between two RCFile
reads, and after writing, to avoid disk caching.
{noformat}
column number | RCFile size | RCFile read 1 column | RCFile read 2 columns | RCFile read all columns | SequenceFile size | SequenceFile read all
           10 |    11501112 |                 1804 |                  3262 |                   15956 |          13046020 |                  6927
           25 |    28725817 |                 1761 |                  3310 |                   39492 |          32246409 |                 15983
           40 |    45940679 |                 1843 |                  3386 |                   63759 |          51436799 |                 25256
{noformat}
3.
(DistributedFileSystem) Uses bulk decompression in
RCFile->ValueBuffer->readFields, and adds some noise between two RCFile reads,
and after writing, to avoid disk caching.
{noformat}
column number | RCFile size | RCFile read 1 column | RCFile read 2 columns | RCFile read all columns | SequenceFile size | SequenceFile read all
           10 |    11501112 |                 2381 |                  3516 |                    9898 |          13046020 |                 18053
           25 |    28725817 |                 3754 |                  5254 |                   22521 |          32246409 |                 43258
           40 |    45940679 |                 5597 |                  8225 |                   40304 |          51436799 |                 69278
{noformat}
> Make Hive support column based storage
> --------------------------------------
>
> Key: HIVE-352
> URL: https://issues.apache.org/jira/browse/HIVE-352
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: He Yongqiang
> Assignee: He Yongqiang
> Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22
> progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch,
> hive-352-2009-4-17.patch, hive-352-2009-4-19.patch,
> hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch,
> HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will
> enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.