[jira] Updated: (HIVE-461) Optimize RCFile reading by using column pruning results

He Yongqiang (JIRA) Tue, 26 May 2009 05:30:11 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


He Yongqiang updated HIVE-461:
------------------------------

    Attachment: hive-461-2009-05-26.patch

A first try. The main modifications are in ColumnPruner, HiveInputFormat, 
SelectOperator, and ExecDriver (one line). 
I also changed RCFile to set accepted column ids instead of skip column ids, and 
updated the test cases to pass in accepted column ids.
hive-461-2009-05-26.patch works for a simple query like "insert overwrite table 
rc2 select rc1.col1, rc1.col2 from rc1"; I have not tested it with complex 
queries.
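
To illustrate the switch from skip column ids to accepted column ids, here is a 
minimal sketch. It is not the code in the patch and not the RCFile API: the 
class AcceptedColumns and the methods toSkipMask and readRow are hypothetical 
names, and it assumes col1 and col2 map to column ids 0 and 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the RCFile API or the code in the patch.
public class AcceptedColumns {

    // Build a per-column skip mask from the accepted column ids.
    // Every column is skipped unless its id is listed as accepted.
    public static boolean[] toSkipMask(int totalColumns, List<Integer> acceptedIds) {
        boolean[] skip = new boolean[totalColumns];
        Arrays.fill(skip, true);
        for (int id : acceptedIds) {
            if (id < 0 || id >= totalColumns) {
                throw new IllegalArgumentException("column id out of range: " + id);
            }
            skip[id] = false;
        }
        return skip;
    }

    // Return only the values of the columns that are not skipped, in column order.
    public static List<String> readRow(String[] row, boolean[] skip) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < row.length; i++) {
            if (!skip[i]) {
                out.add(row[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A query such as "select rc1.col1, rc1.col2 from rc1" needs two columns
        // (assumed here to be ids 0 and 1 of a four-column table).
        boolean[] skip = toSkipMask(4, Arrays.asList(0, 1));
        String[] row = {"a", "b", "c", "d"};
        System.out.println(readRow(row, skip));
    }
}
```

Passing accepted ids means the caller only needs to know which columns the 
query uses, not the full width of the table.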

> Optimize RCFile reading by using column pruning results
> -------------------------------------------------------
>
>                 Key: HIVE-461
>                 URL: https://issues.apache.org/jira/browse/HIVE-461
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 0.4.0
>            Reporter: Zheng Shao
>            Assignee: He Yongqiang
>         Attachments: hive-461-2009-05-26.patch
>
>
> RCFile is a column-based file format introduced in HIVE-352. Column-based 
> storage has shown a better compression ratio. On our internal data set (30 
> columns, most of them short integer strings), we are seeing 
> gzip-compressed RCFile to be 20%+ smaller than gzip-compressed SequenceFile.
> RCFile also has the potential to improve reading efficiency a lot, since 
> it compresses each column separately.
> We should integrate RCFile with the column pruning results from Hive to make 
> reading faster.
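
As a rough sketch of why separate per-column compression helps reading: when 
each column is its own compressed block, a reader can decompress only the 
blocks a query needs. The example below uses java.util.zip from the JDK; the 
class ColumnBlocks and its methods are hypothetical and do not reflect the 
actual RCFile layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; the real RCFile format is more involved.
public class ColumnBlocks {

    // Compress the data of one column into its own gzip block.
    public static byte[] compress(String columnData) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(columnData.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Decompress a single column block back into a string.
    public static String decompress(byte[] block) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(block))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompress only the accepted columns; all other entries stay null
    // and their blocks are never decompressed.
    public static String[] readColumns(byte[][] blocks, int[] acceptedIds) {
        String[] result = new String[blocks.length];
        for (int id : acceptedIds) {
            result[id] = decompress(blocks[id]);
        }
        return result;
    }
}
```

With a row-oriented file the whole record would have to be decompressed 
before any column could be dropped, so the pruning result saves less work.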

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
