[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771847#comment-13771847 ]

Yin Huai commented on HIVE-4113:
--------------------------------

READ_ALL_COLUMNS and READ_ALL_COLUMNS_DEFAULT are mainly created for HCat, 
because I think it is a burden on users if they have to be aware of 
ColumnProjectionUtils and use it every time. So, through HCat, if users do not 
use ColumnProjectionUtils to set the needed columns, we will read all columns. 
If we set READ_ALL_COLUMNS_DEFAULT=false, no column will be read when a user 
does not use ColumnProjectionUtils.
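
To illustrate the intent, here is a minimal sketch of how a record reader could 
consult the proposed flag. This is my own illustration, not the patch itself; 
the conf key string and class name are assumptions.
{code}
// Hedged sketch: honoring the proposed READ_ALL_COLUMNS /
// READ_ALL_COLUMNS_DEFAULT pair. The key string below is an assumption.
import org.apache.hadoop.conf.Configuration;

public class ReadAllColumnsCheck {
  static final String READ_ALL_COLUMNS = "hive.io.file.read.all.columns"; // assumed key
  static final boolean READ_ALL_COLUMNS_DEFAULT = true;

  /** Returns true when every column should be materialized. */
  static boolean readAllColumns(Configuration conf) {
    // Defaulting to true means an HCat user who never touches
    // ColumnProjectionUtils still gets all columns back.
    return conf.getBoolean(READ_ALL_COLUMNS, READ_ALL_COLUMNS_DEFAULT);
  }
}
{code}
Flipping READ_ALL_COLUMNS_DEFAULT to false is what would make a do-nothing HCat 
job read no columns at all.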

In Hive, if we get rid of the column-pruning flag, the list of neededColumnIDs 
in TS will not be null. Thus, in Hive, we will always set READ_ALL_COLUMNS to 
false (the .2 patch has an issue with this... I will fix it later).

In summary, in Hive, we use neededColumnIDs in TS as the only way to tell an 
underlying record reader what to read. If neededColumnIDs is an empty list, we 
know that no column is needed. Otherwise, we read the columns specified in 
neededColumnIDs (if we have select * in a sub-query, neededColumnIDs should be 
populated to include all columns).
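
As a sketch of this contract from the reader's side (my reading of the 
proposal, not the actual patch, assuming the existing 
ColumnProjectionUtils.getReadColumnIDs helper):
{code}
// Sketch only: build a per-column skip mask from the pushed-down projection.
// Empty id list -> skip every column body (count(1)); select * arrives as a
// fully populated list, so nothing is skipped.
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;

public class ProjectionDecision {
  static boolean[] buildSkipMask(Configuration conf, int columnCount) {
    boolean[] skipped = new boolean[columnCount];
    Arrays.fill(skipped, true);               // start by skipping every column
    List<Integer> needed = ColumnProjectionUtils.getReadColumnIDs(conf);
    for (int id : needed) {
      if (id >= 0 && id < columnCount) {
        skipped[id] = false;                  // keep only the requested columns
      }
    }
    return skipped;
  }
}
{code}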

In HCat, if a user wants to use the MapReduce interface, he or she has two ways 
to indicate which columns are needed. 1) The user does nothing; in this case, 
we will read all columns. 2) The user uses the utility functions in 
ColumnProjectionUtils (e.g. setReadColumnIDs) to specify the needed columns; in 
this case, READ_ALL_COLUMNS will be set to false and we only read the columns 
specified in READ_COLUMN_IDS_CONF_STR.
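
For option 2, a hedged usage sketch (job name and column indices are made up, 
a Hadoop 2-style Job API is assumed, and the exact ColumnProjectionUtils 
signature may differ across Hive versions):
{code}
// Option 1 is simply not calling anything below: with READ_ALL_COLUMNS
// defaulting to true, the job reads every column.
// Option 2: narrow the read to columns 0 and 3 before submitting the job.
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
import org.apache.hadoop.mapreduce.Job;

public class ProjectTwoColumnsJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hcat-projection-example");

    ArrayList<Integer> needed = new ArrayList<Integer>();
    needed.add(0);
    needed.add(3);
    // Populates READ_COLUMN_IDS_CONF_STR; per the proposal above,
    // READ_ALL_COLUMNS is then false and only these columns are read.
    ColumnProjectionUtils.setReadColumnIDs(job.getConfiguration(), needed);

    // ... set InputFormat/Mapper/Reducer as usual, then job.waitForCompletion(true);
  }
}
{code}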

I hope what I am proposing makes sense. Any suggestions are welcome :)
                
> Optimize select count(1) with RCFile and Orc
> --------------------------------------------
>
>                 Key: HIVE-4113
>                 URL: https://issues.apache.org/jira/browse/HIVE-4113
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>            Reporter: Gopal V
>            Assignee: Yin Huai
>             Fix For: 0.12.0
>
>         Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
> HIVE-4113.patch, HIVE-4113.patch
>
>
> select count(1) loads up every column & every row when used with RCFile.
> "select count(1) from store_sales_10_rc" gives
> {code}
> Job 0: Map: 5  Reduce: 1   Cumulative CPU: 31.73 sec   HDFS Read: 234914410 
> HDFS Write: 8 SUCCESS
> {code}
> Whereas "select count(ss_sold_date_sk) from store_sales_10_rc;" reads far less:
> {code}
> Job 0: Map: 5  Reduce: 1   Cumulative CPU: 29.75 sec   HDFS Read: 28145994 
> HDFS Write: 8 SUCCESS
> {code}
> That is about 11% of the data read by the count(1) query.
> This was tracked down to the following code in RCFile.java:
> {code}
>       } else {
>         // TODO: if no column name is specified e.g, in select count(1) from 
> tt;
>         // skip all columns, this should be distinguished from the case:
>         // select * from tt;
>         for (int i = 0; i < skippedColIDs.length; i++) {
>           skippedColIDs[i] = false;
>         }
> {code}

