[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771847#comment-13771847 ]
Yin Huai commented on HIVE-4113:
--------------------------------

READ_ALL_COLUMNS and READ_ALL_COLUMNS_DEFAULT are mainly created for HCat, because I think it is a burden on users if they have to be aware of ColumnProjectionUtils and use it every time. So, through HCat, if users do not use ColumnProjectionUtils to set the needed columns, we will read all columns. If we set READ_ALL_COLUMNS_DEFAULT=false, no column will be read when a user does not use ColumnProjectionUtils.

In Hive, if we get rid of the column-pruning flag, the list of neededColumnIDs in the TableScanOperator (TS) will not be null. Thus, in Hive, we will always set READ_ALL_COLUMNS to false (the .2 patch has an issue with this... I will fix it later). In summary, in Hive, we use neededColumnIDs in TS as the only way to tell an underlying record reader what to read. If neededColumnIDs is an empty list, we know that no column is needed. Otherwise, we will read the columns specified in neededColumnIDs (if we have select * in a sub-query, neededColumnIDs should be populated to include all columns).

In HCat, if a user wants to use the MapReduce interface, he or she has two ways to tell what columns are needed:
1) The user does nothing. In this case, we will read all columns.
2) The user uses utility functions in ColumnProjectionUtils (e.g. setReadColumnIDs) to specify the needed columns. In this case, READ_ALL_COLUMNS will be set to false and we only read the columns specified in READ_COLUMN_IDS_CONF_STR.

I hope what I am proposing makes sense.
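The two HCat paths above can be sketched with a minimal, self-contained model. Plain java.util.Properties stands in for Hadoop's Configuration, and the helper methods are hypothetical illustrations of the described semantics, not Hive's actual ColumnProjectionUtils; the key names mirror the READ_ALL_COLUMNS and READ_COLUMN_IDS_CONF_STR configuration keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for Hive's ColumnProjectionUtils, modeling the
// semantics described above: if the caller never sets column IDs, all
// columns are read (READ_ALL_COLUMNS_DEFAULT=true); once setReadColumnIDs
// is called, READ_ALL_COLUMNS flips to false and only the listed columns
// are read.
public class ProjectionModel {
    // Key names mirror Hive's conf keys; treat them as illustrative.
    static final String READ_ALL_COLUMNS = "hive.io.file.read.all.columns";
    static final String READ_COLUMN_IDS = "hive.io.file.readcolumn.ids";

    static void setReadColumnIDs(Properties conf, List<Integer> ids) {
        StringBuilder sb = new StringBuilder();
        for (int id : ids) {
            if (sb.length() > 0) sb.append(',');
            sb.append(id);
        }
        conf.setProperty(READ_COLUMN_IDS, sb.toString());
        conf.setProperty(READ_ALL_COLUMNS, "false");
    }

    // What a record reader would consult: null means "read everything".
    static List<Integer> columnsToRead(Properties conf) {
        // Absent key defaults to true, i.e. read all columns.
        boolean readAll =
            Boolean.parseBoolean(conf.getProperty(READ_ALL_COLUMNS, "true"));
        if (readAll) return null;
        String ids = conf.getProperty(READ_COLUMN_IDS, "");
        List<Integer> result = new ArrayList<>();
        if (!ids.isEmpty()) {
            for (String s : ids.split(",")) result.add(Integer.parseInt(s));
        }
        return result; // possibly empty: no column is needed
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Path 1: user does nothing -> read all columns.
        if (columnsToRead(conf) != null) throw new AssertionError();
        // Path 2: user asks for columns 0 and 3 -> read only those.
        setReadColumnIDs(conf, Arrays.asList(0, 3));
        if (!columnsToRead(conf).equals(Arrays.asList(0, 3)))
            throw new AssertionError();
        // Empty list -> READ_ALL_COLUMNS=false and no needed columns.
        setReadColumnIDs(conf, Collections.<Integer>emptyList());
        if (!columnsToRead(conf).isEmpty()) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Running main prints "ok" if all three paths behave as described.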
I welcome any suggestions :)

> Optimize select count(1) with RCFile and Orc
> --------------------------------------------
>
>                 Key: HIVE-4113
>                 URL: https://issues.apache.org/jira/browse/HIVE-4113
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>            Reporter: Gopal V
>            Assignee: Yin Huai
>             Fix For: 0.12.0
>
>         Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, HIVE-4113.patch, HIVE-4113.patch
>
>
> select count(1) loads up every column & every row when used with RCFile.
> "select count(1) from store_sales_10_rc" gives
> {code}
> Job 0: Map: 5 Reduce: 1 Cumulative CPU: 31.73 sec HDFS Read: 234914410 HDFS Write: 8 SUCCESS
> {code}
> Whereas "select count(ss_sold_date_sk) from store_sales_10_rc;" reads far less:
> {code}
> Job 0: Map: 5 Reduce: 1 Cumulative CPU: 29.75 sec HDFS Read: 28145994 HDFS Write: 8 SUCCESS
> {code}
> which is 11% of the data size read by the COUNT(1).
> This was tracked down to the following code in RCFile.java:
> {code}
>     } else {
>       // TODO: if no column name is specified e.g., in select count(1) from tt;
>       // skip all columns, this should be distinguished from the case:
>       // select * from tt;
>       for (int i = 0; i < skippedColIDs.length; i++) {
>         skippedColIDs[i] = false;
>       }
> {code}
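The TODO in the quoted snippet marks the bug: when no column is projected, RCFile clears every entry of skippedColIDs, i.e. it reads every column even for count(1). A minimal sketch of the intended distinction, under the semantics Yin describes (null column list means select *, empty list means nothing is needed); the method and surrounding code here are illustrative, not the actual RCFile.java:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch (not the real RCFile.java): compute which columns
// to skip from the projected column IDs, distinguishing "no projection
// info" (read all) from "empty projection" (skip all, e.g. count(1)).
public class SkippedColumns {
    static boolean[] computeSkippedColIDs(List<Integer> readColIDs, int columnCount) {
        boolean[] skipped = new boolean[columnCount];
        if (readColIDs == null) {
            // select * (or no projection info at all): skip nothing.
            Arrays.fill(skipped, false);
        } else {
            // Skip everything, then unskip the requested columns.
            // For count(1), readColIDs is empty, so all columns stay skipped.
            Arrays.fill(skipped, true);
            for (int id : readColIDs) {
                skipped[id] = false;
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // select * : nothing skipped
        if (computeSkippedColIDs(null, 3)[0]) throw new AssertionError();
        // select count(1): everything skipped
        if (!computeSkippedColIDs(Collections.<Integer>emptyList(), 3)[2])
            throw new AssertionError();
        // select of column 1 only: columns 0 and 2 skipped
        boolean[] s = computeSkippedColIDs(Arrays.asList(1), 3);
        if (!s[0] || s[1] || !s[2]) throw new AssertionError();
        System.out.println("ok");
    }
}
```

This is the distinction the patch encodes through neededColumnIDs plus the READ_ALL_COLUMNS flag, rather than overloading a single "no columns specified" state for both cases.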