Nita Dembla created HIVE-13995:
----------------------------------
Summary: Hive generates inefficient metastore queries for TPCDS
tables with 1800+ partitions leading to higher compile time
Key: HIVE-13995
URL: https://issues.apache.org/jira/browse/HIVE-13995
Project: Hive
Issue Type: Bug
Components: Hive
Affects Versions: 2.2.0
Reporter: Nita Dembla
TPCDS fact tables (store_sales, catalog_sales) have 1800+ partitions and when
the query does not a filter on the partition column, metastore queries
generated have a large IN clause listing all the partition names. Most RDBMS
systems have issues optimizing large IN clause and even when a good index plan
is chosen , comparing to 1800+ string values will not lead to best execution
time.
When all partitions are chosen, not specifying the partition list and having
filters only on table and column name will generate the same result set as long
as there are no concurrent modifications to partition list of the hive table
(adding/dropping partitions).
For eg: For TPCDS query18, the metastore query gathering partition column
statistics runs in 0.5 secs in Mysql. Following is output from mysql log
{noformat}
-- Query_time: 0.482063 Lock_time: 0.003037 Rows_sent: 1836 Rows_examined:
18360
select count("COLUMN_NAME") from "PART_COL_STATS"
where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" =
'catalog_sales'
and "COLUMN_NAME" in
('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
and "PARTITION_NAME" in
('cs_sold_date_sk=2450815','cs_sold_date_sk=2450816','cs_sold_date_sk=2450817','cs_sold_date_sk=2450818','cs_sold_date_sk=2450819','cs_sold_date_sk=2450820','cs_sold_date_sk=2450821','cs_sold_date_sk=2450822','cs_sold_date_sk=2450823','cs_sold_date_sk=2450824','cs_sold_date_sk=2450825','cs_sold_date_sk=2450826','cs_sold_date_sk=2450827','cs_sold_date_sk=2450828','cs_sold_date_sk=2450829','cs_sold_date_sk=2450830','cs_sold_date_sk=2450831','cs_sold_date_sk=2450832','cs_sold_date_sk=2450833','cs_sold_date_sk=2450834','cs_sold_date_sk=2450835','cs_sold_date_sk=2450836','cs_sold_date_sk=2450837','cs_sold_date_sk=2450838','cs_sold_date_sk=2450839','cs_sold_date_sk=2450840','cs_sold_date_sk=2450841','cs_sold_date_sk=2450842','cs_sold_date_sk=2450843','cs_sold_date_sk=2450844','cs_sold_date_sk=2450845','cs_sold_date_sk=2450846','cs_sold_date_sk=2450847','cs_sold_date_sk=2450848','cs_sold_date_sk=2450849','cs_sold_date_sk=2450850','cs_sold_date_sk=2450851','cs_sold_date_sk=2450852','cs_sold_date_sk=2450853','cs_sold_date_sk=2450854','cs_sold_date_sk=2450855','cs_sold_date_sk=2450856',...,'cs_sold_date_sk=2452654')
group by "PARTITION_NAME";
{noformat}
Functionally equivalent query runs in 0.1 seconds
{noformat}
--Query_time: 0.121296 Lock_time: 0.000156 Rows_sent: 1836 Rows_examined:
18360
select count("COLUMN_NAME") from "PART_COL_STATS"
where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" =
'catalog_sales' and "COLUMN_NAME" in
('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
group by "PARTITION_NAME";
{noformat}
If removing the partition list seems drastic, its also possible to simply list
the range since hive gets a ordered list of partition names. This performs
equally well as earlier query
{noformat}
# Query_time: 0.143874 Lock_time: 0.000154 Rows_sent: 1836 Rows_examined:
18360
SET timestamp=1464014881;
select count("COLUMN_NAME") from "PART_COL_STATS" where "DB_NAME" =
'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales' and
"COLUMN_NAME" in
('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
and "PARTITION_NAME" >= 'cs_sold_date_sk=2450815' and "PARTITION_NAME" <=
'cs_sold_date_sk=2452654'
group by "PARTITION_NAME";
{noformat}
Another thing to check is the IN clause of column names. Columns in projection
list of hive query are mentioned here. Not sure if statistics of these columns
are required for hive query optimization.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)