Nita Dembla created HIVE-13995:
----------------------------------

             Summary: Hive generates inefficient metastore queries for TPCDS 
tables with 1800+ partitions leading to higher compile time
                 Key: HIVE-13995
                 URL: https://issues.apache.org/jira/browse/HIVE-13995
             Project: Hive
          Issue Type: Bug
          Components: Hive
    Affects Versions: 2.2.0
            Reporter: Nita Dembla


TPCDS fact tables (store_sales, catalog_sales) have 1800+ partitions and when 
the query does not a filter on the partition column, metastore queries 
generated have a large IN clause listing all the partition names. Most RDBMS 
systems have issues optimizing large IN clause and even when a good index plan 
is chosen , comparing to 1800+ string values will not lead to best execution 
time.

When all partitions are chosen, not specifying the partition list and having 
filters only on table and column name will generate the same result set as long 
as there are no concurrent modifications to partition list of the hive table 
(adding/dropping partitions).

For eg: For TPCDS query18, the metastore query gathering partition column 
statistics runs in 0.5 secs in Mysql. Following is output from mysql log

{noformat}
-- Query_time: 0.482063  Lock_time: 0.003037 Rows_sent: 1836  Rows_examined: 
18360
select count("COLUMN_NAME") from "PART_COL_STATS"
 where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 
'catalog_sales' 
 and "COLUMN_NAME" in 
('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
 and "PARTITION_NAME" in 
('cs_sold_date_sk=2450815','cs_sold_date_sk=2450816','cs_sold_date_sk=2450817','cs_sold_date_sk=2450818','cs_sold_date_sk=2450819','cs_sold_date_sk=2450820','cs_sold_date_sk=2450821','cs_sold_date_sk=2450822','cs_sold_date_sk=2450823','cs_sold_date_sk=2450824','cs_sold_date_sk=2450825','cs_sold_date_sk=2450826','cs_sold_date_sk=2450827','cs_sold_date_sk=2450828','cs_sold_date_sk=2450829','cs_sold_date_sk=2450830','cs_sold_date_sk=2450831','cs_sold_date_sk=2450832','cs_sold_date_sk=2450833','cs_sold_date_sk=2450834','cs_sold_date_sk=2450835','cs_sold_date_sk=2450836','cs_sold_date_sk=2450837','cs_sold_date_sk=2450838','cs_sold_date_sk=2450839','cs_sold_date_sk=2450840','cs_sold_date_sk=2450841','cs_sold_date_sk=2450842','cs_sold_date_sk=2450843','cs_sold_date_sk=2450844','cs_sold_date_sk=2450845','cs_sold_date_sk=2450846','cs_sold_date_sk=2450847','cs_sold_date_sk=2450848','cs_sold_date_sk=2450849','cs_sold_date_sk=2450850','cs_sold_date_sk=2450851','cs_sold_date_sk=2450852','cs_sold_date_sk=2450853','cs_sold_date_sk=2450854','cs_sold_date_sk=2450855','cs_sold_date_sk=2450856',...,'cs_sold_date_sk=2452654')
 group by "PARTITION_NAME";
{noformat}

Functionally equivalent query runs in 0.1 seconds
{noformat}
--Query_time: 0.121296  Lock_time: 0.000156 Rows_sent: 1836  Rows_examined: 
18360
select count("COLUMN_NAME") from "PART_COL_STATS"
 where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 
'catalog_sales'  and "COLUMN_NAME" in 
('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
 group by "PARTITION_NAME";

{noformat}

If removing the partition list seems drastic, its also possible to simply list 
the range since hive gets a ordered list of partition names. This performs 
equally well as earlier query

{noformat}
# Query_time: 0.143874  Lock_time: 0.000154 Rows_sent: 1836  Rows_examined: 
18360
SET timestamp=1464014881;
select count("COLUMN_NAME") from "PART_COL_STATS" where "DB_NAME" = 
'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales'  and 
"COLUMN_NAME" in 
('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
  and "PARTITION_NAME" >= 'cs_sold_date_sk=2450815' and "PARTITION_NAME" <= 
'cs_sold_date_sk=2452654' 
group by "PARTITION_NAME";
{noformat}

Another thing to check is the IN clause of column names. Columns in projection 
list of hive query are mentioned here. Not sure if statistics of these columns 
are required for hive query optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to