[ https://issues.apache.org/jira/browse/HIVE-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376104#comment-15376104 ]
Ashutosh Chauhan commented on HIVE-13995: ----------------------------------------- This will be much more useful if we can do both tasks (retrieving partitions & stats) in a single mysql query. > Hive generates inefficient metastore queries for TPCDS tables with 1800+ > partitions leading to higher compile time > ------------------------------------------------------------------------------------------------------------------ > > Key: HIVE-13995 > URL: https://issues.apache.org/jira/browse/HIVE-13995 > Project: Hive > Issue Type: Bug > Components: Hive > Affects Versions: 2.2.0 > Reporter: Nita Dembla > Assignee: Hari Sankar Sivarama Subramaniyan > Attachments: HIVE-13995.1.patch, HIVE-13995.2.patch > > > TPCDS fact tables (store_sales, catalog_sales) have 1800+ partitions and when > the query does not a filter on the partition column, metastore queries > generated have a large IN clause listing all the partition names. Most RDBMS > systems have issues optimizing large IN clause and even when a good index > plan is chosen , comparing to 1800+ string values will not lead to best > execution time. > When all partitions are chosen, not specifying the partition list and having > filters only on table and column name will generate the same result set as > long as there are no concurrent modifications to partition list of the hive > table (adding/dropping partitions). > For eg: For TPCDS query18, the metastore query gathering partition column > statistics runs in 0.5 secs in Mysql. Following is output from mysql log > {noformat} > -- Query_time: 0.482063 Lock_time: 0.003037 Rows_sent: 1836 Rows_examined: > 18360 > select count("COLUMN_NAME") from "PART_COL_STATS" > where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = > 'catalog_sales' > and "COLUMN_NAME" in > ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit') > and "PARTITION_NAME" in > ('cs_sold_date_sk=2450815','cs_sold_date_sk=2450816','cs_sold_date_sk=2450817','cs_sold_date_sk=2450818','cs_sold_date_sk=2450819','cs_sold_date_sk=2450820','cs_sold_date_sk=2450821','cs_sold_date_sk=2450822','cs_sold_date_sk=2450823','cs_sold_date_sk=2450824','cs_sold_date_sk=2450825','cs_sold_date_sk=2450826','cs_sold_date_sk=2450827','cs_sold_date_sk=2450828','cs_sold_date_sk=2450829','cs_sold_date_sk=2450830','cs_sold_date_sk=2450831','cs_sold_date_sk=2450832','cs_sold_date_sk=2450833','cs_sold_date_sk=2450834','cs_sold_date_sk=2450835','cs_sold_date_sk=2450836','cs_sold_date_sk=2450837','cs_sold_date_sk=2450838','cs_sold_date_sk=2450839','cs_sold_date_sk=2450840','cs_sold_date_sk=2450841','cs_sold_date_sk=2450842','cs_sold_date_sk=2450843','cs_sold_date_sk=2450844','cs_sold_date_sk=2450845','cs_sold_date_sk=2450846','cs_sold_date_sk=2450847','cs_sold_date_sk=2450848','cs_sold_date_sk=2450849','cs_sold_date_sk=2450850','cs_sold_date_sk=2450851','cs_sold_date_sk=2450852','cs_sold_date_sk=2450853','cs_sold_date_sk=2450854','cs_sold_date_sk=2450855','cs_sold_date_sk=2450856',...,'cs_sold_date_sk=2452654') > group by "PARTITION_NAME"; > {noformat} > Functionally equivalent query runs in 0.1 seconds > {noformat} > --Query_time: 0.121296 Lock_time: 0.000156 Rows_sent: 1836 Rows_examined: > 18360 > select count("COLUMN_NAME") from "PART_COL_STATS" > where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = > 'catalog_sales' and "COLUMN_NAME" in > ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit') > group by "PARTITION_NAME"; > {noformat} > If removing the partition list seems drastic, its also possible to simply > list the range since hive gets a ordered list of partition names. This > performs equally well as earlier query > {noformat} > # Query_time: 0.143874 Lock_time: 0.000154 Rows_sent: 1836 Rows_examined: > 18360 > SET timestamp=1464014881; > select count("COLUMN_NAME") from "PART_COL_STATS" where "DB_NAME" = > 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales' and > "COLUMN_NAME" in > ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit') > and "PARTITION_NAME" >= 'cs_sold_date_sk=2450815' and "PARTITION_NAME" <= > 'cs_sold_date_sk=2452654' > group by "PARTITION_NAME"; > {noformat} > Another thing to check is the IN clause of column names. Columns in > projection list of hive query are mentioned here. Not sure if statistics of > these columns are required for hive query optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)