Yuming Wang created SPARK-35837: ----------------------------------- Summary: Recommendations for Common Query Problems Key: SPARK-35837 URL: https://issues.apache.org/jira/browse/SPARK-35837 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang
Teradata supports [Recommendations for Common Query Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg]. We can implement a similar feature. 1. Detect the most skew values for join. The user decides whether these are needed. 2. Detect the most skew values for window function. The user decides whether these are needed. 3. Detect it can be optimized to bucket read, for example, Analyzer add a cast to bucket column. 4.Recommend the user add a filter condition to the partition column of the partition table. 5. Check the condition of join, for example, the result of cast string to double may be incorrect. For example: {code:sql} 0: jdbc:hive2://localhost:10000/default> EXPLAIN OPTIMIZE 0: jdbc:hive2://localhost:10000/default> SELECT a.*, 0: jdbc:hive2://localhost:10000/default> CASE 0: jdbc:hive2://localhost:10000/default> WHEN ( NOT ( a.exclude = 1 0: jdbc:hive2://localhost:10000/default> AND a.cobrand = 6 0: jdbc:hive2://localhost:10000/default> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) 0: jdbc:hive2://localhost:10000/default> AND ( a.valid_page_count = 1 ) THEN 1 0: jdbc:hive2://localhost:10000/default> ELSE 0 0: jdbc:hive2://localhost:10000/default> END AS is_singlepage, 0: jdbc:hive2://localhost:10000/default> ca.bsns_vrtcl_name 0: jdbc:hive2://localhost:10000/default> FROM (SELECT * 0: jdbc:hive2://localhost:10000/default> FROM (SELECT *, 0: jdbc:hive2://localhost:10000/default> 'VI' AS page_type 0: jdbc:hive2://localhost:10000/default> FROM tbl1 0: jdbc:hive2://localhost:10000/default> UNION 0: jdbc:hive2://localhost:10000/default> SELECT *, 0: jdbc:hive2://localhost:10000/default> 'SRP' AS page_type 0: jdbc:hive2://localhost:10000/default> FROM tbl2 0: jdbc:hive2://localhost:10000/default> UNION 0: jdbc:hive2://localhost:10000/default> SELECT *, 0: jdbc:hive2://localhost:10000/default> 'My Garage' AS page_type 0: jdbc:hive2://localhost:10000/default> FROM tbl3 0: jdbc:hive2://localhost:10000/default> UNION 0: jdbc:hive2://localhost:10000/default> SELECT *, 0: jdbc:hive2://localhost:10000/default> 'Motors Homepage' AS page_type 0: jdbc:hive2://localhost:10000/default> FROM tbl4) t 0: jdbc:hive2://localhost:10000/default> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND ( 0: jdbc:hive2://localhost:10000/default> CURRENT_DATE() - 10 )) a 0: jdbc:hive2://localhost:10000/default> LEFT JOIN (SELECT item_id, 0: jdbc:hive2://localhost:10000/default> item_site_id, 0: jdbc:hive2://localhost:10000/default> auct_end_dt, 0: jdbc:hive2://localhost:10000/default> leaf_categ_id 0: jdbc:hive2://localhost:10000/default> FROM tbl5 0: jdbc:hive2://localhost:10000/default> WHERE auct_end_dt >= ( '2020-01-01' )) itm 0: jdbc:hive2://localhost:10000/default> ON a.item_id = itm.item_id 0: jdbc:hive2://localhost:10000/default> LEFT JOIN tbl6 ca 0: jdbc:hive2://localhost:10000/default> ON itm.leaf_categ_id = ca.leaf_categ_id 0: jdbc:hive2://localhost:10000/default> AND itm.item_site_id = ca.site_id; +-----------------------------------------------------------------------------------------------------------------------------------------+--+ | result | +-----------------------------------------------------------------------------------------------------------------------------------------+--+ | 1. Detect the most skew values for join | | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) | | table: tbl5 | | leaf_categ_id, item_site_id, count | | 171243, 0, 115412614 | | 176984, 3, 81003252 | | 176985, 3, 75035585 | | 179680, 77, 71831618 | | 171243, 3, 69821716 | | Check join: Join LeftOuter, (item_id#3244 = cast(item_id#1439 as decimal(20,0))) | | table: tbl1 | | item_id, count | | null, 254738836 | | 232644544775, 35530 | | 372028681865, 24189 | | 124014973844, 21725 | | 303476035378, 17075 | | | | 2. Detect the most skew values for window function | | | | 3. Detect bucket read | | Check join: SortMergeJoin [item_id#3244], [cast(item_id#1439 as decimal(20,0))], LeftOuter | | The data type do not match for bucket read, consider cast `item_id` to decimal(18,0). | | | | 4. Add a filter condition to partition column | | | | 5. Check the dangerous join condition | | | +-----------------------------------------------------------------------------------------------------------------------------------------+--+ 27 rows selected (152.685 seconds) {code} The optimized query is: {code:sql} SELECT a.*, CASE WHEN ( NOT ( a.exclude = 1 AND a.cobrand = 6 AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) AND ( a.valid_page_count = 1 ) THEN 1 ELSE 0 END AS is_singlepage, ca.bsns_vrtcl_name FROM (SELECT * FROM (SELECT *, 'VI' AS page_type FROM tbl1 where item_id is not null -- Checked with user, the null values is not needed. Add a filter condition. UNION SELECT *, 'SRP' AS page_type FROM tbl2 UNION SELECT *, 'My Garage' AS page_type FROM tbl3 UNION SELECT *, 'Motors Homepage' AS page_type FROM tbl4) t WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND ( CURRENT_DATE() - 10 )) a LEFT JOIN (SELECT item_id, item_site_id, auct_end_dt, leaf_categ_id FROM tbl5 WHERE auct_end_dt >= ( '2020-01-01' )) itm ON cast(a.item_id as decimal(18, 0)) = cast(itm.item_id as decimal(18, 0)) -- Checked with user. cast to decimal(18, 0) is safe. LEFT JOIN tbl6 ca ON itm.leaf_categ_id = ca.leaf_categ_id AND itm.item_site_id = ca.site_id {code} With these optimizations, the query execution time is reduced from 42 minutes to 6.3 minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org