Yuming Wang created SPARK-35837:
-----------------------------------

             Summary: Recommendations for Common Query Problems
                 Key: SPARK-35837
                 URL: https://issues.apache.org/jira/browse/SPARK-35837
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Yuming Wang


Teradata supports [Recommendations for Common Query 
Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].

We can implement a similar feature.
 1. Detect the most skew values for join. The user decides whether these are 
needed.
 2. Detect the most skew values for window function. The user decides whether 
these are needed.
 3. Detect it can be optimized to bucket read, for example, Analyzer add a cast 
to bucket column.
 4.Recommend the user add a filter condition to the partition column of the 
partition table.
 5. Check the condition of join, for example, the result of cast string to 
double may be incorrect.

For example:
{code:sql}
0: jdbc:hive2://localhost:10000/default> EXPLAIN OPTIMIZE
0: jdbc:hive2://localhost:10000/default> SELECT a.*,
0: jdbc:hive2://localhost:10000/default>        CASE
0: jdbc:hive2://localhost:10000/default>          WHEN ( NOT ( a.exclude = 1
0: jdbc:hive2://localhost:10000/default>                       AND a.cobrand = 6
0: jdbc:hive2://localhost:10000/default>                       AND 
a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
0: jdbc:hive2://localhost:10000/default>               AND ( a.valid_page_count 
= 1 ) THEN 1
0: jdbc:hive2://localhost:10000/default>          ELSE 0
0: jdbc:hive2://localhost:10000/default>        END AS is_singlepage,
0: jdbc:hive2://localhost:10000/default>        ca.bsns_vrtcl_name
0: jdbc:hive2://localhost:10000/default> FROM   (SELECT *
0: jdbc:hive2://localhost:10000/default>         FROM   (SELECT *,
0: jdbc:hive2://localhost:10000/default>                        'VI' AS 
page_type
0: jdbc:hive2://localhost:10000/default>                 FROM   tbl1
0: jdbc:hive2://localhost:10000/default>                 UNION
0: jdbc:hive2://localhost:10000/default>                 SELECT *,
0: jdbc:hive2://localhost:10000/default>                        'SRP' AS 
page_type
0: jdbc:hive2://localhost:10000/default>                 FROM   tbl2
0: jdbc:hive2://localhost:10000/default>                 UNION
0: jdbc:hive2://localhost:10000/default>                 SELECT *,
0: jdbc:hive2://localhost:10000/default>                        'My Garage' AS 
page_type
0: jdbc:hive2://localhost:10000/default>                 FROM   tbl3
0: jdbc:hive2://localhost:10000/default>                 UNION
0: jdbc:hive2://localhost:10000/default>                 SELECT *,
0: jdbc:hive2://localhost:10000/default>                        'Motors 
Homepage' AS page_type
0: jdbc:hive2://localhost:10000/default>                 FROM   tbl4) t
0: jdbc:hive2://localhost:10000/default>         WHERE  session_start_dt 
BETWEEN ( '2020-01-01' ) AND (
0: jdbc:hive2://localhost:10000/default>                                        
 CURRENT_DATE() - 10 )) a
0: jdbc:hive2://localhost:10000/default>        LEFT JOIN (SELECT item_id,
0: jdbc:hive2://localhost:10000/default>                          item_site_id,
0: jdbc:hive2://localhost:10000/default>                          auct_end_dt,
0: jdbc:hive2://localhost:10000/default>                          leaf_categ_id
0: jdbc:hive2://localhost:10000/default>                   FROM   tbl5
0: jdbc:hive2://localhost:10000/default>                   WHERE  auct_end_dt 
>= ( '2020-01-01' )) itm
0: jdbc:hive2://localhost:10000/default>               ON a.item_id = 
itm.item_id
0: jdbc:hive2://localhost:10000/default>        LEFT JOIN tbl6 ca
0: jdbc:hive2://localhost:10000/default>               ON itm.leaf_categ_id = 
ca.leaf_categ_id
0: jdbc:hive2://localhost:10000/default>                  AND itm.item_site_id 
= ca.site_id;
+-----------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                 result        
                                                          |
+-----------------------------------------------------------------------------------------------------------------------------------------+--+
| 1. Detect the most skew values for join                                       
                                                          |
|   Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND 
(cast(item_site_id#1444 as decimal(9,0)) = site_id#3022))  |
|     table: tbl5                                                               
                                                          |
|       leaf_categ_id, item_site_id, count                                      
                                                          |
|       171243, 0, 115412614                                                    
                                                          |
|       176984, 3, 81003252                                                     
                                                          |
|       176985, 3, 75035585                                                     
                                                          |
|       179680, 77, 71831618                                                    
                                                          |
|       171243, 3, 69821716                                                     
                                                          |
|   Check join: Join LeftOuter, (item_id#3244 = cast(item_id#1439 as 
decimal(20,0)))                                                      |
|     table: tbl1                                                               
                                                          |
|       item_id, count                                                          
                                                          |
|       null, 254738836                                                         
                                                          |
|       232644544775, 35530                                                     
                                                          |
|       372028681865, 24189                                                     
                                                          |
|       124014973844, 21725                                                     
                                                          |
|       303476035378, 17075                                                     
                                                          |
|                                                                               
                                                          |
| 2. Detect the most skew values for window function                            
                                                          |
|                                                                               
                                                          |
| 3. Detect bucket read                                                         
                                                             |
|   Check join: SortMergeJoin [item_id#3244], [cast(item_id#1439 as 
decimal(20,0))], LeftOuter                                            |
|     The data type do not match for bucket read, consider cast `item_id` to 
decimal(18,0).                                               |
|                                                                               
                                                          |
| 4. Add a filter condition to partition column                                 
                                                          |
|                                                                               
                                                          |
| 5. Check the dangerous join condition                                         
                                                                        |
|                                                                               
                                                          |
+-----------------------------------------------------------------------------------------------------------------------------------------+--+
27 rows selected (152.685 seconds)
{code}
The optimized query is:
{code:sql}
SELECT a.*,
     CASE
       WHEN ( NOT ( a.exclude = 1
                    AND a.cobrand = 6
                    AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
            AND ( a.valid_page_count = 1 ) THEN 1
       ELSE 0
     END AS is_singlepage,
     ca.bsns_vrtcl_name
FROM   (SELECT *
      FROM   (SELECT *,
                     'VI' AS page_type
              FROM   tbl1 where item_id is not null -- Checked with user, the 
null values is not needed. Add a filter condition.
              UNION
              SELECT *,
                     'SRP' AS page_type
              FROM   tbl2
              UNION
              SELECT *,
                     'My Garage' AS page_type
              FROM   tbl3
              UNION
              SELECT *,
                     'Motors Homepage' AS page_type
              FROM   tbl4) t
      WHERE  session_start_dt BETWEEN ( '2020-01-01' ) AND (
                                      CURRENT_DATE() - 10 )) a
     LEFT JOIN (SELECT item_id,
                       item_site_id,
                       auct_end_dt,
                       leaf_categ_id
                FROM   tbl5
                WHERE  auct_end_dt >= ( '2020-01-01' )) itm
            ON cast(a.item_id as decimal(18, 0)) = cast(itm.item_id as 
decimal(18, 0)) -- Checked with user. cast to decimal(18, 0) is safe.
     LEFT JOIN tbl6 ca
            ON itm.leaf_categ_id = ca.leaf_categ_id
               AND itm.item_site_id = ca.site_id 
{code}
 
With these optimizations, the query execution time is reduced from 42 minutes 
to 6.3 minutes.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to