[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080 ]
caican edited comment on SPARK-43526 at 5/19/23 2:48 AM: --------------------------------------------------------- I find that the shuffle hash join is slower than the sort merge join because the sort node is added after two shuffle hash joins, and the number of data bars of the two shuffle hash joins expands a lot. I overwrote q95, after closing shuffle hash join and adding sort operation after corresponding join nodes, q95 execution also became slow. 1. The execution plan before I rewrite q95 sql is as follows: *Sort merge join* !sort1.png|width=926,height=473! *shuffle hash join* !shuffle1.png|width=921,height=441! 2. The execution plan after I rewrite q95 sql is as follows: *sort merge join* !sort2.png|width=936,height=496! The sort operation was added after the corresponding join nodes, and the execution was slower than shuffle hash join. And it can be confirmed that the performance deteriorates after the shuffle hash join function is enabled because a large amount of data is sorted. !image-2023-05-19-10-43-51-747.png|width=708,height=38! *q95 sql with sort operation added* {code:java} set spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts"; set spark.sql.execution.removeRedundantSorts=false; WITH ws_wh AS ( SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2 FROM web_sales ws1, web_sales ws2 WHERE ws1.ws_order_number=ws2.ws_order_number AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk SORT BY ws1.ws_order_number ), tmp1 as ( SELECT ws_order_number FROM ws_wh ), tmp2 as ( SELECT wr_order_number FROM web_returns, ws_wh WHERE wr_order_number=ws_wh.ws_order_number SORT BY wr_order_number ) SELECT count(DISTINCT ws_order_number) AS `order count `, sum(ws_ext_ship_cost) AS `total shipping cost `, sum(ws_net_profit) AS `total net profit ` FROM web_sales ws1 left semi join tmp1 on ws1.ws_order_number=tmp1.ws_order_number left semi join tmp2 on ws1.ws_order_number=tmp2.wr_order_number join date_dim on ws1.ws_ship_date_sk=date_dim.d_date_sk join customer_address on ws1.ws_ship_addr_sk=customer_address.ca_address_sk join web_site on ws1.ws_web_site_sk=web_site.web_site_sk WHERE d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE)+INTERVAL 60 DAY) AND ws1.ws_ship_date_sk=d_date_sk AND ws1.ws_ship_addr_sk=ca_address_sk AND ca_state='IL' AND ws1.ws_web_site_sk=web_site_sk AND web_company_name='pri' ORDER BY count(DISTINCT ws_order_number) LIMIT 100{code} was (Author: JIRAUSER280464): I find that the shuffle hash join is slower than the sort merge join because the sort node is added after two shuffle hash joins, and the number of data bars of the two shuffle hash joins expands a lot. I overwrote q95, after closing shuffle hash join and adding sort operation after corresponding join nodes, q95 execution also became slow. 1. The execution plan before I rewrite q95 sql is as follows: ** *Sort merge join* !sort1.png|width=926,height=473! *shuffle hash join* !shuffle1.png|width=921,height=441! 2. The execution plan after I rewrite q95 sql is as follows: *sort merge join* !sort2.png|width=936,height=496! The sort operation was added after the corresponding join nodes, and the execution was slower than shuffle hash join. And it can be confirmed that the performance deteriorates after the shuffle hash join function is enabled because a large amount of data is sorted. !image-2023-05-19-10-43-51-747.png|width=932,height=50! *q95 sql with sort operation added* {code:java} set spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts"; set spark.sql.execution.removeRedundantSorts=false; WITH ws_wh AS ( SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2 FROM web_sales ws1, web_sales ws2 WHERE ws1.ws_order_number=ws2.ws_order_number AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk SORT BY ws1.ws_order_number ), tmp1 as ( SELECT ws_order_number FROM ws_wh ), tmp2 as ( SELECT wr_order_number FROM web_returns, ws_wh WHERE wr_order_number=ws_wh.ws_order_number SORT BY wr_order_number ) SELECT count(DISTINCT ws_order_number) AS `order count `, sum(ws_ext_ship_cost) AS `total shipping cost `, sum(ws_net_profit) AS `total net profit ` FROM web_sales ws1 left semi join tmp1 on ws1.ws_order_number=tmp1.ws_order_number left semi join tmp2 on ws1.ws_order_number=tmp2.wr_order_number join date_dim on ws1.ws_ship_date_sk=date_dim.d_date_sk join customer_address on ws1.ws_ship_addr_sk=customer_address.ca_address_sk join web_site on ws1.ws_web_site_sk=web_site.web_site_sk WHERE d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE)+INTERVAL 60 DAY) AND ws1.ws_ship_date_sk=d_date_sk AND ws1.ws_ship_addr_sk=ca_address_sk AND ca_state='IL' AND ws1.ws_web_site_sk=web_site_sk AND web_company_name='pri' ORDER BY count(DISTINCT ws_order_number) LIMIT 100{code} > when shuffle hash join is enabled, q95 performance deteriorates > --------------------------------------------------------------- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.2, 3.2.0 > Reporter: caican > Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, > image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png, > shuffle1.png, sort1.png, sort2.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! > > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org