[ 
https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080
 ] 

caican edited comment on SPARK-43526 at 5/19/23 2:48 AM:
---------------------------------------------------------

I find that the shuffle hash join is slower than the sort merge join because 
the sort node is added after two shuffle hash joins, and the number of data 
bars of the two shuffle hash joins expands a lot.

I overwrote q95, after closing shuffle hash join and adding sort operation 
after corresponding join nodes, q95 execution also became slow.

 

1. The execution plan before I rewrite q95 sql is as follows:



*Sort merge join*

!sort1.png|width=926,height=473!

*shuffle hash join*

!shuffle1.png|width=921,height=441!

 

2. The execution plan after I rewrite q95 sql is as follows:

*sort merge join*

!sort2.png|width=936,height=496!

 

The sort operation was added after the corresponding join nodes, and the 
execution was slower than shuffle hash join.

And it can be confirmed that the performance deteriorates after the shuffle 
hash join function is enabled because a large amount of data is sorted.

!image-2023-05-19-10-43-51-747.png|width=708,height=38!

 

*q95 sql with sort operation added*

 
{code:java}
 
set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";

set spark.sql.execution.removeRedundantSorts=false;

WITH
    ws_wh AS (
        SELECT
            ws1.ws_order_number,
            ws1.ws_warehouse_sk wh1,
            ws2.ws_warehouse_sk wh2
        FROM
            web_sales ws1,
            web_sales ws2
        WHERE
            ws1.ws_order_number=ws2.ws_order_number
            AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk
        SORT BY
            ws1.ws_order_number
    ),
    tmp1 as (
        SELECT
            ws_order_number
        FROM
            ws_wh
    ),
    tmp2 as (
        SELECT
            wr_order_number
        FROM
            web_returns,
            ws_wh
        WHERE
            wr_order_number=ws_wh.ws_order_number
        SORT BY
            wr_order_number
    )
SELECT
    count(DISTINCT ws_order_number) AS `order count `,
    sum(ws_ext_ship_cost) AS `total shipping cost `,
    sum(ws_net_profit) AS `total net profit `
FROM
    web_sales ws1
    left semi join tmp1 on ws1.ws_order_number=tmp1.ws_order_number
    left semi join tmp2 on ws1.ws_order_number=tmp2.wr_order_number
    join date_dim on ws1.ws_ship_date_sk=date_dim.d_date_sk
    join customer_address on ws1.ws_ship_addr_sk=customer_address.ca_address_sk
    join web_site on ws1.ws_web_site_sk=web_site.web_site_sk
WHERE
    d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE)+INTERVAL 60 DAY)
    AND ws1.ws_ship_date_sk=d_date_sk
    AND ws1.ws_ship_addr_sk=ca_address_sk
    AND ca_state='IL'
    AND ws1.ws_web_site_sk=web_site_sk
    AND web_company_name='pri'
ORDER BY
    count(DISTINCT ws_order_number)
LIMIT
    100{code}
 


was (Author: JIRAUSER280464):
I find that the shuffle hash join is slower than the sort merge join because 
the sort node is added after two shuffle hash joins, and the number of data 
bars of the two shuffle hash joins expands a lot.
I overwrote q95, after closing shuffle hash join and adding sort operation 
after corresponding join nodes, q95 execution also became slow.

1. The execution plan before I rewrite q95 sql is as follows:
**

*Sort merge join*

!sort1.png|width=926,height=473!

*shuffle hash join*

!shuffle1.png|width=921,height=441!

 

2. The execution plan after I rewrite q95 sql is as follows:

*sort merge join*

!sort2.png|width=936,height=496!

 

The sort operation was added after the corresponding join nodes, and the 
execution was slower than shuffle hash join.

And it can be confirmed that the performance deteriorates after the shuffle 
hash join function is enabled because a large amount of data is sorted.

!image-2023-05-19-10-43-51-747.png|width=932,height=50!

 

 

 

*q95 sql with sort operation added*

 
{code:java}
 
set 
spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";

set spark.sql.execution.removeRedundantSorts=false;

WITH
    ws_wh AS (
        SELECT
            ws1.ws_order_number,
            ws1.ws_warehouse_sk wh1,
            ws2.ws_warehouse_sk wh2
        FROM
            web_sales ws1,
            web_sales ws2
        WHERE
            ws1.ws_order_number=ws2.ws_order_number
            AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk
        SORT BY
            ws1.ws_order_number
    ),
    tmp1 as (
        SELECT
            ws_order_number
        FROM
            ws_wh
    ),
    tmp2 as (
        SELECT
            wr_order_number
        FROM
            web_returns,
            ws_wh
        WHERE
            wr_order_number=ws_wh.ws_order_number
        SORT BY
            wr_order_number
    )
SELECT
    count(DISTINCT ws_order_number) AS `order count `,
    sum(ws_ext_ship_cost) AS `total shipping cost `,
    sum(ws_net_profit) AS `total net profit `
FROM
    web_sales ws1
    left semi join tmp1 on ws1.ws_order_number=tmp1.ws_order_number
    left semi join tmp2 on ws1.ws_order_number=tmp2.wr_order_number
    join date_dim on ws1.ws_ship_date_sk=date_dim.d_date_sk
    join customer_address on ws1.ws_ship_addr_sk=customer_address.ca_address_sk
    join web_site on ws1.ws_web_site_sk=web_site.web_site_sk
WHERE
    d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE)+INTERVAL 60 DAY)
    AND ws1.ws_ship_date_sk=d_date_sk
    AND ws1.ws_ship_addr_sk=ca_address_sk
    AND ca_state='IL'
    AND ws1.ws_web_site_sk=web_site_sk
    AND web_company_name='pri'
ORDER BY
    count(DISTINCT ws_order_number)
LIMIT
    100{code}
 

> when shuffle hash join is enabled, q95 performance deteriorates
> ---------------------------------------------------------------
>
>                 Key: SPARK-43526
>                 URL: https://issues.apache.org/jira/browse/SPARK-43526
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.2, 3.2.0
>            Reporter: caican
>            Priority: Major
>         Attachments: image-2023-05-16-21-21-35-493.png, 
> image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, 
> image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, 
> image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, 
> image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png, 
> shuffle1.png, sort1.png, sort2.png
>
>
> Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when 
> shuffle hash join is enabled and the performance is better when sortMergeJoin 
> is used.
>  
> Performance difference:  from 3.9min(sortMergeJoin) to 
> 8.1min(shuffledHashJoin)
>  
> 1. enable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-44-163.png|width=935,height=64!
> !image-2023-05-16-21-21-35-493.png|width=924,height=502!
> 2. disable shuffledHashJoin, the execution plan is as follows:
> !image-2023-05-16-21-28-11-514.png|width=922,height=67!
> !image-2023-05-16-21-22-16-170.png|width=934,height=477!
>  
> And when shuffledHashJoin is enabled, gc is very serious,
> !image-2023-05-16-21-23-35-237.png|width=929,height=570!
>  
> but sortMergeJoin executes without this problem.
> !image-2023-05-16-21-24-09-182.png|width=931,height=573!
>  
> Any suggestions on how to solve it?Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to