[jira] [Created] (SPARK-44426) optimize adaptive skew join for ExistenceJoin
caican created SPARK-44426: -- Summary: optimize adaptive skew join for ExistenceJoin Key: SPARK-44426 URL: https://issues.apache.org/jira/browse/SPARK-44426 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.2 Reporter: caican For the query below, the IN subquery is rewritten as an `ExistenceJoin`, and `ExistenceJoin` currently does not support adaptive skew-join optimization for the left table. {code:java} SELECT * FROM skewData1 where (key1 in (select key2 from skewData2) or value1 in (select value2 from skewData2)){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
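For context, AQE's skew-join optimization works by detecting partitions far larger than the rest and splitting them into sub-partitions that are joined independently. The sketch below is a toy Python model of that splitting decision, not Spark's actual `OptimizeSkewedJoin` code; the `skew_factor` and `target_size` values are made-up defaults for illustration.

```python
def split_skewed_partition(sizes, skew_factor=5.0, target_size=64):
    """Toy model of AQE skew-join splitting: a partition is 'skewed' when it
    exceeds skew_factor * median partition size; skewed partitions are split
    into roughly target_size-sized chunks that can be joined independently.
    Returns a list of (partition_id, number_of_splits)."""
    med = sorted(sizes)[len(sizes) // 2]  # median partition size
    plan = []
    for pid, size in enumerate(sizes):
        if size > skew_factor * med and size > target_size:
            n_splits = -(-size // target_size)  # ceiling division
            plan.append((pid, n_splits))
        else:
            plan.append((pid, 1))  # not skewed: keep as a single task
    return plan

# One heavily skewed partition gets split; the others stay whole.
print(split_skewed_partition([10, 12, 11, 900]))
```

Extending this handling to `ExistenceJoin` should be sound for the left side, since only left-table rows are emitted and each left sub-partition can probe a duplicated copy of the right side independently.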
[jira] [Updated] (SPARK-44426) optimize adaptive skew join for ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-44426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-44426: --- Description: For the query below, the IN subquery is rewritten as an ExistenceJoin, and ExistenceJoin currently does not support adaptive skew-join optimization for the left table. {code:java} SELECT * FROM skewData1 where (key1 in (select key2 from skewData2) or value1 in (select value2 from skewData2)){code} was: For this query, InSubQuery would be cast to `ExistenceJoin` and now `ExistenceJoin` does not support automatic data skew for the left table. {code:java} SELECT * FROM skewData1 where (key1 in (select key2 from skewData2) or value1 in (select value2 from skewData2){code} > optimize adaptive skew join for ExistenceJoin > - > > Key: SPARK-44426 > URL: https://issues.apache.org/jira/browse/SPARK-44426 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0, 3.4.0 >Reporter: caican >Priority: Major > > For the query below, the IN subquery is rewritten as an ExistenceJoin, and > ExistenceJoin currently does not support adaptive skew-join optimization for the left table. > {code:java} > SELECT * FROM skewData1 > where > (key1 in (select key2 from skewData2) > or value1 in (select value2 from skewData2)){code}
[jira] [Updated] (SPARK-44419) Support to extract partial filters of datasource v2 table and push them down
[ https://issues.apache.org/jira/browse/SPARK-44419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-44419: --- Description: Running the following SQL, the date predicate in the WHERE clause is not pushed down, which causes a full table scan. {code:java} SELECT id, data, date FROM testcat.db.table where (date = 20221110 and udfStrLen(data) = 8) or (date = 2022 and udfStrLen(data) = 8) {code} was: Run the following sql, and the date predicate in the where clause is not pushed down and it would cause a full table scan. {code:java} SELECT id, data, date FROM testcat.db.table where (date = 20221110 and udfStrLen(data) = 8) or (date = 2022 and udfStrLen(data) = 8) {code} > Support to extract partial filters of datasource v2 table and push them down > > > Key: SPARK-44419 > URL: https://issues.apache.org/jira/browse/SPARK-44419 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0, 3.4.0 >Reporter: caican >Priority: Major > > > Running the following SQL, the date predicate in the WHERE clause is not > pushed down, which causes a full table scan. > > {code:java} > SELECT > id, > data, > date > FROM > testcat.db.table > where > (date = 20221110 and udfStrLen(data) = 8) > or > (date = 2022 and udfStrLen(data) = 8) {code}
[jira] [Created] (SPARK-44419) Support to extract partial filters of datasource v2 table and push them down
caican created SPARK-44419: -- Summary: Support to extract partial filters of datasource v2 table and push them down Key: SPARK-44419 URL: https://issues.apache.org/jira/browse/SPARK-44419 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.2 Reporter: caican Running the following SQL, the date predicate in the WHERE clause is not pushed down, which causes a full table scan. {code:java} SELECT id, data, date FROM testcat.db.table where (date = 20221110 and udfStrLen(data) = 8) or (date = 2022 and udfStrLen(data) = 8) {code}
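The requested improvement is essentially partial predicate pushdown: from `(A AND B) OR (C AND D)`, where `A` and `C` reference only source columns but `B` and `D` call a UDF the source cannot evaluate, a weaker-but-sound filter `A OR C` can be extracted and pushed to the scan, while the full predicate is still applied afterwards. A small Python sketch of that extraction over predicates encoded as tuples (illustrative only; it does not reflect Spark's actual filter-translation code):

```python
def extract_pushable(disjuncts, is_pushable):
    """Given a predicate as OR-of-AND lists, build a weaker filter that is
    safe to push to the source: for each disjunct keep only its pushable
    conjuncts and OR the results. If some disjunct has no pushable conjunct,
    that branch accepts every row, so nothing at all can be pushed."""
    pushed = []
    for conjuncts in disjuncts:
        keep = [c for c in conjuncts if is_pushable(c)]
        if not keep:
            return None  # no sound partial filter exists
        pushed.append(keep)
    return pushed  # interpreted as OR over AND-lists

# Toy encoding of the filter from the report; udfStrLen is treated as
# non-pushable because the data source cannot evaluate a Spark UDF.
filt = [[("date", "=", 20221110), ("udfStrLen(data)", "=", 8)],
        [("date", "=", 2022), ("udfStrLen(data)", "=", 8)]]
pushable = lambda c: not c[0].startswith("udf")
print(extract_pushable(filt, pushable))
```

Here the extracted filter is `date = 20221110 OR date = 2022`, which would prune partitions at scan time even though the UDF conjuncts stay in Spark.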
[jira] [Created] (SPARK-44414) Fixed matching check for CharType/VarcharType
caican created SPARK-44414: -- Summary: Fixed matching check for CharType/VarcharType Key: SPARK-44414 URL: https://issues.apache.org/jira/browse/SPARK-44414 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.2 Reporter: caican Running the following code throws an exception
{code:java}
val analyzer = getAnalyzer
// check varchar type
val json1 = "{\"__CHAR_VARCHAR_TYPE_STRING\":\"varchar(80)\"}"
val metadata1 = new MetadataBuilder().withMetadata(Metadata.fromJson(json1)).build()
val query1 = TestRelation(StructType(Seq(
  StructField("x", StringType, metadata = metadata1),
  StructField("y", StringType, metadata = metadata1))).toAttributes)
val table1 = TestRelation(StructType(Seq(
  StructField("x", StringType, metadata = metadata1),
  StructField("y", StringType, metadata = metadata1))).toAttributes)
val parsedPlanByName1 = byName(table1, query1)
analyzer.executeAndCheck(parsedPlanByName1, new QueryPlanningTracker())
{code}
Exception details are as follows
{code:java}
org.apache.spark.sql.AnalysisException: unresolved operator 'AppendData TestRelation [x#8, y#9], true;
'AppendData TestRelation [x#8, y#9], true
+- TestRelation [x#6, y#7]
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:52)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:51)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:156)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$47(CheckAnalysis.scala:704)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$47$adapted(CheckAnalysis.scala:702)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:702)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:92)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:156)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:177)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:174)
  at org.apache.spark.sql.catalyst.analysis.DataSourceV2AnalysisBaseSuite.$anonfun$new$36(DataSourceV2AnalysisSuite.scala:691)
{code}
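The failing `AppendData` resolution compares the query's output columns against the table's columns, and here the `__CHAR_VARCHAR_TYPE_STRING` metadata participates in that comparison. One plausible fix direction, sketched below in Python over plain dicts (hypothetical data model, not Spark's `StructField` API), is to make the compatibility check insensitive to the char/varchar annotation, since both sides are ordinary strings at runtime:

```python
CHAR_VARCHAR_KEY = "__CHAR_VARCHAR_TYPE_STRING"

def fields_compatible(query_field, table_field):
    """Sketch of a metadata-insensitive match: two string fields that differ
    only in char/varchar length metadata should still resolve against each
    other when matching an AppendData output column by name."""
    def normalize(f):
        meta = {k: v for k, v in f.get("metadata", {}).items()
                if k != CHAR_VARCHAR_KEY}  # drop the char/varchar annotation
        return (f["name"], f["type"], meta)
    return normalize(query_field) == normalize(table_field)

q = {"name": "x", "type": "string",
     "metadata": {CHAR_VARCHAR_KEY: "varchar(80)"}}
t = {"name": "x", "type": "string", "metadata": {}}
print(fields_compatible(q, t))  # both sides are plain strings once the
                                # char/varchar annotation is ignored
```

Any other metadata difference (or a name/type mismatch) still makes the fields incompatible, so only the char/varchar length hint is excused.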
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Affects Version/s: 3.3.2 > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.2 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, > image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png, > shuffle1.png, sort1.png, sort2.png > > > Testing with a 5TB dataset, the performance of q95 in TPC-DS deteriorates when > shuffle hash join is enabled, while performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min (sortMergeJoin) to > 8.1min (shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, GC pressure is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it? Thanks!
[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080 ] caican edited comment on SPARK-43526 at 5/19/23 2:51 AM: - gently ping [~yumwang] I find that the shuffle hash join is slower than the sort merge join here because sort nodes are added after the two shuffle hash joins, and the row count produced by those two joins expands a lot. I rewrote q95: after disabling shuffle hash join and adding a sort operation after the corresponding join nodes, q95 execution also became slow. 1. The execution plan before rewriting the q95 SQL is as follows: *Sort merge join* !sort1.png|width=926,height=473! *shuffle hash join* !shuffle1.png|width=921,height=441! 2. The execution plan after rewriting the q95 SQL is as follows: *sort merge join* !sort2.png|width=936,height=496! The sort operation was added after the corresponding join nodes, and execution was slower than with shuffle hash join. This confirms that performance deteriorates when shuffle hash join is enabled because a large amount of data is sorted. !image-2023-05-19-10-43-51-747.png|width=708,height=38!
*q95 sql with sort operation added*
{code:java}
set spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";
set spark.sql.execution.removeRedundantSorts=false;
WITH ws_wh AS (
  SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2
  FROM web_sales ws1, web_sales ws2
  WHERE ws1.ws_order_number = ws2.ws_order_number
    AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk
  SORT BY ws1.ws_order_number
),
tmp1 as (SELECT ws_order_number FROM ws_wh),
tmp2 as (
  SELECT wr_order_number
  FROM web_returns, ws_wh
  WHERE wr_order_number = ws_wh.ws_order_number
  SORT BY wr_order_number
)
SELECT count(DISTINCT ws_order_number) AS `order count`,
       sum(ws_ext_ship_cost) AS `total shipping cost`,
       sum(ws_net_profit) AS `total net profit`
FROM web_sales ws1
  left semi join tmp1 on ws1.ws_order_number = tmp1.ws_order_number
  left semi join tmp2 on ws1.ws_order_number = tmp2.wr_order_number
  join date_dim on ws1.ws_ship_date_sk = date_dim.d_date_sk
  join customer_address on ws1.ws_ship_addr_sk = customer_address.ca_address_sk
  join web_site on ws1.ws_web_site_sk = web_site.web_site_sk
WHERE d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE) + INTERVAL 60 DAY)
  AND ws1.ws_ship_date_sk = d_date_sk
  AND ws1.ws_ship_addr_sk = ca_address_sk
  AND ca_state = 'IL'
  AND ws1.ws_web_site_sk = web_site_sk
  AND web_company_name = 'pri'
ORDER BY count(DISTINCT ws_order_number)
LIMIT 100
{code}
was (Author: JIRAUSER280464): I find that the shuffle hash join is slower than the sort merge join because the sort node is added after two shuffle hash joins, and the number of data bars of the two shuffle hash joins expands a lot. I overwrote q95, after closing shuffle hash join and adding sort operation after corresponding join nodes, q95 execution also became slow. 1. The execution plan before I rewrite q95 sql is as follows: *Sort merge join* !sort1.png|width=926,height=473! *shuffle hash join* !shuffle1.png|width=921,height=441! 2.
The execution plan after I rewrite q95 sql is as follows: *sort merge join* !sort2.png|width=936,height=496! The sort operation was added after the corresponding join nodes, and the execution was slower than shuffle hash join. And it can be confirmed that the performance deteriorates after the shuffle hash join function is enabled because a large amount of data is sorted. !image-2023-05-19-10-43-51-747.png|width=708,height=38! *q95 sql with sort operation added* {code:java} set spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts"; set spark.sql.execution.removeRedundantSorts=false; WITH ws_wh AS ( SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2 FROM web_sales ws1, web_sales ws2 WHERE ws1.ws_order_number=ws2.ws_order_number AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk SORT BY ws1.ws_order_number ), tmp1 as ( SELECT ws_order_number FROM ws_wh ), tmp2 as ( SELECT wr_order_number FROM web_returns, ws_wh WHERE wr_order_number=ws_wh.ws_order_number SORT BY wr_order_number ) SELECT count(DISTINCT w
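The analysis above comes down to where the sorts run: sort merge join sorts the two join inputs, while the rewritten plan sorts data after the self-join on ws_order_number has multiplied the row count. A back-of-the-envelope Python sketch of that comparison, using an n·log2(n) comparison-sort cost model with made-up row counts and an assumed fan-out (the actual 5TB figures are not in the report):

```python
import math

def sort_cost(rows):
    """Comparison-sort cost up to a constant factor: n * log2(n)."""
    return rows * math.log2(rows)

# Hypothetical row counts in the spirit of the q95 analysis: the self-join
# of web_sales on ws_order_number expands the row count, so a sort placed
# *after* the join handles far more rows than sorts on the join inputs.
left = right = 10_000_000
expanded = 80_000_000  # assumed ~8x fan-out from the self-join

smj_sorts = sort_cost(left) + sort_cost(right)  # sort both inputs (SMJ)
post_join_sort = sort_cost(expanded)            # sort the expanded output

print(post_join_sort > smj_sorts)
```

Under any fan-out greater than roughly 2x, sorting after the expansion dominates, which matches the observation that the rewritten query is slow for the same reason the shuffled-hash-join plan is.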
[jira] [Commented] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080 ] caican commented on SPARK-43526: I find that the shuffle hash join is slower than the sort merge join here because sort nodes are added after the two shuffle hash joins, and the row count produced by those two joins expands a lot. I rewrote q95: after disabling shuffle hash join and adding a sort operation after the corresponding join nodes, q95 execution also became slow. The execution plan before rewriting the q95 SQL is as follows: *Sort merge join* !sort1.png|width=926,height=473! *shuffle hash join* !shuffle1.png|width=921,height=441! The execution plan after rewriting the q95 SQL is as follows: !sort2.png|width=936,height=496! The sort operation was added after the corresponding join nodes, and execution was slower than with shuffle hash join. This confirms that performance deteriorates when shuffle hash join is enabled because a large amount of data is sorted. !image-2023-05-19-10-43-51-747.png|width=932,height=50!
*q95 sql with sort operation added*
{code:java}
set spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";
set spark.sql.execution.removeRedundantSorts=false;
WITH ws_wh AS (
  SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2
  FROM web_sales ws1, web_sales ws2
  WHERE ws1.ws_order_number = ws2.ws_order_number
    AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk
  SORT BY ws1.ws_order_number
),
tmp1 as (SELECT ws_order_number FROM ws_wh),
tmp2 as (
  SELECT wr_order_number
  FROM web_returns, ws_wh
  WHERE wr_order_number = ws_wh.ws_order_number
  SORT BY wr_order_number
)
SELECT count(DISTINCT ws_order_number) AS `order count`,
       sum(ws_ext_ship_cost) AS `total shipping cost`,
       sum(ws_net_profit) AS `total net profit`
FROM web_sales ws1
  left semi join tmp1 on ws1.ws_order_number = tmp1.ws_order_number
  left semi join tmp2 on ws1.ws_order_number = tmp2.wr_order_number
  join date_dim on ws1.ws_ship_date_sk = date_dim.d_date_sk
  join customer_address on ws1.ws_ship_addr_sk = customer_address.ca_address_sk
  join web_site on ws1.ws_web_site_sk = web_site.web_site_sk
WHERE d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE) + INTERVAL 60 DAY)
  AND ws1.ws_ship_date_sk = d_date_sk
  AND ws1.ws_ship_addr_sk = ca_address_sk
  AND ca_state = 'IL'
  AND ws1.ws_web_site_sk = web_site_sk
  AND web_company_name = 'pri'
ORDER BY count(DISTINCT ws_order_number)
LIMIT 100
{code}
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, > image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png,
> shuffle1.png, sort1.png, sort2.png > > > Testing with a 5TB dataset, the performance of q95 in TPC-DS deteriorates when > shuffle hash join is enabled, while performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min (sortMergeJoin) to > 8.1min (shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, GC pressure is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it? Thanks!
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: sort2.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-19-10-43-51-747.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: shuffle1.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: sort1.png
[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388 ] caican edited comment on SPARK-43526 at 5/17/23 9:03 AM:

[~yumwang] TPC-DS tests show performance gains for most queries, and we plan to prefer shuffledHashJoin, to eliminate the sort cost, whenever the small table falls under a certain threshold. However, q95 shows a serious performance regression, so we are not sure whether the preference can be turned on by default.

with shuffledHashJoin:
!image-2023-05-17-16-53-42-302.png|width=691,height=344!

sortMergeJoin is preferred:
!image-2023-05-17-16-54-59-053.png|width=722,height=319!

was (Author: JIRAUSER280464): the same comment, with the second caption reading "without shuffledHashJoin:" instead of "sortMergeJoin is preferred:".
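A threshold-based global preference is not the only option: Spark 3.x also supports per-query join strategy hints, which could keep shuffled hash join for the queries that benefit while pinning q95's large self-join to sort merge join. This is only a sketch of that alternative, not what the reporter actually ran:

```sql
-- Sketch: force sort merge join for q95's web_sales self-join via a hint,
-- while other queries keep the shuffled-hash-join preference.
SELECT /*+ MERGE(ws2) */
       ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2
FROM web_sales ws1, web_sales ws2
WHERE ws1.ws_order_number = ws2.ws_order_number
  AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;

-- Conversely, /*+ SHUFFLE_HASH(t) */ requests a shuffled hash join on t.
```

Hints avoid the blanket regression at the cost of annotating individual queries, which may not be practical for a generated TPC-DS workload.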
[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388 ] caican edited comment on SPARK-43526 at 5/17/23 9:02 AM: this edit prepended "Tpcds tests show performance gains for most queries" to the comment; the remainder is unchanged from the original comment below.
[jira] [Commented] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388 ] caican commented on SPARK-43526:

[~yumwang] We plan to prefer shuffledHashJoin, to eliminate the sort cost, whenever the small table falls under a certain threshold, but q95 in TPC-DS shows a serious performance regression, so we are not sure whether the preference can be turned on by default.

with shuffledHashJoin:
!image-2023-05-17-16-53-42-302.png|width=691,height=344!

without shuffledHashJoin:
!image-2023-05-17-16-54-59-053.png|width=722,height=319!
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-17-16-54-59-053.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-17-16-53-42-302.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: image-2023-05-16-21-23-33-611.png)
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: image-2023-05-16-21-22-44-532.png)
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: image-2023-05-16-21-20-18-727.png)
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: application_1684208757063_0028_90.html)
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: application_1684208757063_0028_90.html
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: (formatting-only edit; the visible description text is unchanged)
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-28-44-163.png|width=935,height=64! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-28-11-514.png|width=922,height=67! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, > image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-28-11-514.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, > image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-28-44-163.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, > image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1190,height=78! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1190,height=78! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1190,height=78! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! and When shuffledHashJoin is enabled, gc is very serious. !image-2023-05-16-21-12-24-618.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-15-21-047.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png! > !image-2023-05-16-21-21-35-493.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png! > !image-2023-05-16-21-22-16-170.png! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=1340,height=92! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=1340,height=92! > !image-2023-05-16-21-21-35-493.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png! > !image-2023-05-16-21-22-16-170.png! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-24-09-182.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-01-53-423.png! > !image-2023-05-16-21-16-37-376.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-05-45-361.png! > !image-2023-05-16-21-16-13-128.png! > > and When shuffledHashJoin is enabled, gc is very serious. > !image-2023-05-16-21-12-24-618.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-15-21-047.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=1340,height=92! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png! > !image-2023-05-16-21-22-16-170.png! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-23-35-237.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-01-53-423.png! > !image-2023-05-16-21-16-37-376.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-05-45-361.png! > !image-2023-05-16-21-16-13-128.png! > > and When shuffledHashJoin is enabled, gc is very serious. > !image-2023-05-16-21-12-24-618.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-15-21-047.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-23-33-611.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-01-53-423.png! > !image-2023-05-16-21-16-37-376.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-05-45-361.png! > !image-2023-05-16-21-16-13-128.png! > > and When shuffledHashJoin is enabled, gc is very serious. > !image-2023-05-16-21-12-24-618.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-15-21-047.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-22-16-170.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-22-44-532.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-21-35-493.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-20-18-727.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with a 5 TB dataset, the performance of q95 in TPC-DS deteriorates when shuffled hash join is enabled; performance is better when sort-merge join is used. Performance difference: from 3.9 min (sortMergeJoin) to 8.1 min (shuffledHashJoin). 1. With shuffledHashJoin enabled, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! 2. With shuffledHashJoin disabled, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! When shuffledHashJoin is enabled, GC pressure is severe. !image-2023-05-16-21-12-24-618.png! sortMergeJoin does not exhibit this problem. !image-2023-05-16-21-15-21-047.png! Any suggestions on how to solve this? Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. From 8.1min(shuffledHashJoin) to 3.9min(sortMergeJoin). enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! And When shuffledHashJoin is enabled, gc is very serious !image-2023-05-16-21-12-24-618.png! But sortMergeJoin executes without this problem !image-2023-05-16-21-15-21-047.png!
[jira] [Created] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
caican created SPARK-43526: -- Summary: when shuffle hash join is enabled, q95 performance deteriorates Key: SPARK-43526 URL: https://issues.apache.org/jira/browse/SPARK-43526 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0, 3.1.2 Reporter: caican Testing with a 5 TB dataset, the performance of q95 in TPC-DS deteriorates when shuffled hash join is enabled; performance is better when sort-merge join is used. From 8.1 min (shuffledHashJoin) to 3.9 min (sortMergeJoin). With shuffledHashJoin enabled, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! With shuffledHashJoin disabled, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! When shuffledHashJoin is enabled, GC pressure is severe: !image-2023-05-16-21-12-24-618.png! sortMergeJoin does not exhibit this problem: !image-2023-05-16-21-15-21-047.png!
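The memory-behavior difference behind this report can be sketched in miniature. The Python below is an illustrative toy, not Spark's Scala implementation: a shuffled hash join materializes the whole build side of a task in a hash table, while a sort-merge join streams through sorted inputs keeping only the current key group resident — at 5 TB, that resident hash table is a plausible source of the heavy GC observed. (In Spark itself, `spark.sql.join.preferSortMergeJoin=true` is one knob for steering the planner back to sort-merge join; verify the config against your Spark version.)

```python
# Toy single-process sketch of the two join strategies (NOT Spark's code).

def hash_join(left, right):
    # Shuffled-hash style: materialize the entire build (right) side in a
    # hash table -- the per-task memory footprint that can drive heavy GC.
    table = {}
    for k, v in right:
        table.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, lv in left for rv in table.get(k, [])]

def sort_merge_join(left, right):
    # Sort-merge style: sort both sides once, then stream through them,
    # holding only the current key group in memory.
    left, right = sorted(left), sorted(right)
    out, i = [], 0
    for k, lv in left:
        while i < len(right) and right[i][0] < k:
            i += 1
        for rk, rv in right[i:]:
            if rk != k:
                break
            out.append((k, lv, rv))
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
# Both strategies produce the same join result; they differ in memory behavior.
assert sorted(hash_join(left, right)) == sorted(sort_merge_join(left, right))
```

The equivalence of results is the point: the planner's choice here is purely a cost/memory trade-off, which is why a configuration change alone can flip the 8.1 min case back to 3.9 min.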
[jira] [Updated] (SPARK-43065) Set job description for tpcds queries
[ https://issues.apache.org/jira/browse/SPARK-43065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43065: --- Description: When using Spark's TPCDSQueryBenchmark to run TPC-DS, the Spark UI does not display the SQL information. !https://user-images.githubusercontent.com/94670132/230567550-9bb2842c-aecc-41a5-acb6-0ff8ea765df1.png|width=1694,height=523!
[jira] [Created] (SPARK-43065) Set job description for tpcds queries
caican created SPARK-43065: -- Summary: Set job description for tpcds queries Key: SPARK-43065 URL: https://issues.apache.org/jira/browse/SPARK-43065 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0, 3.2.0, 3.1.2 Reporter: caican When using Spark's TPCDSQueryBenchmark to run TPC-DS, the Spark UI does not display the SQL information. !https://user-images.githubusercontent.com/94670132/230567550-9bb2842c-aecc-41a5-acb6-0ff8ea765df1.png|width=1694,height=523!
[jira] [Updated] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40455: --- Description: Here's a very serious bug: When a result stage fails due to a FetchFailedException, the condition used to decide whether the result stage may be retried is numMissingPartitions < resultStage.numTasks. If this condition holds on retry, but the other tasks of the current result stage are not killed, then when the result stage is resubmitted it gets the wrong set of partitions to recompute: {code:java} // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code} It is possible that the number of partitions to be recomputed is smaller than the actual number of partitions of the result stage. was: Here's a very serious bug: When result stage failed caused by FetchFailedException, the previous condition to determine whether result stage retries are allowed is numMissingPartitions < resultStage.numTasks. If this condition holds on retry, but the other tasks in the current result stage are not killed, when result stage was resubmit, it would got wrong partitions to recalculation. {code:java} // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code}
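The race described in this report can be modeled with a toy (plain Python with hypothetical names; Spark's DAGScheduler is Scala): the retry decision is made against one snapshot of missing partitions, but because first-attempt tasks are not killed, the set returned by `findMissingPartitions` at resubmit time can be smaller than the set the decision was based on.

```python
# Hypothetical sketch of the resubmit race described in this report.

class ToyResultStage:
    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.finished = [False] * num_tasks  # per-partition completion flags

    def find_missing_partitions(self):
        # Analogous to Stage.findMissingPartitions: only unfinished partitions.
        return [p for p in range(self.num_tasks) if not self.finished[p]]

stage = ToyResultStage(num_tasks=4)

# Partition 0 completes; partition 1 fails with a FetchFailed.
stage.finished[0] = True
missing_at_failure = stage.find_missing_partitions()       # [1, 2, 3]
retry_allowed = len(missing_at_failure) < stage.num_tasks  # 3 < 4 -> True

# First-attempt tasks for partitions 2 and 3 were never killed, and they
# finish while the retry is being scheduled.
stage.finished[2] = True
stage.finished[3] = True

# The resubmitted attempt recomputes fewer partitions than the retry decision
# assumed -- the "smaller than the actual number" problem the report describes.
missing_at_resubmit = stage.find_missing_partitions()      # [1]
assert retry_allowed and len(missing_at_resubmit) < len(missing_at_failure)
```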
[jira] [Updated] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40455: --- Description: Here's a very serious bug: When result stage failed caused by FetchFailedException, the previous condition to determine whether result stage retries are allowed is numMissingPartitions < resultStage.numTasks. If this condition holds on retry, but the other tasks in the current result stage are not killed, when result stage was resubmit, it would got wrong partitions to recalculation. {code:java} // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code}
[jira] [Updated] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40455: --- Description: Here's a very serious bug: When result stage failed caused by `FetchFailedException`, the previous condition to determine whether result stage retries are allowed is `numMissingPartitions < resultStage.numTasks`. If this condition holds on retry, but the other tasks in the current result stage are not killed, when result stage was resubmit, it would got wrong partitions to recalculation. {code:java} // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code}
[jira] [Updated] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40455: --- Description: Here's a very serious bug: When result stage failed caused by `FetchFailedException`, the previous condition to determine whether result stage retries are allowed is `numMissingPartitions < resultStage.numTasks`. If this condition holds on retry, but the other tasks in the current result stage are not killed, when result stage was resubmit, it would got wrong partitions to recalculation ``` // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() ```
[jira] [Updated] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40455: --- Description: Here's a very serious bug:
[jira] [Created] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
caican created SPARK-40455: -- Summary: Abort result stage directly when it failed caused by FetchFailed Key: SPARK-40455 URL: https://issues.apache.org/jira/browse/SPARK-40455 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0, 3.2.1, 3.1.2, 3.0.0 Reporter: caican
[jira] [Comment Edited] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582935#comment-17582935 ] caican edited comment on SPARK-40170 at 8/22/22 12:13 PM: -- [~kabhwan] My program code is very simple, as shown below. ``` val rdd = spark.sql("select triggerId,adMetadata,userData from iceberg_my_cloud.mydb.myTable where date = 20220801").rdd println(rdd.count()) ``` In addition to string decoding, the conversion of Tuple2 to Map is slow, and I have submitted a patch (https://github.com/apache/spark/pull/37609) to optimize it; but right now I don't have a good way to optimize the string decoding. was (Author: JIRAUSER280464): My program code is very simple,As shown below. ``` val rdd = spark.sql("select triggerId,adMetadata,userData from iceberg_my_cloud.mydb.myTable where date = 20220801").rdd println(rdd.count()) ```
[jira] [Commented] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582935#comment-17582935 ] caican commented on SPARK-40170: My program code is very simple, as shown below. ``` val rdd = spark.sql("select triggerId,adMetadata,userData from iceberg_my_cloud.mydb.myTable where date = 20220801").rdd println(rdd.count()) ```
[jira] [Updated] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40170: --- Affects Version/s: 3.2.2, 3.2.1, 3.1.3, 3.2.0, 3.1.2, 3.3.1
[jira] [Updated] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow
[ https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40175: --- Description: Converting Tuple2 to Scala Map via `.toMap` is slow !image-2022-08-22-14-58-53-046.png! !image-2022-08-22-14-58-26-491.png! was: Converting Tuple2 to Scala Map via `.toMap` is slow !image-2022-08-22-14-56-50-280.png! !image-2022-08-22-14-57-37-954.png!
[jira] [Updated] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow
[ https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40175: --- Attachment: image-2022-08-22-14-58-53-046.png
[jira] [Updated] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow
[ https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40175: --- Attachment: image-2022-08-22-14-58-26-491.png
[jira] [Created] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow
caican created SPARK-40175: -- Summary: Converting Tuple2 to Scala Map via `.toMap` is slow Key: SPARK-40175 URL: https://issues.apache.org/jira/browse/SPARK-40175 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.2, 3.3.0, 3.1.3, 3.2.0, 3.1.2, 3.3.1 Reporter: caican Converting Tuple2 to a Scala Map via `.toMap` is slow. !image-2022-08-22-14-56-50-280.png! !image-2022-08-22-14-57-37-954.png!
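SPARK-40175 concerns Scala's `.toMap`; as a language-agnostic illustration of the underlying cost pattern (a hypothetical Python analogue, not Spark code), compare building a map through repeated immutable copies, which is O(n²) overall, with a single mutable build pass, which is O(n) — roughly the difference between folding pairs into an immutable `Map` element by element and filling a mutable builder once.

```python
# Hypothetical Python analogue of the ".toMap is slow" cost pattern (not Spark code).
import timeit

pairs = [(str(i), i) for i in range(1000)]

def immutable_style():
    # Each step copies the whole accumulated map: O(n^2) overall, akin to
    # folding pairs into an immutable Map one element at a time.
    m = {}
    for k, v in pairs:
        m = {**m, k: v}
    return m

def mutable_style():
    # One in-place build pass: O(n), akin to using a mutable map builder.
    m = {}
    for k, v in pairs:
        m[k] = v
    return m

assert immutable_style() == mutable_style() == dict(pairs)
slow = timeit.timeit(immutable_style, number=5)
fast = timeit.timeit(mutable_style, number=5)
assert fast < slow  # the single-pass build wins by a wide margin
```

The results are identical; only the build strategy differs, which is why a builder-style rewrite (as the linked PR presumably pursues) can help without changing semantics.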
[jira] [Commented] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582708#comment-17582708 ]

caican commented on SPARK-40170:
--------------------------------

gently ping [~sowen] [~r...@databricks.com]

> StringCoding UTF8 decode slowly
> -------------------------------
>
>                 Key: SPARK-40170
>                 URL: https://issues.apache.org/jira/browse/SPARK-40170
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: caican
>            Priority: Major
>         Attachments: image-2022-08-22-10-56-54-768.png, image-2022-08-22-10-57-11-744.png
>
> When `UnsafeRow` is converted to `Row` at
> `org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
> the UTF8String decoding and copyMemory steps are very slow.
> Does anyone have any ideas for optimization?
> !image-2022-08-22-10-56-54-768.png!
> !image-2022-08-22-10-57-11-744.png!
[jira] [Updated] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40170:
---------------------------
Description:
When `UnsafeRow` is converted to `Row` at
`org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
the UTF8String decoding and copyMemory steps are very slow.
Does anyone have any ideas for optimization?
!image-2022-08-22-10-56-54-768.png!
!image-2022-08-22-10-57-11-744.png!

was:
When `UnsafeRow` is converted to `Row` at
`org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
the UTF8String decoding and copyMemory steps are very slow.
!image-2022-08-22-10-56-54-768.png!
!image-2022-08-22-10-57-11-744.png!
[jira] [Updated] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40170:
---------------------------
Attachment: image-2022-08-22-10-57-11-744.png
[jira] [Updated] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40170:
---------------------------
Description:
When `UnsafeRow` is converted to `Row` at
`org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
the UTF8String decoding and copyMemory steps are very slow.
!image-2022-08-22-10-56-54-768.png!
!image-2022-08-22-10-57-11-744.png!

was:
When `UnsafeRow` is converted to `Row` at
`org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
the UTF8String decoding and copyMemory steps are very slow.
!image-2022-08-22-10-51-07-542.png!
!image-2022-08-22-10-56-04-574.png!
[jira] [Updated] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40170:
---------------------------
Attachment: image-2022-08-22-10-56-54-768.png
[jira] [Created] (SPARK-40170) StringCoding UTF8 decode slowly
caican created SPARK-40170:
------------------------------

             Summary: StringCoding UTF8 decode slowly
                 Key: SPARK-40170
                 URL: https://issues.apache.org/jira/browse/SPARK-40170
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: caican
         Attachments: image-2022-08-22-10-56-54-768.png

When `UnsafeRow` is converted to `Row` at
`org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
the UTF8String decoding and copyMemory steps are very slow.

!image-2022-08-22-10-51-07-542.png!
!image-2022-08-22-10-56-04-574.png!
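The hot path the ticket profiles, materializing every string column of an `UnsafeRow` as a `java.lang.String`, can be sketched in plain Scala. This is an illustrative model, not Spark's actual classes: `Utf8Bytes` is a hypothetical stand-in for `UTF8String`, showing that the byte copy plus full UTF-8 decode happens once per string field, and that keeping the raw bytes and decoding lazily defers that cost to the fields actually read.

```scala
// Illustrative model (not Spark's classes): Utf8Bytes is a hypothetical
// stand-in for UTF8String. Converting UnsafeRow -> Row eagerly decodes
// every string field; keeping bytes and decoding lazily defers the cost.
import java.nio.charset.StandardCharsets.UTF_8

final case class Utf8Bytes(bytes: Array[Byte]) {
  // The expensive step createExternalRow pays per string field:
  // a byte copy plus a full UTF-8 decode.
  lazy val asString: String = new String(bytes, UTF_8)
}

object DecodeCost {
  def main(args: Array[String]): Unit = {
    val row: Seq[Utf8Bytes] =
      Seq("alpha", "beta", "gamma").map(s => Utf8Bytes(s.getBytes(UTF_8)))

    // Eager conversion: every field is decoded up front.
    val eager: Seq[String] = row.map(_.asString)

    // Deferred conversion: only the field actually read is decoded.
    val firstOnly: String = row.head.asString

    assert(eager.head == firstOnly)
  }
}
```

The sketch only frames where the time goes; whether Spark can defer the decode depends on what the consumer of the external `Row` does with each field.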
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)` comes before `(id#22L = 2)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
// `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 2)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)` comes before `(id#22L = 2)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
// `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 2)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)` comes before `(id#22L = 2)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
// `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 2)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)` comes before `(id#22L = 2)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
{noformat}
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
// code placeholder{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
{noformat}
[jira] [Created] (SPARK-40045) The order of filtering predicates is not reasonable
caican created SPARK-40045: -- Summary: The order of filtering predicates is not reasonable Key: SPARK-40045 URL: https://issues.apache.org/jira/browse/SPARK-40045 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0, 3.2.0, 3.1.2 Reporter: caican {code:java} select id, data FROM testcat.ns1.ns2.table where id = 2 and md5(data) = '8cde774d6f7333752ed72cacddb05126' and trim(data) = 'a' {code} Based on the SQL, we currently get the filters in the following order: {code:java} // code placeholder{code} In this predicate order, all data needs to participate in the evaluation, even if some data does not meet the later filtering criteria, and it may cause Spark tasks to execute slowly. So I think that expensive filtering predicates should automatically be moved to the far right, so that rows already rejected by the cheaper predicates are never evaluated against them. As shown below: {noformat} {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
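The ordering problem described in this ticket can be sketched outside Spark. The snippet below is an illustrative Python sketch only (the cost numbers are hypothetical and this is not Spark's optimizer): sorting predicates by estimated cost before a short-circuiting AND keeps the expensive md5 comparison from ever running on rows that a cheap equality check already rejects.

```python
# Illustrative sketch only (hypothetical costs, not Spark's real optimizer):
# running cheap predicates first lets short-circuit AND evaluation skip
# expensive ones for rows that are already rejected.
import hashlib

rows = [(i, f"data{i}") for i in range(10)]

calls = {"md5": 0}  # count how often the expensive predicate runs

def md5_pred(r):
    calls["md5"] += 1
    return hashlib.md5(r[1].encode()).hexdigest() == "8cde774d6f7333752ed72cacddb05126"

# (predicate, estimated cost); lower cost should be evaluated first
predicates = [
    (md5_pred, 10),
    (lambda r: r[1].strip() == "a", 5),
    (lambda r: r[0] == 2, 1),
]

def apply_filters(rows, predicates):
    # Sort by estimated cost so the cheap equality check runs first;
    # all(...) short-circuits, so later predicates are skipped for
    # rows that already failed an earlier, cheaper one.
    ordered = sorted(predicates, key=lambda p: p[1])
    return [r for r in rows if all(pred(r) for pred, _ in ordered)]

result = apply_filters(rows, predicates)
# Only the row with id == 2 survives the first check, and it then fails
# trim(data) == 'a', so md5 is never evaluated at all.
print(result, calls["md5"])  # → [] 0
```

With the original left-to-right order, md5 would have been evaluated for all ten rows before the cheap `id = 2` check ran.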
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before updated the ui: !image-2022-03-16-10-56-46-446.png! After updated the ui, display the number of empty partitions: !image-2022-03-16-11-07-39-182.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before updated the ui:) !image-2022-03-16-10-56-46-446.png! After updated the ui, display the number of empty partitions:) !image-2022-03-16-11-07-39-182.png! > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display the number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before updated the ui: > !image-2022-03-16-10-56-46-446.png! > After updated the ui, display the number of empty partitions: > !image-2022-03-16-11-07-39-182.png! 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before updated the ui:) !image-2022-03-16-10-56-46-446.png! After updated the ui, display the number of empty partitions:) !image-2022-03-16-11-07-39-182.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modify the ui: !image-2022-03-16-10-56-46-446.png! > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display the number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before updated the ui:) > !image-2022-03-16-10-56-46-446.png! > After updated the ui, display the number of empty partitions:) > !image-2022-03-16-11-07-39-182.png! 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: image-2022-03-16-11-07-39-182.png > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display the number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modify the ui: > !image-2022-03-16-10-56-46-446.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modify the ui: !image-2022-03-16-10-56-46-446.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modify the ui: !image-2022-03-16-10-56-46-446.png! > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display the number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modify the ui: > !image-2022-03-16-10-56-46-446.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Summary: display the number of empty partitions on spark ui (was: display number of empty partitions on spark ui) > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modify the ui: > !image-2022-03-16-10-56-46-446.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: (was: ui.png) > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modify the ui: > !image-2022-03-16-10-56-46-446.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting join from broadcast-hash to smj, i think it is necessary to display number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modify the ui: !image-2022-03-16-10-56-46-446.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, ui.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modify the ui: > !image-2022-03-16-10-56-46-446.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: image-2022-03-16-10-56-46-446.png > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, ui.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: ui.png > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, ui.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: (was: 小米办公20220316-105510.png) > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: 小米办公20220316-105510.png > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Summary: display number of empty partitions on spark ui (was: display number of empty partitions on spark ui when demoting join from broadcast-hash to smj) > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui when demoting join from broadcast-hash to smj
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting join from broadcast-hash to smj, i think it is necessary to display number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. was: When demoting join from broadcast-hash to smj, i think it is necessary to show number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. > display number of empty partitions on spark ui when demoting join from > broadcast-hash to smj > > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38559) show number of empty partitions on spark ui when demoting join from broadcast-hash to smj
caican created SPARK-38559: -- Summary: show number of empty partitions on spark ui when demoting join from broadcast-hash to smj Key: SPARK-38559 URL: https://issues.apache.org/jira/browse/SPARK-38559 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.1.2 Reporter: caican When demoting a join from broadcast-hash to SMJ, I think it is necessary to show the number of empty partitions on the Spark UI. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui when demoting join from broadcast-hash to smj
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Summary: display number of empty partitions on spark ui when demoting join from broadcast-hash to smj (was: show number of empty partitions on spark ui when demoting join from broadcast-hash to smj) > display number of empty partitions on spark ui when demoting join from > broadcast-hash to smj > > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting join from broadcast-hash to smj, i think it is necessary to > show number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38444) Automatically calculate the upper and lower bounds of partitions when no specified partition related params
caican created SPARK-38444: -- Summary: Automatically calculate the upper and lower bounds of partitions when no specified partition related params Key: SPARK-38444 URL: https://issues.apache.org/jira/browse/SPARK-38444 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2 Reporter: caican When accessing an RDBMS such as MySQL, if partitionColumn, lowerBound, upperBound and numPartitions are not specified, only one partition is used to scan the database by default. This makes loading data from the database slow, and it is difficult for users to configure multiple parameters correctly to improve parallelism. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
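The improvement proposed in this ticket can be sketched as follows. This is a hedged illustration, not Spark's actual JDBC source: it assumes the bounds would first be fetched with a `SELECT MIN(col), MAX(col)` query against the table, and then shows stride-based range predicates similar in spirit to what the existing lowerBound/upperBound/numPartitions options produce.

```python
# Sketch of deriving JDBC partition predicates automatically (illustrative
# only; column name "id" and bounds 0/100 stand in for values that would
# come from a MIN/MAX query, e.g. SELECT MIN(id), MAX(id) FROM table).
def partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper] into num_partitions WHERE clauses using a
    fixed stride, mirroring range-based JDBC partitioning."""
    stride = max((upper - lower) // num_partitions, 1)
    preds = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            # First partition also picks up NULLs so no row is lost.
            preds.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended to cover the upper bound.
            preds.append(f"{column} >= {current}")
        else:
            preds.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return preds

print(partition_predicates("id", 0, 100, 4))
# → ['id < 25 OR id IS NULL', 'id >= 25 AND id < 50',
#    'id >= 50 AND id < 75', 'id >= 75']
```

Each predicate would become the WHERE clause of one scan task, giving parallelism without the user supplying any bounds by hand.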
[jira] [Created] (SPARK-38431) Support to delete matched rows from jdbc tables
caican created SPARK-38431: -- Summary: Support to delete matched rows from jdbc tables Key: SPARK-38431 URL: https://issues.apache.org/jira/browse/SPARK-38431 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2 Reporter: caican Spark SQL cannot perform a delete operation when it accesses an RDBMS. I think that is not user-friendly. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447280#comment-17447280 ] caican commented on SPARK-37382: [~victor-wong] Do the images display normally now? > `with as` clause got inconsistent results > - > > Key: SPARK-37382 > URL: https://issues.apache.org/jira/browse/SPARK-37382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: spark2.3.png, spark3.1.png > > > In Spark3.1, the `with as` clause in the same SQL is executed multiple times, > got different results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !spark3.1.png! > But In spark2.3, it got consistent results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !spark2.3.png! > Why does Spark3.1.2 return different results? > Has anyone encountered this problem? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37382: --- Description: In Spark3.1, the `with as` clause in the same SQL is executed multiple times, got different results ` with tab as ( select 'Withas' as name, rand() as rand_number ) select name, rand_number from tab union all select name, rand_number from tab ` !spark3.1.png! But In spark2.3, it got consistent results ` with tab as ( select 'Withas' as name, rand() as rand_number ) select name, rand_number from tab union all select name, rand_number from tab ` !spark2.3.png! Why does Spark3.1.2 return different results? Has anyone encountered this problem? was: In Spark3.1, the `with as` clause in the same SQL is executed multiple times, got different results ` with tab as ( select 'Withas' as name, rand() as rand_number ) select name, rand_number from tab union all select name, rand_number from tab ` !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965! But In spark2.3, it got consistent results ` with tab as ( select 'Withas' as name, rand() as rand_number ) select name, rand_number from tab union all select name, rand_number from tab ` !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468! Why does Spark3.1.2 return different results? Has anyone encountered this problem? 
> `with as` clause got inconsistent results > - > > Key: SPARK-37382 > URL: https://issues.apache.org/jira/browse/SPARK-37382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: spark2.3.png, spark3.1.png > > > In Spark3.1, the `with as` clause in the same SQL is executed multiple times, > got different results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !spark3.1.png! > But In spark2.3, it got consistent results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !spark2.3.png! > Why does Spark3.1.2 return different results? > Has anyone encountered this problem? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
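A plausible explanation for the difference reported in SPARK-37382 (hedged: the snippet below simulates the behavior in plain Python rather than quoting Spark internals) is that when the CTE is inlined, the non-deterministic rand() expression is re-evaluated at each reference, while evaluating the CTE once and reusing its result gives matching values across the UNION branches.

```python
# Simulation of CTE inlining vs. one-time evaluation for a
# non-deterministic expression (illustrative; not Spark code).
import random

def rand_expr():
    # Stands in for SQL rand(): non-deterministic per evaluation.
    return random.random()

# Inlined CTE: each branch of the UNION re-evaluates rand(),
# so the two "copies" of the same row disagree.
random.seed(0)
inlined = [rand_expr(), rand_expr()]

# CTE evaluated once: both branches reuse the same materialized row,
# so the values match.
random.seed(0)
once = rand_expr()
reused = [once, once]

print(inlined[0] != inlined[1], reused[0] == reused[1])  # → True True
```

This is why the same query can return consistent results under one planner and inconsistent results under another: the observable behavior depends on whether the non-deterministic subquery is shared or duplicated.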
[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37382: --- Attachment: spark2.3.png > `with as` clause got inconsistent results > - > > Key: SPARK-37382 > URL: https://issues.apache.org/jira/browse/SPARK-37382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: spark2.3.png, spark3.1.png > > > In Spark3.1, the `with as` clause in the same SQL is executed multiple times, > got different results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965! > But In spark2.3, it got consistent results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468! > Why does Spark3.1.2 return different results? > Has anyone encountered this problem? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37382: --- Attachment: spark3.1.png > `with as` clause got inconsistent results > - > > Key: SPARK-37382 > URL: https://issues.apache.org/jira/browse/SPARK-37382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: spark3.1.png > > > In Spark3.1, the `with as` clause in the same SQL is executed multiple times, > got different results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965! > But In spark2.3, it got consistent results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468! > Why does Spark3.1.2 return different results? > Has anyone encountered this problem? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447277#comment-17447277 ] caican commented on SPARK-37382: [~zhenw] Thank you for your reply, I will test it out.
[jira] [Updated] (SPARK-37383) Print the parsing time for each phase of a SQL
[ https://issues.apache.org/jira/browse/SPARK-37383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37383: --- Affects Version/s: 2.4.0 (was: 3.2.0)

> Print the parsing time for each phase of a SQL
> --
>
> Key: SPARK-37383
> URL: https://issues.apache.org/jira/browse/SPARK-37383
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: caican
> Priority: Major
>
> The time spent in each phase of a SQL query is counted and recorded in QueryPlanningTracker, but it is not surfaced anywhere. When SQL parsing is suspected to be slow, we cannot confirm which phase is slow; therefore, it is necessary to print out the per-phase time.
[jira] [Updated] (SPARK-37383) Print the parsing time for each phase of a SQL
[ https://issues.apache.org/jira/browse/SPARK-37383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37383: --- Summary: Print the parsing time for each phase of a SQL (was: Prints the parsing time for each phase of a SQL)
[jira] [Created] (SPARK-37383) Prints the parsing time for each phase of a SQL
caican created SPARK-37383: -- Summary: Prints the parsing time for each phase of a SQL Key: SPARK-37383 URL: https://issues.apache.org/jira/browse/SPARK-37383 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: caican

The time spent in each phase of a SQL query is counted and recorded in QueryPlanningTracker, but it is not surfaced anywhere. When SQL parsing is suspected to be slow, we cannot confirm which phase is slow; therefore, it is necessary to print out the per-phase time.
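The idea in the issue above — the per-phase durations are already being recorded, they only need a readable summary printed somewhere — can be sketched outside Spark with a minimal stand-in for QueryPlanningTracker. The `PhaseTracker` class and its method names below are illustrative assumptions, not Spark's actual API:

```python
import time
from collections import OrderedDict


class PhaseTracker:
    """Toy analogue of Spark's QueryPlanningTracker: record the wall-clock
    duration of each planning phase, then expose a printable summary."""

    def __init__(self):
        self.phases = OrderedDict()  # phase name -> duration in milliseconds

    def measure(self, name, fn):
        # Run one phase and record how long it took.
        start = time.perf_counter()
        result = fn()
        self.phases[name] = (time.perf_counter() - start) * 1000.0
        return result

    def summary(self):
        # This is the part the issue asks for: the durations already exist,
        # they just need to be formatted and printed.
        return ", ".join(f"{n}: {ms:.2f} ms" for n, ms in self.phases.items())


tracker = PhaseTracker()
tracker.measure("parsing", lambda: sum(range(10_000)))
tracker.measure("analysis", lambda: sum(range(10_000)))
print(tracker.summary())
```

In Spark itself the recorded phases would be parsing, analysis, optimization, and planning; the proposal amounts to logging a summary line like the one `summary()` produces.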
[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37382: --- Description: In Spark 3.1, a `with as` clause referenced multiple times in the same SQL statement returns different results:
{code:sql}
with tab as (
  select 'Withas' as name, rand() as rand_number
)
select name, rand_number from tab
union all
select name, rand_number from tab
{code}
!https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965!
But in Spark 2.3, the same query returns consistent results:
!https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468!
Why does Spark 3.1.2 return different results? Has anyone encountered this problem?
[jira] [Created] (SPARK-37382) `with as` clause got inconsistent results
caican created SPARK-37382: -- Summary: `with as` clause got inconsistent results Key: SPARK-37382 URL: https://issues.apache.org/jira/browse/SPARK-37382 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2 Reporter: caican

In Spark 3.1, a `with as` clause referenced multiple times in the same SQL statement returns different results:
{code:sql}
with tab as (
  select 'Withas' as name, rand() as rand_number
)
select name, rand_number from tab
union all
select name, rand_number from tab
{code}
!https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965!
But in Spark 2.3, the same query returns consistent results:
!https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468!
Has anyone encountered this problem?
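The inconsistency reported above is typically a consequence of CTE inlining: when the optimizer substitutes the CTE body into each place it is referenced, a non-deterministic expression such as `rand()` is evaluated once per reference instead of once overall. The usual workaround is to materialize the non-deterministic result before referencing it twice. A minimal sketch of that workaround using SQLite (a stand-in engine chosen only because it ships with Python; the temp-table approach, not Spark's API, is what is being illustrated):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Workaround mirroring "materialize the CTE first": evaluate the
# non-deterministic expression exactly once into a temp table, then let
# both branches of the UNION ALL read the stored copy.
conn.execute(
    "CREATE TEMP TABLE tab AS SELECT 'Withas' AS name, random() AS rand_number"
)
rows = conn.execute(
    "SELECT name, rand_number FROM tab "
    "UNION ALL "
    "SELECT name, rand_number FROM tab"
).fetchall()

# Both rows come from the same stored value, so they must be identical.
assert rows[0] == rows[1]
print(rows)
```

In Spark itself the analogous fix would be to persist the intermediate result (for example, caching the DataFrame) before reusing it; whether a CTE gets inlined is optimizer- and version-dependent, which would account for the Spark 2.3 vs. 3.1 difference reported here.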