[jira] [Created] (SPARK-44426) optimize adaptive skew join for ExistenceJoin
caican created SPARK-44426: -- Summary: optimize adaptive skew join for ExistenceJoin Key: SPARK-44426 URL: https://issues.apache.org/jira/browse/SPARK-44426 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.2 Reporter: caican For the query below, the IN subquery is rewritten as an `ExistenceJoin`, and `ExistenceJoin` currently does not support adaptive skew-join optimization for the left table. {code:java} SELECT * FROM skewData1 where (key1 in (select key2 from skewData2) or value1 in (select value2 from skewData2)){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
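For context, AQE's skew-join optimization works by detecting partitions far larger than the rest and splitting them into sub-partitions that are joined independently. The sketch below is a toy Python model of that splitting decision, not Spark's actual `OptimizeSkewedJoin` code; the `skew_factor` and `target_size` values are made-up defaults for illustration.

```python
def split_skewed_partition(sizes, skew_factor=5.0, target_size=64):
    """Toy model of AQE skew-join splitting: a partition is 'skewed' when it
    exceeds skew_factor * median partition size; skewed partitions are split
    into roughly target_size-sized chunks that can be joined independently.
    Returns a list of (partition_id, number_of_splits)."""
    med = sorted(sizes)[len(sizes) // 2]  # median partition size
    plan = []
    for pid, size in enumerate(sizes):
        if size > skew_factor * med and size > target_size:
            n_splits = -(-size // target_size)  # ceiling division
            plan.append((pid, n_splits))
        else:
            plan.append((pid, 1))  # not skewed: keep as a single task
    return plan

# One heavily skewed partition gets split; the others stay whole.
print(split_skewed_partition([10, 12, 11, 900]))
```

Extending this handling to `ExistenceJoin` should be sound for the left side, since only left-table rows are emitted and each left sub-partition can probe a duplicated copy of the right side independently.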
[jira] [Updated] (SPARK-44426) optimize adaptive skew join for ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-44426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-44426: --- Description: For the query below, the IN subquery is rewritten as an ExistenceJoin, and ExistenceJoin currently does not support adaptive skew-join optimization for the left table. {code:java} SELECT * FROM skewData1 where (key1 in (select key2 from skewData2) or value1 in (select value2 from skewData2)){code} was: For this query, InSubQuery would be cast to `ExistenceJoin` and now `ExistenceJoin` does not support automatic data skew for the left table. {code:java} SELECT * FROM skewData1 where (key1 in (select key2 from skewData2) or value1 in (select value2 from skewData2){code} > optimize adaptive skew join for ExistenceJoin > - > > Key: SPARK-44426 > URL: https://issues.apache.org/jira/browse/SPARK-44426 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0, 3.4.0 >Reporter: caican >Priority: Major > > For the query below, the IN subquery is rewritten as an ExistenceJoin, and > ExistenceJoin currently does not support adaptive skew-join optimization for the left table. > {code:java} > SELECT * FROM skewData1 > where > (key1 in (select key2 from skewData2) > or value1 in (select value2 from skewData2)){code}
[jira] [Updated] (SPARK-44419) Support to extract partial filters of datasource v2 table and push them down
[ https://issues.apache.org/jira/browse/SPARK-44419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-44419: --- Description: Running the following SQL, the date predicate in the WHERE clause is not pushed down, which causes a full table scan. {code:java} SELECT id, data, date FROM testcat.db.table where (date = 20221110 and udfStrLen(data) = 8) or (date = 2022 and udfStrLen(data) = 8) {code} was: Run the following sql, and the date predicate in the where clause is not pushed down and it would cause a full table scan. {code:java} SELECT id, data, date FROM testcat.db.table where (date = 20221110 and udfStrLen(data) = 8) or (date = 2022 and udfStrLen(data) = 8) {code} > Support to extract partial filters of datasource v2 table and push them down > > > Key: SPARK-44419 > URL: https://issues.apache.org/jira/browse/SPARK-44419 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0, 3.4.0 >Reporter: caican >Priority: Major > > > Running the following SQL, the date predicate in the WHERE clause is not > pushed down, which causes a full table scan. > > {code:java} > SELECT > id, > data, > date > FROM > testcat.db.table > where > (date = 20221110 and udfStrLen(data) = 8) > or > (date = 2022 and udfStrLen(data) = 8) {code}
[jira] [Created] (SPARK-44419) Support to extract partial filters of datasource v2 table and push them down
caican created SPARK-44419: -- Summary: Support to extract partial filters of datasource v2 table and push them down Key: SPARK-44419 URL: https://issues.apache.org/jira/browse/SPARK-44419 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.2 Reporter: caican Running the following SQL, the date predicate in the WHERE clause is not pushed down, which causes a full table scan. {code:java} SELECT id, data, date FROM testcat.db.table where (date = 20221110 and udfStrLen(data) = 8) or (date = 2022 and udfStrLen(data) = 8) {code}
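The requested improvement is essentially partial predicate pushdown: from `(A AND B) OR (C AND D)`, where `A` and `C` reference only source columns but `B` and `D` call a UDF the source cannot evaluate, a weaker-but-sound filter `A OR C` can be extracted and pushed to the scan, while the full predicate is still applied afterwards. A small Python sketch of that extraction over predicates encoded as tuples (illustrative only; it does not reflect Spark's actual filter-translation code):

```python
def extract_pushable(disjuncts, is_pushable):
    """Given a predicate as OR-of-AND lists, build a weaker filter that is
    safe to push to the source: for each disjunct keep only its pushable
    conjuncts and OR the results. If some disjunct has no pushable conjunct,
    that branch accepts every row, so nothing at all can be pushed."""
    pushed = []
    for conjuncts in disjuncts:
        keep = [c for c in conjuncts if is_pushable(c)]
        if not keep:
            return None  # no sound partial filter exists
        pushed.append(keep)
    return pushed  # interpreted as OR over AND-lists

# Toy encoding of the filter from the report; udfStrLen is treated as
# non-pushable because the data source cannot evaluate a Spark UDF.
filt = [[("date", "=", 20221110), ("udfStrLen(data)", "=", 8)],
        [("date", "=", 2022), ("udfStrLen(data)", "=", 8)]]
pushable = lambda c: not c[0].startswith("udf")
print(extract_pushable(filt, pushable))
```

Here the extracted filter is `date = 20221110 OR date = 2022`, which would prune partitions at scan time even though the UDF conjuncts stay in Spark.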
[jira] [Created] (SPARK-44414) Fixed matching check for CharType/VarcharType
caican created SPARK-44414: -- Summary: Fixed matching check for CharType/VarcharType Key: SPARK-44414 URL: https://issues.apache.org/jira/browse/SPARK-44414 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.2 Reporter: caican Running the following code throws an exception
{code:java}
val analyzer = getAnalyzer
// check varchar type
val json1 = "{\"__CHAR_VARCHAR_TYPE_STRING\":\"varchar(80)\"}"
val metadata1 = new MetadataBuilder().withMetadata(Metadata.fromJson(json1)).build()
val query1 = TestRelation(StructType(Seq(
  StructField("x", StringType, metadata = metadata1),
  StructField("y", StringType, metadata = metadata1))).toAttributes)
val table1 = TestRelation(StructType(Seq(
  StructField("x", StringType, metadata = metadata1),
  StructField("y", StringType, metadata = metadata1))).toAttributes)
val parsedPlanByName1 = byName(table1, query1)
analyzer.executeAndCheck(parsedPlanByName1, new QueryPlanningTracker())
{code}
Exception details are as follows
{code:java}
org.apache.spark.sql.AnalysisException: unresolved operator 'AppendData TestRelation [x#8, y#9], true;
'AppendData TestRelation [x#8, y#9], true
+- TestRelation [x#6, y#7]
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:52)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:51)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:156)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$47(CheckAnalysis.scala:704)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$47$adapted(CheckAnalysis.scala:702)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:702)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:92)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:156)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:177)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:174)
  at org.apache.spark.sql.catalyst.analysis.DataSourceV2AnalysisBaseSuite.$anonfun$new$36(DataSourceV2AnalysisSuite.scala:691)
{code}
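The failing `AppendData` resolution compares the query's output columns against the table's columns, and here the `__CHAR_VARCHAR_TYPE_STRING` metadata participates in that comparison. One plausible fix direction, sketched below in Python over plain dicts (hypothetical data model, not Spark's `StructField` API), is to make the compatibility check insensitive to the char/varchar annotation, since both sides are ordinary strings at runtime:

```python
CHAR_VARCHAR_KEY = "__CHAR_VARCHAR_TYPE_STRING"

def fields_compatible(query_field, table_field):
    """Sketch of a metadata-insensitive match: two string fields that differ
    only in char/varchar length metadata should still resolve against each
    other when matching an AppendData output column by name."""
    def normalize(f):
        meta = {k: v for k, v in f.get("metadata", {}).items()
                if k != CHAR_VARCHAR_KEY}  # drop the char/varchar annotation
        return (f["name"], f["type"], meta)
    return normalize(query_field) == normalize(table_field)

q = {"name": "x", "type": "string",
     "metadata": {CHAR_VARCHAR_KEY: "varchar(80)"}}
t = {"name": "x", "type": "string", "metadata": {}}
print(fields_compatible(q, t))  # both sides are plain strings once the
                                # char/varchar annotation is ignored
```

Any other metadata difference (or a name/type mismatch) still makes the fields incompatible, so only the char/varchar length hint is excused.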
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Affects Version/s: 3.3.2 > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.2 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, > image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png, > shuffle1.png, sort1.png, sort2.png > > > Testing with a 5TB dataset, the performance of q95 in TPC-DS deteriorates when > shuffle hash join is enabled, while performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min (sortMergeJoin) to > 8.1min (shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, GC pressure is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it? Thanks!
[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080 ] caican edited comment on SPARK-43526 at 5/19/23 2:51 AM: - gently ping [~yumwang] I find that the shuffle hash join is slower than the sort merge join here because sort nodes are added after the two shuffle hash joins, and the row count produced by those two joins expands a lot. I rewrote q95: after disabling shuffle hash join and adding a sort operation after the corresponding join nodes, q95 execution also became slow. 1. The execution plan before rewriting the q95 SQL is as follows: *Sort merge join* !sort1.png|width=926,height=473! *shuffle hash join* !shuffle1.png|width=921,height=441! 2. The execution plan after rewriting the q95 SQL is as follows: *sort merge join* !sort2.png|width=936,height=496! The sort operation was added after the corresponding join nodes, and execution was slower than with shuffle hash join. This confirms that performance deteriorates when shuffle hash join is enabled because a large amount of data is sorted. !image-2023-05-19-10-43-51-747.png|width=708,height=38!
*q95 sql with sort operation added*
{code:java}
set spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";
set spark.sql.execution.removeRedundantSorts=false;
WITH ws_wh AS (
  SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2
  FROM web_sales ws1, web_sales ws2
  WHERE ws1.ws_order_number = ws2.ws_order_number
    AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk
  SORT BY ws1.ws_order_number
),
tmp1 as (SELECT ws_order_number FROM ws_wh),
tmp2 as (
  SELECT wr_order_number
  FROM web_returns, ws_wh
  WHERE wr_order_number = ws_wh.ws_order_number
  SORT BY wr_order_number
)
SELECT count(DISTINCT ws_order_number) AS `order count`,
       sum(ws_ext_ship_cost) AS `total shipping cost`,
       sum(ws_net_profit) AS `total net profit`
FROM web_sales ws1
  left semi join tmp1 on ws1.ws_order_number = tmp1.ws_order_number
  left semi join tmp2 on ws1.ws_order_number = tmp2.wr_order_number
  join date_dim on ws1.ws_ship_date_sk = date_dim.d_date_sk
  join customer_address on ws1.ws_ship_addr_sk = customer_address.ca_address_sk
  join web_site on ws1.ws_web_site_sk = web_site.web_site_sk
WHERE d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE) + INTERVAL 60 DAY)
  AND ws1.ws_ship_date_sk = d_date_sk
  AND ws1.ws_ship_addr_sk = ca_address_sk
  AND ca_state = 'IL'
  AND ws1.ws_web_site_sk = web_site_sk
  AND web_company_name = 'pri'
ORDER BY count(DISTINCT ws_order_number)
LIMIT 100
{code}
was (Author: JIRAUSER280464): I find that the shuffle hash join is slower than the sort merge join because the sort node is added after two shuffle hash joins, and the number of data bars of the two shuffle hash joins expands a lot. I overwrote q95, after closing shuffle hash join and adding sort operation after corresponding join nodes, q95 execution also became slow. 1. The execution plan before I rewrite q95 sql is as follows: *Sort merge join* !sort1.png|width=926,height=473! *shuffle hash join* !shuffle1.png|width=921,height=441! 2.
The execution plan after I rewrite q95 sql is as follows: *sort merge join* !sort2.png|width=936,height=496! The sort operation was added after the corresponding join nodes, and the execution was slower than shuffle hash join. And it can be confirmed that the performance deteriorates after the shuffle hash join function is enabled because a large amount of data is sorted. !image-2023-05-19-10-43-51-747.png|width=708,height=38! *q95 sql with sort operation added* {code:java} set spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts"; set spark.sql.execution.removeRedundantSorts=false; WITH ws_wh AS ( SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2 FROM web_sales ws1, web_sales ws2 WHERE ws1.ws_order_number=ws2.ws_order_number AND ws1.ws_warehouse_sk<>ws2.ws_warehouse_sk SORT BY ws1.ws_order_number ), tmp1 as ( SELECT ws_order_number FROM ws_wh ), tmp2 as ( SELECT wr_order_number FROM web_returns, ws_wh WHERE wr_order_number=ws_wh.ws_order_number SORT BY wr_order_number ) SELECT count(DISTINCT w
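The analysis above comes down to where the sorts run: sort merge join sorts the two join inputs, while the rewritten plan sorts data after the self-join on ws_order_number has multiplied the row count. A back-of-the-envelope Python sketch of that comparison, using an n·log2(n) comparison-sort cost model with made-up row counts and an assumed fan-out (the actual 5TB figures are not in the report):

```python
import math

def sort_cost(rows):
    """Comparison-sort cost up to a constant factor: n * log2(n)."""
    return rows * math.log2(rows)

# Hypothetical row counts in the spirit of the q95 analysis: the self-join
# of web_sales on ws_order_number expands the row count, so a sort placed
# *after* the join handles far more rows than sorts on the join inputs.
left = right = 10_000_000
expanded = 80_000_000  # assumed ~8x fan-out from the self-join

smj_sorts = sort_cost(left) + sort_cost(right)  # sort both inputs (SMJ)
post_join_sort = sort_cost(expanded)            # sort the expanded output

print(post_join_sort > smj_sorts)
```

Under any fan-out greater than roughly 2x, sorting after the expansion dominates, which matches the observation that the rewritten query is slow for the same reason the shuffled-hash-join plan is.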
[jira] [Commented] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724080#comment-17724080 ] caican commented on SPARK-43526: I find that the shuffle hash join is slower than the sort merge join here because sort nodes are added after the two shuffle hash joins, and the row count produced by those two joins expands a lot. I rewrote q95: after disabling shuffle hash join and adding a sort operation after the corresponding join nodes, q95 execution also became slow. The execution plan before rewriting the q95 SQL is as follows: *Sort merge join* !sort1.png|width=926,height=473! *shuffle hash join* !shuffle1.png|width=921,height=441! The execution plan after rewriting the q95 SQL is as follows: !sort2.png|width=936,height=496! The sort operation was added after the corresponding join nodes, and execution was slower than with shuffle hash join. This confirms that performance deteriorates when shuffle hash join is enabled because a large amount of data is sorted. !image-2023-05-19-10-43-51-747.png|width=932,height=50!
*q95 sql with sort operation added*
{code:java}
set spark.sql.optimizer.excludedRules="org.apache.spark.sql.catalyst.optimizer.EliminateSorts";
set spark.sql.execution.removeRedundantSorts=false;
WITH ws_wh AS (
  SELECT ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2
  FROM web_sales ws1, web_sales ws2
  WHERE ws1.ws_order_number = ws2.ws_order_number
    AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk
  SORT BY ws1.ws_order_number
),
tmp1 as (SELECT ws_order_number FROM ws_wh),
tmp2 as (
  SELECT wr_order_number
  FROM web_returns, ws_wh
  WHERE wr_order_number = ws_wh.ws_order_number
  SORT BY wr_order_number
)
SELECT count(DISTINCT ws_order_number) AS `order count`,
       sum(ws_ext_ship_cost) AS `total shipping cost`,
       sum(ws_net_profit) AS `total net profit`
FROM web_sales ws1
  left semi join tmp1 on ws1.ws_order_number = tmp1.ws_order_number
  left semi join tmp2 on ws1.ws_order_number = tmp2.wr_order_number
  join date_dim on ws1.ws_ship_date_sk = date_dim.d_date_sk
  join customer_address on ws1.ws_ship_addr_sk = customer_address.ca_address_sk
  join web_site on ws1.ws_web_site_sk = web_site.web_site_sk
WHERE d_date BETWEEN '1999-02-01' AND (CAST('1999-02-01' AS DATE) + INTERVAL 60 DAY)
  AND ws1.ws_ship_date_sk = d_date_sk
  AND ws1.ws_ship_addr_sk = ca_address_sk
  AND ca_state = 'IL'
  AND ws1.ws_web_site_sk = web_site_sk
  AND web_company_name = 'pri'
ORDER BY count(DISTINCT ws_order_number)
LIMIT 100
{code}
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-21-35-493.png, > image-2023-05-16-21-22-16-170.png, image-2023-05-16-21-23-35-237.png, > image-2023-05-16-21-24-09-182.png, image-2023-05-16-21-28-11-514.png, > image-2023-05-16-21-28-44-163.png, image-2023-05-17-16-53-42-302.png, > image-2023-05-17-16-54-59-053.png, image-2023-05-19-10-43-51-747.png,
> shuffle1.png, sort1.png, sort2.png > > > Testing with a 5TB dataset, the performance of q95 in TPC-DS deteriorates when > shuffle hash join is enabled, while performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min (sortMergeJoin) to > 8.1min (shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, GC pressure is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it? Thanks!
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: sort2.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-19-10-43-51-747.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: shuffle1.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: sort1.png
[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388 ] caican edited comment on SPARK-43526 at 5/17/23 9:03 AM:

[~yumwang] TPC-DS tests show performance gains for most queries, and we plan to prefer shuffledHashJoin, to eliminate the sort cost, whenever the small table falls under a certain threshold. However, q95 shows a serious performance regression, so we are not sure whether the preference can be turned on by default.

with shuffledHashJoin:
!image-2023-05-17-16-53-42-302.png|width=691,height=344!

sortMergeJoin is preferred:
!image-2023-05-17-16-54-59-053.png|width=722,height=319!

was (Author: JIRAUSER280464): the same comment, with the second caption reading "without shuffledHashJoin:" instead of "sortMergeJoin is preferred:".
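A threshold-based global preference is not the only option: Spark 3.x also supports per-query join strategy hints, which could keep shuffled hash join for the queries that benefit while pinning q95's large self-join to sort merge join. This is only a sketch of that alternative, not what the reporter actually ran:

```sql
-- Sketch: force sort merge join for q95's web_sales self-join via a hint,
-- while other queries keep the shuffled-hash-join preference.
SELECT /*+ MERGE(ws2) */
       ws1.ws_order_number, ws1.ws_warehouse_sk wh1, ws2.ws_warehouse_sk wh2
FROM web_sales ws1, web_sales ws2
WHERE ws1.ws_order_number = ws2.ws_order_number
  AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;

-- Conversely, /*+ SHUFFLE_HASH(t) */ requests a shuffled hash join on t.
```

Hints avoid the blanket regression at the cost of annotating individual queries, which may not be practical for a generated TPC-DS workload.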
[jira] [Comment Edited] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388 ] caican edited comment on SPARK-43526 at 5/17/23 9:02 AM: this edit prepended "Tpcds tests show performance gains for most queries" to the comment; the remainder is unchanged from the original comment below.
[jira] [Commented] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723388#comment-17723388 ] caican commented on SPARK-43526:

[~yumwang] We plan to prefer shuffledHashJoin, to eliminate the sort cost, whenever the small table falls under a certain threshold, but q95 in TPC-DS shows a serious performance regression, so we are not sure whether the preference can be turned on by default.

with shuffledHashJoin:
!image-2023-05-17-16-53-42-302.png|width=691,height=344!

without shuffledHashJoin:
!image-2023-05-17-16-54-59-053.png|width=722,height=319!
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-17-16-54-59-053.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-17-16-53-42-302.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: image-2023-05-16-21-23-33-611.png)
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: image-2023-05-16-21-22-44-532.png)
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: image-2023-05-16-21-20-18-727.png)
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: (was: application_1684208757063_0028_90.html)
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: application_1684208757063_0028_90.html
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: (formatting-only edit; the visible description text is unchanged)
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-28-44-163.png|width=935,height=64! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-28-11-514.png|width=922,height=67! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, > image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-44-163.png|width=935,height=64! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-28-11-514.png|width=922,height=67! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-28-11-514.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, > image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-28-44-163.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png, > image-2023-05-16-21-28-11-514.png, image-2023-05-16-21-28-44-163.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1114,height=73! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1190,height=78! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to > 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1114,height=73! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png|width=1190,height=78! !image-2023-05-16-21-22-16-170.png|width=934,height=477! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png|width=929,height=570! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png|width=931,height=573! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png|width=1190,height=78! > !image-2023-05-16-21-22-16-170.png|width=934,height=477! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png|width=929,height=570! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png|width=931,height=573! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! and When shuffledHashJoin is enabled, gc is very serious. !image-2023-05-16-21-12-24-618.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-15-21-047.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png! > !image-2023-05-16-21-21-35-493.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png! > !image-2023-05-16-21-22-16-170.png! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=1340,height=92! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=1340,height=92! > !image-2023-05-16-21-21-35-493.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png! > !image-2023-05-16-21-22-16-170.png! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-24-09-182.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-01-53-423.png! > !image-2023-05-16-21-16-37-376.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-05-45-361.png! > !image-2023-05-16-21-16-13-128.png! > > and When shuffledHashJoin is enabled, gc is very serious. > !image-2023-05-16-21-12-24-618.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-15-21-047.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=990,height=68! !image-2023-05-16-21-21-35-493.png|width=924,height=502! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) 1. enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-20-18-727.png|width=1340,height=92! !image-2023-05-16-21-21-35-493.png! 2. disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-22-44-532.png! !image-2023-05-16-21-22-16-170.png! And when shuffledHashJoin is enabled, gc is very serious, !image-2023-05-16-21-23-35-237.png! but sortMergeJoin executes without this problem. !image-2023-05-16-21-24-09-182.png! Any suggestions on how to solve it?Thanks! 
> when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png, image-2023-05-16-21-24-09-182.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-20-18-727.png|width=990,height=68! > !image-2023-05-16-21-21-35-493.png|width=924,height=502! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-22-44-532.png! > !image-2023-05-16-21-22-16-170.png! > > And when shuffledHashJoin is enabled, gc is very serious, > !image-2023-05-16-21-23-35-237.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-24-09-182.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-23-35-237.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-01-53-423.png! > !image-2023-05-16-21-16-37-376.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-05-45-361.png! > !image-2023-05-16-21-16-13-128.png! > > and When shuffledHashJoin is enabled, gc is very serious. > !image-2023-05-16-21-12-24-618.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-15-21-047.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-23-33-611.png > when shuffle hash join is enabled, q95 performance deteriorates > --- > > Key: SPARK-43526 > URL: https://issues.apache.org/jira/browse/SPARK-43526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: caican >Priority: Major > Attachments: image-2023-05-16-21-20-18-727.png, > image-2023-05-16-21-21-35-493.png, image-2023-05-16-21-22-16-170.png, > image-2023-05-16-21-22-44-532.png, image-2023-05-16-21-23-33-611.png, > image-2023-05-16-21-23-35-237.png > > > Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when > shuffle hash join is enabled and the performance is better when sortMergeJoin > is used. > > Performance difference: from 3.9min(sortMergeJoin) to 8.1min(shuffledHashJoin) > > 1. enable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-01-53-423.png! > !image-2023-05-16-21-16-37-376.png! > 2. disable shuffledHashJoin, the execution plan is as follows: > !image-2023-05-16-21-05-45-361.png! > !image-2023-05-16-21-16-13-128.png! > > and When shuffledHashJoin is enabled, gc is very serious. > !image-2023-05-16-21-12-24-618.png! > but sortMergeJoin executes without this problem. > !image-2023-05-16-21-15-21-047.png! > > Any suggestions on how to solve it?Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-22-16-170.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-22-44-532.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-21-35-493.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Attachment: image-2023-05-16-21-20-18-727.png
[jira] [Updated] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
[ https://issues.apache.org/jira/browse/SPARK-43526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43526: --- Description: Testing with a 5 TB dataset, the performance of q95 in TPC-DS deteriorates when shuffled hash join is enabled; performance is better when sort-merge join is used. Performance difference: from 3.9 min (sortMergeJoin) to 8.1 min (shuffledHashJoin). 1. With shuffledHashJoin enabled, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! 2. With shuffledHashJoin disabled, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! When shuffledHashJoin is enabled, GC pressure is severe. !image-2023-05-16-21-12-24-618.png! sortMergeJoin does not exhibit this problem. !image-2023-05-16-21-15-21-047.png! Any suggestions on how to solve this? Thanks! was: Testing with 5TB dataset, the performance of q95 in tpcds deteriorates when shuffle hash join is enabled and the performance is better when sortMergeJoin is used. From 8.1min(shuffledHashJoin) to 3.9min(sortMergeJoin). enable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! disable shuffledHashJoin, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! And When shuffledHashJoin is enabled, gc is very serious !image-2023-05-16-21-12-24-618.png! But sortMergeJoin executes without this problem !image-2023-05-16-21-15-21-047.png!
[jira] [Created] (SPARK-43526) when shuffle hash join is enabled, q95 performance deteriorates
caican created SPARK-43526: -- Summary: when shuffle hash join is enabled, q95 performance deteriorates Key: SPARK-43526 URL: https://issues.apache.org/jira/browse/SPARK-43526 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0, 3.1.2 Reporter: caican Testing with a 5 TB dataset, the performance of q95 in TPC-DS deteriorates when shuffled hash join is enabled; performance is better when sort-merge join is used. From 8.1 min (shuffledHashJoin) to 3.9 min (sortMergeJoin). With shuffledHashJoin enabled, the execution plan is as follows: !image-2023-05-16-21-01-53-423.png! !image-2023-05-16-21-16-37-376.png! With shuffledHashJoin disabled, the execution plan is as follows: !image-2023-05-16-21-05-45-361.png! !image-2023-05-16-21-16-13-128.png! When shuffledHashJoin is enabled, GC pressure is severe: !image-2023-05-16-21-12-24-618.png! sortMergeJoin does not exhibit this problem: !image-2023-05-16-21-15-21-047.png!
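The memory-behavior difference behind this report can be sketched in miniature. The Python below is an illustrative toy, not Spark's Scala implementation: a shuffled hash join materializes the whole build side of a task in a hash table, while a sort-merge join streams through sorted inputs keeping only the current key group resident — at 5 TB, that resident hash table is a plausible source of the heavy GC observed. (In Spark itself, `spark.sql.join.preferSortMergeJoin=true` is one knob for steering the planner back to sort-merge join; verify the config against your Spark version.)

```python
# Toy single-process sketch of the two join strategies (NOT Spark's code).

def hash_join(left, right):
    # Shuffled-hash style: materialize the entire build (right) side in a
    # hash table -- the per-task memory footprint that can drive heavy GC.
    table = {}
    for k, v in right:
        table.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, lv in left for rv in table.get(k, [])]

def sort_merge_join(left, right):
    # Sort-merge style: sort both sides once, then stream through them,
    # holding only the current key group in memory.
    left, right = sorted(left), sorted(right)
    out, i = [], 0
    for k, lv in left:
        while i < len(right) and right[i][0] < k:
            i += 1
        for rk, rv in right[i:]:
            if rk != k:
                break
            out.append((k, lv, rv))
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
# Both strategies produce the same join result; they differ in memory behavior.
assert sorted(hash_join(left, right)) == sorted(sort_merge_join(left, right))
```

The equivalence of results is the point: the planner's choice here is purely a cost/memory trade-off, which is why a configuration change alone can flip the 8.1 min case back to 3.9 min.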
[jira] [Updated] (SPARK-43065) Set job description for tpcds queries
[ https://issues.apache.org/jira/browse/SPARK-43065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-43065: --- Description: When using Spark's TPCDSQueryBenchmark to run TPC-DS, the Spark UI does not display the SQL information. !https://user-images.githubusercontent.com/94670132/230567550-9bb2842c-aecc-41a5-acb6-0ff8ea765df1.png|width=1694,height=523!
[jira] [Created] (SPARK-43065) Set job description for tpcds queries
caican created SPARK-43065: -- Summary: Set job description for tpcds queries Key: SPARK-43065 URL: https://issues.apache.org/jira/browse/SPARK-43065 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0, 3.2.0, 3.1.2 Reporter: caican When using Spark's TPCDSQueryBenchmark to run TPC-DS, the Spark UI does not display the SQL information. !https://user-images.githubusercontent.com/94670132/230567550-9bb2842c-aecc-41a5-acb6-0ff8ea765df1.png|width=1694,height=523!
[jira] [Updated] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40455: --- Description: Here's a very serious bug: When a result stage fails due to a FetchFailedException, the condition used to decide whether the result stage may be retried is numMissingPartitions < resultStage.numTasks. If this condition holds on retry, but the other tasks of the current result stage are not killed, then when the result stage is resubmitted it gets the wrong set of partitions to recompute: {code:java} // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code} It is possible that the number of partitions to be recomputed is smaller than the actual number of partitions of the result stage. was: Here's a very serious bug: When result stage failed caused by FetchFailedException, the previous condition to determine whether result stage retries are allowed is numMissingPartitions < resultStage.numTasks. If this condition holds on retry, but the other tasks in the current result stage are not killed, when result stage was resubmit, it would got wrong partitions to recalculation. {code:java} // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code}
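The race described in this report can be modeled with a toy (plain Python with hypothetical names; Spark's DAGScheduler is Scala): the retry decision is made against one snapshot of missing partitions, but because first-attempt tasks are not killed, the set returned by `findMissingPartitions` at resubmit time can be smaller than the set the decision was based on.

```python
# Hypothetical sketch of the resubmit race described in this report.

class ToyResultStage:
    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.finished = [False] * num_tasks  # per-partition completion flags

    def find_missing_partitions(self):
        # Analogous to Stage.findMissingPartitions: only unfinished partitions.
        return [p for p in range(self.num_tasks) if not self.finished[p]]

stage = ToyResultStage(num_tasks=4)

# Partition 0 completes; partition 1 fails with a FetchFailed.
stage.finished[0] = True
missing_at_failure = stage.find_missing_partitions()       # [1, 2, 3]
retry_allowed = len(missing_at_failure) < stage.num_tasks  # 3 < 4 -> True

# First-attempt tasks for partitions 2 and 3 were never killed, and they
# finish while the retry is being scheduled.
stage.finished[2] = True
stage.finished[3] = True

# The resubmitted attempt recomputes fewer partitions than the retry decision
# assumed -- the "smaller than the actual number" problem the report describes.
missing_at_resubmit = stage.find_missing_partitions()      # [1]
assert retry_allowed and len(missing_at_resubmit) < len(missing_at_failure)
```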
[jira] [Updated] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40455: --- Description: Here's a very serious bug: When result stage failed caused by FetchFailedException, the previous condition to determine whether result stage retries are allowed is numMissingPartitions < resultStage.numTasks. If this condition holds on retry, but the other tasks in the current result stage are not killed, when result stage was resubmit, it would got wrong partitions to recalculation. {code:java} // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code}
[jira] [Updated] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40455: --- Description: Here's a very serious bug: When result stage failed caused by `FetchFailedException`, the previous condition to determine whether result stage retries are allowed is `numMissingPartitions < resultStage.numTasks`. If this condition holds on retry, but the other tasks in the current result stage are not killed, when result stage was resubmit, it would got wrong partitions to recalculation. {code:java} // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code}
[jira] [Updated] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40455: --- Description: Here's a very serious bug: When result stage failed caused by `FetchFailedException`, the previous condition to determine whether result stage retries are allowed is `numMissingPartitions < resultStage.numTasks`. If this condition holds on retry, but the other tasks in the current result stage are not killed, when result stage was resubmit, it would got wrong partitions to recalculation ``` // DAGScheduler#submitMissingTasks // Figure out the indexes of partition ids to compute. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() ```
[jira] [Updated] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
[ https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40455: --- Description: Here's a very serious bug:
[jira] [Created] (SPARK-40455) Abort result stage directly when it failed caused by FetchFailed
caican created SPARK-40455: -- Summary: Abort result stage directly when it failed caused by FetchFailed Key: SPARK-40455 URL: https://issues.apache.org/jira/browse/SPARK-40455 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0, 3.2.1, 3.1.2, 3.0.0 Reporter: caican
[jira] [Comment Edited] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582935#comment-17582935 ] caican edited comment on SPARK-40170 at 8/22/22 12:13 PM: -- [~kabhwan] My program code is very simple, as shown below. ``` val rdd = spark.sql("select triggerId,adMetadata,userData from iceberg_my_cloud.mydb.myTable where date = 20220801").rdd println(rdd.count()) ``` In addition to string decoding, the conversion of Tuple2 to Map is slow, and I have submitted a patch (https://github.com/apache/spark/pull/37609) to optimize it; but right now I don't have a good way to optimize the string decoding. was (Author: JIRAUSER280464): My program code is very simple,As shown below. ``` val rdd = spark.sql("select triggerId,adMetadata,userData from iceberg_my_cloud.mydb.myTable where date = 20220801").rdd println(rdd.count()) ```
[jira] [Commented] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582935#comment-17582935 ] caican commented on SPARK-40170: My program code is very simple, as shown below. ``` val rdd = spark.sql("select triggerId,adMetadata,userData from iceberg_my_cloud.mydb.myTable where date = 20220801").rdd println(rdd.count()) ```
[jira] [Updated] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40170: --- Affects Version/s: 3.2.2, 3.2.1, 3.1.3, 3.2.0, 3.1.2, 3.3.1
[jira] [Updated] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow
[ https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40175: --- Description: Converting Tuple2 to Scala Map via `.toMap` is slow !image-2022-08-22-14-58-53-046.png! !image-2022-08-22-14-58-26-491.png! was: Converting Tuple2 to Scala Map via `.toMap` is slow !image-2022-08-22-14-56-50-280.png! !image-2022-08-22-14-57-37-954.png!
[jira] [Updated] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow
[ https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40175: --- Attachment: image-2022-08-22-14-58-53-046.png
[jira] [Updated] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow
[ https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-40175: --- Attachment: image-2022-08-22-14-58-26-491.png
[jira] [Created] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow
caican created SPARK-40175: -- Summary: Converting Tuple2 to Scala Map via `.toMap` is slow Key: SPARK-40175 URL: https://issues.apache.org/jira/browse/SPARK-40175 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.2, 3.3.0, 3.1.3, 3.2.0, 3.1.2, 3.3.1 Reporter: caican Converting Tuple2 to a Scala Map via `.toMap` is slow. !image-2022-08-22-14-56-50-280.png! !image-2022-08-22-14-57-37-954.png!
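SPARK-40175 concerns Scala's `.toMap`; as a language-agnostic illustration of the underlying cost pattern (a hypothetical Python analogue, not Spark code), compare building a map through repeated immutable copies, which is O(n²) overall, with a single mutable build pass, which is O(n) — roughly the difference between folding pairs into an immutable `Map` element by element and filling a mutable builder once.

```python
# Hypothetical Python analogue of the ".toMap is slow" cost pattern (not Spark code).
import timeit

pairs = [(str(i), i) for i in range(1000)]

def immutable_style():
    # Each step copies the whole accumulated map: O(n^2) overall, akin to
    # folding pairs into an immutable Map one element at a time.
    m = {}
    for k, v in pairs:
        m = {**m, k: v}
    return m

def mutable_style():
    # One in-place build pass: O(n), akin to using a mutable map builder.
    m = {}
    for k, v in pairs:
        m[k] = v
    return m

assert immutable_style() == mutable_style() == dict(pairs)
slow = timeit.timeit(immutable_style, number=5)
fast = timeit.timeit(mutable_style, number=5)
assert fast < slow  # the single-pass build wins by a wide margin
```

The results are identical; only the build strategy differs, which is why a builder-style rewrite (as the linked PR presumably pursues) can help without changing semantics.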
[jira] [Commented] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582708#comment-17582708 ]

caican commented on SPARK-40170:
--------------------------------

gently ping [~sowen] [~r...@databricks.com]

> StringCoding UTF8 decode slowly
> -------------------------------
>
>                 Key: SPARK-40170
>                 URL: https://issues.apache.org/jira/browse/SPARK-40170
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: caican
>            Priority: Major
>         Attachments: image-2022-08-22-10-56-54-768.png, image-2022-08-22-10-57-11-744.png
>
> When `UnsafeRow` is converted to `Row` at
> `org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
> the UTF8String decoding and copyMemory steps are very slow.
> Does anyone have any ideas for optimization?
> !image-2022-08-22-10-56-54-768.png!
> !image-2022-08-22-10-57-11-744.png!
[jira] [Updated] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40170:
---------------------------
Description:
When `UnsafeRow` is converted to `Row` at
`org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
the UTF8String decoding and copyMemory steps are very slow.
Does anyone have any ideas for optimization?
!image-2022-08-22-10-56-54-768.png!
!image-2022-08-22-10-57-11-744.png!

was:
When `UnsafeRow` is converted to `Row` at
`org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
the UTF8String decoding and copyMemory steps are very slow.
!image-2022-08-22-10-56-54-768.png!
!image-2022-08-22-10-57-11-744.png!
[jira] [Updated] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40170:
---------------------------
Attachment: image-2022-08-22-10-57-11-744.png
[jira] [Updated] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40170:
---------------------------
Description:
When `UnsafeRow` is converted to `Row` at
`org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
the UTF8String decoding and copyMemory steps are very slow.
!image-2022-08-22-10-56-54-768.png!
!image-2022-08-22-10-57-11-744.png!

was:
When `UnsafeRow` is converted to `Row` at
`org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
the UTF8String decoding and copyMemory steps are very slow.
!image-2022-08-22-10-51-07-542.png!
!image-2022-08-22-10-56-04-574.png!
[jira] [Updated] (SPARK-40170) StringCoding UTF8 decode slowly
[ https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40170:
---------------------------
Attachment: image-2022-08-22-10-56-54-768.png
[jira] [Created] (SPARK-40170) StringCoding UTF8 decode slowly
caican created SPARK-40170:
------------------------------

             Summary: StringCoding UTF8 decode slowly
                 Key: SPARK-40170
                 URL: https://issues.apache.org/jira/browse/SPARK-40170
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: caican
         Attachments: image-2022-08-22-10-56-54-768.png

When `UnsafeRow` is converted to `Row` at
`org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
the UTF8String decoding and copyMemory steps are very slow.

!image-2022-08-22-10-51-07-542.png!
!image-2022-08-22-10-56-04-574.png!
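The hot path the ticket profiles, materializing every string column of an `UnsafeRow` as a `java.lang.String`, can be sketched in plain Scala. This is an illustrative model, not Spark's actual classes: `Utf8Bytes` is a hypothetical stand-in for `UTF8String`, showing that the byte copy plus full UTF-8 decode happens once per string field, and that keeping the raw bytes and decoding lazily defers that cost to the fields actually read.

```scala
// Illustrative model (not Spark's classes): Utf8Bytes is a hypothetical
// stand-in for UTF8String. Converting UnsafeRow -> Row eagerly decodes
// every string field; keeping bytes and decoding lazily defers the cost.
import java.nio.charset.StandardCharsets.UTF_8

final case class Utf8Bytes(bytes: Array[Byte]) {
  // The expensive step createExternalRow pays per string field:
  // a byte copy plus a full UTF-8 decode.
  lazy val asString: String = new String(bytes, UTF_8)
}

object DecodeCost {
  def main(args: Array[String]): Unit = {
    val row: Seq[Utf8Bytes] =
      Seq("alpha", "beta", "gamma").map(s => Utf8Bytes(s.getBytes(UTF_8)))

    // Eager conversion: every field is decoded up front.
    val eager: Seq[String] = row.map(_.asString)

    // Deferred conversion: only the field actually read is decoded.
    val firstOnly: String = row.head.asString

    assert(eager.head == firstOnly)
  }
}
```

The sketch only frames where the time goes; whether Spark can defer the decode depends on what the consumer of the external `Row` does with each field.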
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)` comes before `(id#22L = 2)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
// `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 2)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)` comes before `(id#22L = 2)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
// `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 2)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)` comes before `(id#22L = 2)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
// `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 2)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)` comes before `(id#22L = 2)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
{noformat}
[jira] [Updated] (SPARK-40045) The order of filtering predicates is not reasonable
[ https://issues.apache.org/jira/browse/SPARK-40045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caican updated SPARK-40045:
---------------------------
Description:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
{noformat}

was:
{code:java}
select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' {code}
Based on this SQL, we currently get the filters in the following order:
{code:java}
// code placeholder{code}
In this predicate order, all data has to participate in the evaluation, even when some rows fail the later filtering criteria, and that may cause Spark tasks to execute slowly.

So I think that expensive filtering predicates should automatically be placed to the far right, so that rows which fail the cheaper criteria are never evaluated against them.

As shown below:
{noformat}
{noformat}
[jira] [Created] (SPARK-40045) The order of filtering predicates is not reasonable
caican created SPARK-40045: -- Summary: The order of filtering predicates is not reasonable Key: SPARK-40045 URL: https://issues.apache.org/jira/browse/SPARK-40045 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0, 3.2.0, 3.1.2 Reporter: caican {code:java} select id, data FROM testcat.ns1.ns2.table where id = 2 and md5(data) = '8cde774d6f7333752ed72cacddb05126' and trim(data) = 'a' {code} Based on the SQL, we currently get the filters in the following order: {code:java} // code placeholder{code} In this predicate order, all data needs to participate in the evaluation, even if some data does not meet the later filtering criteria, and it may cause Spark tasks to execute slowly. So I think that expensive filtering predicates should automatically be moved to the far right, so that rows already rejected by the cheaper predicates are never evaluated against them. As shown below: {noformat} {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
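The ordering problem described in this ticket can be sketched outside Spark. The snippet below is an illustrative Python sketch only (the cost numbers are hypothetical and this is not Spark's optimizer): sorting predicates by estimated cost before a short-circuiting AND keeps the expensive md5 comparison from ever running on rows that a cheap equality check already rejects.

```python
# Illustrative sketch only (hypothetical costs, not Spark's real optimizer):
# running cheap predicates first lets short-circuit AND evaluation skip
# expensive ones for rows that are already rejected.
import hashlib

rows = [(i, f"data{i}") for i in range(10)]

calls = {"md5": 0}  # count how often the expensive predicate runs

def md5_pred(r):
    calls["md5"] += 1
    return hashlib.md5(r[1].encode()).hexdigest() == "8cde774d6f7333752ed72cacddb05126"

# (predicate, estimated cost); lower cost should be evaluated first
predicates = [
    (md5_pred, 10),
    (lambda r: r[1].strip() == "a", 5),
    (lambda r: r[0] == 2, 1),
]

def apply_filters(rows, predicates):
    # Sort by estimated cost so the cheap equality check runs first;
    # all(...) short-circuits, so later predicates are skipped for
    # rows that already failed an earlier, cheaper one.
    ordered = sorted(predicates, key=lambda p: p[1])
    return [r for r in rows if all(pred(r) for pred, _ in ordered)]

result = apply_filters(rows, predicates)
# Only the row with id == 2 survives the first check, and it then fails
# trim(data) == 'a', so md5 is never evaluated at all.
print(result, calls["md5"])  # → [] 0
```

With the original left-to-right order, md5 would have been evaluated for all ten rows before the cheap `id = 2` check ran.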
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before updated the ui: !image-2022-03-16-10-56-46-446.png! After updated the ui, display the number of empty partitions: !image-2022-03-16-11-07-39-182.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before updated the ui:) !image-2022-03-16-10-56-46-446.png! After updated the ui, display the number of empty partitions:) !image-2022-03-16-11-07-39-182.png! > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display the number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before updated the ui: > !image-2022-03-16-10-56-46-446.png! > After updated the ui, display the number of empty partitions: > !image-2022-03-16-11-07-39-182.png! 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before updated the ui:) !image-2022-03-16-10-56-46-446.png! After updated the ui, display the number of empty partitions:) !image-2022-03-16-11-07-39-182.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modify the ui: !image-2022-03-16-10-56-46-446.png! > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display the number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before updated the ui:) > !image-2022-03-16-10-56-46-446.png! > After updated the ui, display the number of empty partitions:) > !image-2022-03-16-11-07-39-182.png! 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: image-2022-03-16-11-07-39-182.png > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, > image-2022-03-16-11-07-39-182.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display the number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modify the ui: > !image-2022-03-16-10-56-46-446.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting join from broadcast-hash to smj, i think it is necessary to display the number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modify the ui: !image-2022-03-16-10-56-46-446.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modify the ui: !image-2022-03-16-10-56-46-446.png! > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display the number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modify the ui: > !image-2022-03-16-10-56-46-446.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display the number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Summary: display the number of empty partitions on spark ui (was: display number of empty partitions on spark ui) > display the number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modify the ui: > !image-2022-03-16-10-56-46-446.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: (was: ui.png) > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modify the ui: > !image-2022-03-16-10-56-46-446.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting join from broadcast-hash to smj, i think it is necessary to display number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. Before modify the ui: !image-2022-03-16-10-56-46-446.png! was: When demoting join from broadcast-hash to smj, i think it is necessary to display number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, ui.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. > Before modify the ui: > !image-2022-03-16-10-56-46-446.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: image-2022-03-16-10-56-46-446.png > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, ui.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: ui.png > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: image-2022-03-16-10-56-46-446.png, ui.png > > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: (was: 小米办公20220316-105510.png) > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Attachment: 小米办公20220316-105510.png > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Summary: display number of empty partitions on spark ui (was: display number of empty partitions on spark ui when demoting join from broadcast-hash to smj) > display number of empty partitions on spark ui > -- > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui when demoting join from broadcast-hash to smj
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Description: When demoting join from broadcast-hash to smj, i think it is necessary to display number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. was: When demoting join from broadcast-hash to smj, i think it is necessary to show number of empty partitions on spark ui. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. > display number of empty partitions on spark ui when demoting join from > broadcast-hash to smj > > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting join from broadcast-hash to smj, i think it is necessary to > display number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38559) show number of empty partitions on spark ui when demoting join from broadcast-hash to smj
caican created SPARK-38559: -- Summary: show number of empty partitions on spark ui when demoting join from broadcast-hash to smj Key: SPARK-38559 URL: https://issues.apache.org/jira/browse/SPARK-38559 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.1.2 Reporter: caican When demoting a join from broadcast-hash to SMJ, I think it is necessary to show the number of empty partitions on the Spark UI. Otherwise, users might wonder why SMJ is used when joining a small table. Displaying the number of empty partitions is useful for users to understand changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38559) display number of empty partitions on spark ui when demoting join from broadcast-hash to smj
[ https://issues.apache.org/jira/browse/SPARK-38559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-38559: --- Summary: display number of empty partitions on spark ui when demoting join from broadcast-hash to smj (was: show number of empty partitions on spark ui when demoting join from broadcast-hash to smj) > display number of empty partitions on spark ui when demoting join from > broadcast-hash to smj > > > Key: SPARK-38559 > URL: https://issues.apache.org/jira/browse/SPARK-38559 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > > When demoting join from broadcast-hash to smj, i think it is necessary to > show number of empty partitions on spark ui. > Otherwise, users might wonder why SMJ is used when joining a small table. > Displaying the number of empty partitions is useful for users to understand > changes to the execution plan. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38444) Automatically calculate the upper and lower bounds of partitions when no specified partition related params
caican created SPARK-38444: -- Summary: Automatically calculate the upper and lower bounds of partitions when no specified partition related params Key: SPARK-38444 URL: https://issues.apache.org/jira/browse/SPARK-38444 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2 Reporter: caican When accessing an RDBMS such as MySQL, if partitionColumn, lowerBound, upperBound and numPartitions are not specified, only one partition is used to scan the database by default. This makes loading data from the database slow, and it is difficult for users to configure multiple parameters correctly to improve parallelism. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
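The improvement proposed in this ticket can be sketched as follows. This is a hedged illustration, not Spark's actual JDBC source: it assumes the bounds would first be fetched with a `SELECT MIN(col), MAX(col)` query against the table, and then shows stride-based range predicates similar in spirit to what the existing lowerBound/upperBound/numPartitions options produce.

```python
# Sketch of deriving JDBC partition predicates automatically (illustrative
# only; column name "id" and bounds 0/100 stand in for values that would
# come from a MIN/MAX query, e.g. SELECT MIN(id), MAX(id) FROM table).
def partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper] into num_partitions WHERE clauses using a
    fixed stride, mirroring range-based JDBC partitioning."""
    stride = max((upper - lower) // num_partitions, 1)
    preds = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            # First partition also picks up NULLs so no row is lost.
            preds.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended to cover the upper bound.
            preds.append(f"{column} >= {current}")
        else:
            preds.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return preds

print(partition_predicates("id", 0, 100, 4))
# → ['id < 25 OR id IS NULL', 'id >= 25 AND id < 50',
#    'id >= 50 AND id < 75', 'id >= 75']
```

Each predicate would become the WHERE clause of one scan task, giving parallelism without the user supplying any bounds by hand.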
[jira] [Created] (SPARK-38431) Support to delete matched rows from jdbc tables
caican created SPARK-38431: -- Summary: Support to delete matched rows from jdbc tables Key: SPARK-38431 URL: https://issues.apache.org/jira/browse/SPARK-38431 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2 Reporter: caican Spark SQL cannot perform a delete operation when it accesses an RDBMS. I think that is not user-friendly. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447280#comment-17447280 ] caican commented on SPARK-37382: [~victor-wong] Do the images display normally now? > `with as` clause got inconsistent results > - > > Key: SPARK-37382 > URL: https://issues.apache.org/jira/browse/SPARK-37382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: spark2.3.png, spark3.1.png > > > In Spark3.1, the `with as` clause in the same SQL is executed multiple times, > got different results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !spark3.1.png! > But In spark2.3, it got consistent results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !spark2.3.png! > Why does Spark3.1.2 return different results? > Has anyone encountered this problem? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37382: --- Description: In Spark3.1, the `with as` clause in the same SQL is executed multiple times, got different results ` with tab as ( select 'Withas' as name, rand() as rand_number ) select name, rand_number from tab union all select name, rand_number from tab ` !spark3.1.png! But In spark2.3, it got consistent results ` with tab as ( select 'Withas' as name, rand() as rand_number ) select name, rand_number from tab union all select name, rand_number from tab ` !spark2.3.png! Why does Spark3.1.2 return different results? Has anyone encountered this problem? was: In Spark3.1, the `with as` clause in the same SQL is executed multiple times, got different results ` with tab as ( select 'Withas' as name, rand() as rand_number ) select name, rand_number from tab union all select name, rand_number from tab ` !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965! But In spark2.3, it got consistent results ` with tab as ( select 'Withas' as name, rand() as rand_number ) select name, rand_number from tab union all select name, rand_number from tab ` !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468! Why does Spark3.1.2 return different results? Has anyone encountered this problem? 
> `with as` clause got inconsistent results > - > > Key: SPARK-37382 > URL: https://issues.apache.org/jira/browse/SPARK-37382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: spark2.3.png, spark3.1.png > > > In Spark3.1, the `with as` clause in the same SQL is executed multiple times, > got different results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !spark3.1.png! > But In spark2.3, it got consistent results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !spark2.3.png! > Why does Spark3.1.2 return different results? > Has anyone encountered this problem? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
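A plausible explanation for the difference reported in SPARK-37382 (hedged: the snippet below simulates the behavior in plain Python rather than quoting Spark internals) is that when the CTE is inlined, the non-deterministic rand() expression is re-evaluated at each reference, while evaluating the CTE once and reusing its result gives matching values across the UNION branches.

```python
# Simulation of CTE inlining vs. one-time evaluation for a
# non-deterministic expression (illustrative; not Spark code).
import random

def rand_expr():
    # Stands in for SQL rand(): non-deterministic per evaluation.
    return random.random()

# Inlined CTE: each branch of the UNION re-evaluates rand(),
# so the two "copies" of the same row disagree.
random.seed(0)
inlined = [rand_expr(), rand_expr()]

# CTE evaluated once: both branches reuse the same materialized row,
# so the values match.
random.seed(0)
once = rand_expr()
reused = [once, once]

print(inlined[0] != inlined[1], reused[0] == reused[1])  # → True True
```

This is why the same query can return consistent results under one planner and inconsistent results under another: the observable behavior depends on whether the non-deterministic subquery is shared or duplicated.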
[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37382: --- Attachment: spark2.3.png > `with as` clause got inconsistent results > - > > Key: SPARK-37382 > URL: https://issues.apache.org/jira/browse/SPARK-37382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: spark2.3.png, spark3.1.png > > > In Spark3.1, the `with as` clause in the same SQL is executed multiple times, > got different results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965! > But In spark2.3, it got consistent results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468! > Why does Spark3.1.2 return different results? > Has anyone encountered this problem? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37382: --- Attachment: spark3.1.png > `with as` clause got inconsistent results > - > > Key: SPARK-37382 > URL: https://issues.apache.org/jira/browse/SPARK-37382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: caican >Priority: Major > Attachments: spark3.1.png > > > In Spark3.1, the `with as` clause in the same SQL is executed multiple times, > got different results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965! > But In spark2.3, it got consistent results > ` > with tab as ( > select 'Withas' as name, rand() as rand_number > ) > select name, rand_number > from tab > union all > select name, rand_number > from tab > ` > !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468! > Why does Spark3.1.2 return different results? > Has anyone encountered this problem? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447277#comment-17447277 ] caican commented on SPARK-37382: [~zhenw] Thank you for your reply, I will test it out.
[jira] [Updated] (SPARK-37383) Print the parsing time for each phase of a SQL
[ https://issues.apache.org/jira/browse/SPARK-37383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37383: --- Affects Version/s: 2.4.0 (was: 3.2.0)

> Print the parsing time for each phase of a SQL
> --
>
> Key: SPARK-37383
> URL: https://issues.apache.org/jira/browse/SPARK-37383
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: caican
> Priority: Major
>
> The time spent in each phase of a SQL query is counted and recorded in QueryPlanningTracker, but it is not surfaced anywhere. When SQL parsing is suspected to be slow, we cannot confirm which phase is slow; therefore, it is necessary to print out the per-phase time.
[jira] [Updated] (SPARK-37383) Print the parsing time for each phase of a SQL
[ https://issues.apache.org/jira/browse/SPARK-37383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37383: --- Summary: Print the parsing time for each phase of a SQL (was: Prints the parsing time for each phase of a SQL)
[jira] [Created] (SPARK-37383) Prints the parsing time for each phase of a SQL
caican created SPARK-37383: -- Summary: Prints the parsing time for each phase of a SQL Key: SPARK-37383 URL: https://issues.apache.org/jira/browse/SPARK-37383 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: caican

The time spent in each phase of a SQL query is counted and recorded in QueryPlanningTracker, but it is not surfaced anywhere. When SQL parsing is suspected to be slow, we cannot confirm which phase is slow; therefore, it is necessary to print out the per-phase time.
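The idea in the issue above — the per-phase durations are already being recorded, they only need a readable summary printed somewhere — can be sketched outside Spark with a minimal stand-in for QueryPlanningTracker. The `PhaseTracker` class and its method names below are illustrative assumptions, not Spark's actual API:

```python
import time
from collections import OrderedDict


class PhaseTracker:
    """Toy analogue of Spark's QueryPlanningTracker: record the wall-clock
    duration of each planning phase, then expose a printable summary."""

    def __init__(self):
        self.phases = OrderedDict()  # phase name -> duration in milliseconds

    def measure(self, name, fn):
        # Run one phase and record how long it took.
        start = time.perf_counter()
        result = fn()
        self.phases[name] = (time.perf_counter() - start) * 1000.0
        return result

    def summary(self):
        # This is the part the issue asks for: the durations already exist,
        # they just need to be formatted and printed.
        return ", ".join(f"{n}: {ms:.2f} ms" for n, ms in self.phases.items())


tracker = PhaseTracker()
tracker.measure("parsing", lambda: sum(range(10_000)))
tracker.measure("analysis", lambda: sum(range(10_000)))
print(tracker.summary())
```

In Spark itself the recorded phases would be parsing, analysis, optimization, and planning; the proposal amounts to logging a summary line like the one `summary()` produces.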
[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caican updated SPARK-37382: --- Description: In Spark 3.1, a `with as` clause referenced multiple times in the same SQL statement returns different results:
{code:sql}
with tab as (
  select 'Withas' as name, rand() as rand_number
)
select name, rand_number from tab
union all
select name, rand_number from tab
{code}
!https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965!
But in Spark 2.3, the same query returns consistent results:
!https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468!
Why does Spark 3.1.2 return different results? Has anyone encountered this problem?
[jira] [Created] (SPARK-37382) `with as` clause got inconsistent results
caican created SPARK-37382: -- Summary: `with as` clause got inconsistent results Key: SPARK-37382 URL: https://issues.apache.org/jira/browse/SPARK-37382 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2 Reporter: caican

In Spark 3.1, a `with as` clause referenced multiple times in the same SQL statement returns different results:
{code:sql}
with tab as (
  select 'Withas' as name, rand() as rand_number
)
select name, rand_number from tab
union all
select name, rand_number from tab
{code}
!https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965!
But in Spark 2.3, the same query returns consistent results:
!https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468!
Has anyone encountered this problem?
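The inconsistency reported above is typically a consequence of CTE inlining: when the optimizer substitutes the CTE body into each place it is referenced, a non-deterministic expression such as `rand()` is evaluated once per reference instead of once overall. The usual workaround is to materialize the non-deterministic result before referencing it twice. A minimal sketch of that workaround using SQLite (a stand-in engine chosen only because it ships with Python; the temp-table approach, not Spark's API, is what is being illustrated):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Workaround mirroring "materialize the CTE first": evaluate the
# non-deterministic expression exactly once into a temp table, then let
# both branches of the UNION ALL read the stored copy.
conn.execute(
    "CREATE TEMP TABLE tab AS SELECT 'Withas' AS name, random() AS rand_number"
)
rows = conn.execute(
    "SELECT name, rand_number FROM tab "
    "UNION ALL "
    "SELECT name, rand_number FROM tab"
).fetchall()

# Both rows come from the same stored value, so they must be identical.
assert rows[0] == rows[1]
print(rows)
```

In Spark itself the analogous fix would be to persist the intermediate result (for example, caching the DataFrame) before reusing it; whether a CTE gets inlined is optimizer- and version-dependent, which would account for the Spark 2.3 vs. 3.1 difference reported here.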