[jira] [Updated] (SPARK-36093) The result is incorrect if the partition path case is inconsistent
[ https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-36093: Labels: correctness (was: ) > The result is incorrect if the partition path case is inconsistent > --- > > Key: SPARK-36093 > URL: https://issues.apache.org/jira/browse/SPARK-36093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > Labels: correctness > > Please reproduce this issue using HDFS; local HDFS cannot reproduce it. > {code:scala} > sql("create table t1(cal_dt date) using parquet") > sql("insert into t1 values > (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')") > sql("create view t1_v as select * from t1") > sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS > FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'") > sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN > '2021-06-29' AND '2021-06-30'") > sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND > '2021-06-30'").show > sql("SELECT * FROM t2 ").show > {code} > {noformat} > // It should not be empty. > scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND > '2021-06-30'").show > ++--+ > |FLAG|CAL_DT| > ++--+ > ++--+ > scala> sql("SELECT * FROM t2 ").show > ++--+ > |FLAG|CAL_DT| > ++--+ > | 1|2021-06-27| > | 1|2021-06-28| > ++--+ > scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN > '2021-06-29' AND '2021-06-30'").show > ++--+ > |FLAG|CAL_DT| > ++--+ > | 2|2021-06-29| > | 2|2021-06-30| > ++--+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36093) The result is incorrect if the partition path case is inconsistent
Yuming Wang created SPARK-36093: --- Summary: The result is incorrect if the partition path case is inconsistent Key: SPARK-36093 URL: https://issues.apache.org/jira/browse/SPARK-36093 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang Please reproduce this issue using HDFS; local HDFS cannot reproduce it. {code:scala} sql("create table t1(cal_dt date) using parquet") sql("insert into t1 values (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')") sql("create view t1_v as select * from t1") sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'") sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'") sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show sql("SELECT * FROM t2 ").show {code} {noformat} // It should not be empty. scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show ++--+ |FLAG|CAL_DT| ++--+ ++--+ scala> sql("SELECT * FROM t2 ").show ++--+ |FLAG|CAL_DT| ++--+ | 1|2021-06-27| | 1|2021-06-28| ++--+ scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show ++--+ |FLAG|CAL_DT| ++--+ | 2|2021-06-29| | 2|2021-06-30| ++--+ {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
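The mismatch is easiest to see on the file system. A minimal diagnostic sketch (the listing logic is mine; the expectation that the CTAS and the later INSERT produce partition directories with different column-name case, e.g. CAL_DT=... vs cal_dt=..., is an assumption inferred from the issue title):

{code:scala}
import org.apache.hadoop.fs.Path

// Resolve the physical location of t2 from its catalog metadata, then list
// the partition directories it actually contains on HDFS.
val location = spark.sql("DESC FORMATTED t2")
  .where("col_name = 'Location'")
  .head.getString(1)
val path = new Path(location)
val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(path).foreach(s => println(s.getPath.getName))
{code}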
[jira] [Created] (SPARK-36086) The case of the delta table is inconsistent with parquet
Yuming Wang created SPARK-36086: --- Summary: The case of the delta table is inconsistent with parquet Key: SPARK-36086 URL: https://issues.apache.org/jira/browse/SPARK-36086 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Reporter: Yuming Wang How to reproduce this issue: {noformat} 1. Add delta-core_2.12-1.0.0-SNAPSHOT.jar to ${SPARK_HOME}/jars. 2. bin/spark-shell --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog {noformat} {code:scala} spark.sql("create table t1 using parquet as select id, id as lower_id from range(5)") spark.sql("CREATE VIEW v1 as SELECT * FROM t1") spark.sql("CREATE TABLE t2 USING DELTA PARTITIONED BY (LOWER_ID) SELECT LOWER_ID, ID FROM v1") spark.sql("CREATE TABLE t3 USING PARQUET PARTITIONED BY (LOWER_ID) SELECT LOWER_ID, ID FROM v1") spark.sql("desc extended t2").show(false) spark.sql("desc extended t3").show(false) {code} {noformat} scala> spark.sql("desc extended t2").show(false) ++--+---+ |col_name|data_type |comment| ++--+---+ |lower_id|bigint | | |id |bigint | | || | | |# Partitioning | | | |Part 0 |lower_id | | || | | |# Detailed Table Information| | | |Name|default.t2 | | |Location |file:/Users/yumwang/Downloads/spark-3.1.1-bin-hadoop2.7/spark-warehouse/t2| | |Provider|delta | | |Table Properties |[Type=MANAGED,delta.minReaderVersion=1,delta.minWriterVersion=2] | | ++--+---+ scala> spark.sql("desc extended t3").show(false) ++--+---+ |col_name|data_type |comment| ++--+---+ |ID |bigint |null | |LOWER_ID|bigint |null | |# Partition Information | | | |# col_name |data_type |comment| |LOWER_ID|bigint |null | || | | |# Detailed Table Information| | | |Database|default | | |Table |t3 | | |Owner |yumwang | | |Created Time|Mon Jul 12 14:07:16 CST 2021 | | |Last Access |UNKNOWN | | |Created By |Spark 3.1.1 | | |Type|MANAGED | | |Provider|PARQUET
[jira] [Created] (SPARK-36080) Broadcast join outer join stream side
Yuming Wang created SPARK-36080: --- Summary: Broadcast join outer join stream side Key: SPARK-36080 URL: https://issues.apache.org/jira/browse/SPARK-36080 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35991) Add PlanStability suite for TPCH
[ https://issues.apache.org/jira/browse/SPARK-35991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35991: Affects Version/s: (was: 3.1.2) (was: 3.2.0) 3.3.0 > Add PlanStability suite for TPCH > > > Key: SPARK-35991 > URL: https://issues.apache.org/jira/browse/SPARK-35991 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35991) Add PlanStability suite for TPCH
[ https://issues.apache.org/jira/browse/SPARK-35991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35991: Issue Type: Improvement (was: Bug) > Add PlanStability suite for TPCH > > > Key: SPARK-35991 > URL: https://issues.apache.org/jira/browse/SPARK-35991 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35908) Remove repartition if the child maximum number of rows is less than or equal to 1
[ https://issues.apache.org/jira/browse/SPARK-35908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-35908. - Resolution: Not A Problem > Remove repartition if the child maximum number of rows is less than or equal to 1 > -- > > Key: SPARK-35908 > URL: https://issues.apache.org/jira/browse/SPARK-35908 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > spark.sql("select count(*) from range(1, 10, 2, > 2)").repartition(10).explain("cost") > {code} > Current optimized logical plan: > {noformat} > == Optimized Logical Plan == > Repartition 10, true, Statistics(sizeInBytes=16.0 B, rowCount=1) > +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, > rowCount=1) >+- Project, Statistics(sizeInBytes=20.0 B) > +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 > B, rowCount=5) > {noformat} > Expected optimized logical plan: > {noformat} > == Optimized Logical Plan == > Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, > rowCount=1) > +- Project, Statistics(sizeInBytes=20.0 B) >+- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, > rowCount=5) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35967) Update nullability based on column statistics
[ https://issues.apache.org/jira/browse/SPARK-35967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35967: Summary: Update nullability based on column statistics (was: Update nullability base on column statistics) > Update nullability based on column statistics > - > > Key: SPARK-35967 > URL: https://issues.apache.org/jira/browse/SPARK-35967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35967) Update nullability base on column statistics
Yuming Wang created SPARK-35967: --- Summary: Update nullability base on column statistics Key: SPARK-35967 URL: https://issues.apache.org/jira/browse/SPARK-35967 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
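The idea behind the ticket, as a minimal sketch (the helper is hypothetical, not the merged Spark rule): a column whose statistics record zero nulls can have its attribute narrowed to non-nullable, which lets later rules drop IsNotNull filters and simplify joins.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat

// Hypothetical helper: tighten nullability when statistics prove no nulls exist.
def tightenNullability(attr: AttributeReference, stat: ColumnStat): AttributeReference =
  if (stat.nullCount.contains(BigInt(0))) attr.withNullability(false) else attr
{code}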
[jira] [Updated] (SPARK-35904) Collapse above RebalancePartitions
[ https://issues.apache.org/jira/browse/SPARK-35904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35904: Fix Version/s: (was: 3.2.0) > Collapse above RebalancePartitions > -- > > Key: SPARK-35904 > URL: https://issues.apache.org/jira/browse/SPARK-35904 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Make RebalancePartitions extends RepartitionOperation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-35904) Collapse above RebalancePartitions
[ https://issues.apache.org/jira/browse/SPARK-35904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reopened SPARK-35904: - Reverted at https://github.com/apache/spark/commit/108635af1708173a72bec0e36bf3f2cea5b088c4 > Collapse above RebalancePartitions > -- > > Key: SPARK-35904 > URL: https://issues.apache.org/jira/browse/SPARK-35904 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > Make RebalancePartitions extends RepartitionOperation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35908) Remove repartition if the child maximum number of rows is less than or equal to 1
[ https://issues.apache.org/jira/browse/SPARK-35908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35908: Description: {code:scala} spark.sql("select count(*) from range(1, 10, 2, 2)").repartition(10).explain("cost") {code} Current optimized logical plan: {noformat} == Optimized Logical Plan == Repartition 10, true, Statistics(sizeInBytes=16.0 B, rowCount=1) +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1) +- Project, Statistics(sizeInBytes=20.0 B) +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5) {noformat} Expected optimized logical plan: {noformat} == Optimized Logical Plan == Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1) +- Project, Statistics(sizeInBytes=20.0 B) +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5) {noformat} was: {code:scala} spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost") {code} Current optimized logical plan: {noformat} == Optimized Logical Plan == Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B) +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1) +- Project, Statistics(sizeInBytes=20.0 B) +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5) {noformat} Expected optimized logical plan: {noformat} == Optimized Logical Plan == Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1) +- Project, Statistics(sizeInBytes=20.0 B) +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5) {noformat} > Remove repartition if the child maximum number of rows is less than or equal to 1 > -- > > Key: SPARK-35908 > URL: https://issues.apache.org/jira/browse/SPARK-35908 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > spark.sql("select count(*) from range(1, 10, 2, > 2)").repartition(10).explain("cost") > {code} > Current optimized logical plan: > {noformat} > == Optimized Logical Plan == > Repartition 10, true, Statistics(sizeInBytes=16.0 B, rowCount=1) > +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, > rowCount=1) >+- Project, Statistics(sizeInBytes=20.0 B) > +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 > B, rowCount=5) > {noformat} > Expected optimized logical plan: > {noformat} > == Optimized Logical Plan == > Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, > rowCount=1) > +- Project, Statistics(sizeInBytes=20.0 B) >+- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, > rowCount=5) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35908) Remove repartition if the child maximum number of rows is less than or equal to 1
Yuming Wang created SPARK-35908: --- Summary: Remove repartition if the child maximum number of rows is less than or equal to 1 Key: SPARK-35908 URL: https://issues.apache.org/jira/browse/SPARK-35908 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang {code:scala} spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost") {code} Current optimized logical plan: {noformat} == Optimized Logical Plan == Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B) +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1) +- Project, Statistics(sizeInBytes=20.0 B) +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5) {noformat} Expected optimized logical plan: {noformat} == Optimized Logical Plan == Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1) +- Project, Statistics(sizeInBytes=20.0 B) +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
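A sketch of the optimization this ticket proposes (the rule name is hypothetical, and note the ticket was later resolved as Not A Problem): a Repartition whose child is known to emit at most one row is a no-op, so it can be elided using the plan's maxRows estimate.

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Repartition}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: repartitioning at most one row cannot change the result,
// so the Repartition node is replaced by its child.
object RemoveRedundantRepartition extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case r: Repartition if r.child.maxRows.exists(_ <= 1) => r.child
  }
}
{code}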
[jira] [Created] (SPARK-35906) Remove order by if the maximum number of rows is less than or equal to 1
Yuming Wang created SPARK-35906: --- Summary: Remove order by if the maximum number of rows is less than or equal to 1 Key: SPARK-35906 URL: https://issues.apache.org/jira/browse/SPARK-35906 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang {code:scala} spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost") {code} Current optimized logical plan: {noformat} == Optimized Logical Plan == Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B) +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1) +- Project, Statistics(sizeInBytes=20.0 B) +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5) {noformat} Expected optimized logical plan: {noformat} == Optimized Logical Plan == Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1) +- Project, Statistics(sizeInBytes=20.0 B) +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
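The same maxRows reasoning, sketched for this ticket (hypothetical rule name; in Spark the existing EliminateSorts rule is the natural home for such a check): sorting at most one row cannot change the result.

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: a Sort over an at-most-one-row child is a no-op.
object RemoveSingleRowSort extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case Sort(_, _, child) if child.maxRows.exists(_ <= 1) => child
  }
}
{code}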
[jira] [Updated] (SPARK-35904) Collapse above RebalancePartitions
[ https://issues.apache.org/jira/browse/SPARK-35904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35904: Description: Make RebalancePartitions extends RepartitionOperation. > Collapse above RebalancePartitions > -- > > Key: SPARK-35904 > URL: https://issues.apache.org/jira/browse/SPARK-35904 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > Make RebalancePartitions extends RepartitionOperation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35904) Collapse above RebalancePartitions
Yuming Wang created SPARK-35904: --- Summary: Collapse above RebalancePartitions Key: SPARK-35904 URL: https://issues.apache.org/jira/browse/SPARK-35904 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35886) Codegen issue for decimal type
Yuming Wang created SPARK-35886: --- Summary: Codegen issue for decimal type Key: SPARK-35886 URL: https://issues.apache.org/jira/browse/SPARK-35886 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang How to reproduce this issue: {code:scala} spark.sql( """ |CREATE TABLE t1 ( | c1 DECIMAL(18,6), | c2 DECIMAL(18,6), | c3 DECIMAL(18,6)) |USING parquet; |""".stripMargin) spark.sql("SELECT sum(c1 * c3) + sum(c2 * c3) FROM t1").show {code} {noformat} 20:23:36.272 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 6: Expression "agg_exprIsNull_2_0" is not an rvalue org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 6: Expression "agg_exprIsNull_2_0" is not an rvalue at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12675) at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7676) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
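Whole-stage codegen normally falls back to interpreted execution after such a compile error, but the failure is noisy and wastes a compilation attempt. A hedged workaround sketch (the config is a standard Spark setting; its effect on this exact query is assumed, not verified):

{code:scala}
// Skip whole-stage code generation so the aggregate runs on the interpreted path.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql("SELECT sum(c1 * c3) + sum(c2 * c3) FROM t1").show()
{code}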
[jira] [Assigned] (SPARK-34807) Push down filter through window after TransposeWindow
[ https://issues.apache.org/jira/browse/SPARK-34807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-34807: --- Assignee: Tanel Kiis > Push down filter through window after TransposeWindow > - > > Key: SPARK-34807 > URL: https://issues.apache.org/jira/browse/SPARK-34807 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Tanel Kiis >Priority: Major > > {code:scala} > spark.range(10).selectExpr("id AS a", "id AS b", "id AS c", "id AS > d").createTempView("t1") > val df = spark.sql( > """ > |SELECT * > | FROM ( > |SELECT b, > | sum(d) OVER (PARTITION BY a, b), > | rank() OVER (PARTITION BY a ORDER BY c) > |FROM t1 > | ) v1 > |WHERE b = 2 > |""".stripMargin) > {code} > Current optimized plan: > {noformat} > == Optimized Logical Plan == > Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED > PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY > c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232] > +- Filter (b#221L = 2) >+- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC NULLS > FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST ROWS BETWEEN > UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC NULLS FIRST] > +- Project [b#221L, a#220L, c#222L, sum(d) OVER (PARTITION BY a, b ROWS > BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#231L] > +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L] > +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS > a#220L, id#218L AS c#222L] >+- Range (0, 10, step=1, splits=Some(2)) > {noformat} > Expected optimized plan: > {noformat} > == Optimized Logical Plan == > Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED > PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY > c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232] > +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L] >+- Project [b#221L, d#223L, a#220L, RANK() OVER (PARTITION BY a ORDER BY c > ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232] > +- Filter (b#221L = 2) > +- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC > NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), > currentrow$())) AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC > NULLS FIRST] > +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS > a#220L, id#218L AS c#222L] >+- Range (0, 10, step=1, splits=Some(2)) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34807) Push down filter through window after TransposeWindow
[ https://issues.apache.org/jira/browse/SPARK-34807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-34807. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31980 [https://github.com/apache/spark/pull/31980] > Push down filter through window after TransposeWindow > - > > Key: SPARK-34807 > URL: https://issues.apache.org/jira/browse/SPARK-34807 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Tanel Kiis >Priority: Major > Fix For: 3.2.0 > > > {code:scala} > spark.range(10).selectExpr("id AS a", "id AS b", "id AS c", "id AS > d").createTempView("t1") > val df = spark.sql( > """ > |SELECT * > | FROM ( > |SELECT b, > | sum(d) OVER (PARTITION BY a, b), > | rank() OVER (PARTITION BY a ORDER BY c) > |FROM t1 > | ) v1 > |WHERE b = 2 > |""".stripMargin) > {code} > Current optimized plan: > {noformat} > == Optimized Logical Plan == > Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED > PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY > c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232] > +- Filter (b#221L = 2) >+- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC NULLS > FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST ROWS BETWEEN > UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC NULLS FIRST] > +- Project [b#221L, a#220L, c#222L, sum(d) OVER (PARTITION BY a, b ROWS > BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#231L] > +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L] > +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS > a#220L, id#218L AS c#222L] >+- Range (0, 10, step=1, splits=Some(2)) > {noformat} > Expected optimized plan: > {noformat} > == Optimized Logical Plan == > Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED > PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY > c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232] > +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L] >+- Project [b#221L, d#223L, a#220L, RANK() OVER (PARTITION BY a ORDER BY c > ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232] > +- Filter (b#221L = 2) > +- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC > NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), > currentrow$())) AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC > NULLS FIRST] > +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS > a#220L, id#218L AS c#222L] >+- Range (0, 10, step=1, splits=Some(2)) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
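The core condition behind the expected plan, as a simplified sketch (the rule object is illustrative; the real logic lives in Spark's predicate-pushdown rules): a deterministic filter may move below a Window only when every attribute it references belongs to that window's partition spec, because partitioning already groups rows by those values, so the filter cannot change any window result.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Window}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative rule: push a filter on partition columns below the window.
object PushFilterThroughWindow extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case Filter(cond, w: Window)
        if cond.deterministic &&
          cond.references.subsetOf(AttributeSet(w.partitionSpec.flatMap(_.references))) =>
      w.copy(child = Filter(cond, w.child))
  }
}
{code}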
[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems
[ https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35837: Description: Teradata supports [Recommendations for Common Query Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg]. We can implement a similar feature. 1. Detect the most skew values for join. The user decides whether these are needed. 2. Detect the most skew values for window function. The user decides whether these are needed. 3. Detect the bucket read, for example, Analyzer add a cast to bucket column. 4. Recommend the user add a filter condition to the partition column of the partition table. 5. Check the condition of join, for example, the result of cast string to double may be incorrect. For example: {code:sql} 0: jdbc:hive2://localhost:1/default> EXPLAIN RECOMMENDATION 0: jdbc:hive2://localhost:1/default> SELECT a.*, 0: jdbc:hive2://localhost:1/default>CASE 0: jdbc:hive2://localhost:1/default> WHEN ( NOT ( a.exclude = 1 0: jdbc:hive2://localhost:1/default> AND a.cobrand = 6 0: jdbc:hive2://localhost:1/default> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) 0: jdbc:hive2://localhost:1/default> AND ( a.valid_page_count = 1 ) THEN 1 0: jdbc:hive2://localhost:1/default> ELSE 0 0: jdbc:hive2://localhost:1/default>END AS is_singlepage, 0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name 0: jdbc:hive2://localhost:1/default> FROM (SELECT * 0: jdbc:hive2://localhost:1/default> FROM (SELECT *, 0: jdbc:hive2://localhost:1/default>'VI' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl1 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'SRP' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl2 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'My Garage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl3 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'Motors Homepage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl4) t 0: jdbc:hive2://localhost:1/default> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND ( 0: jdbc:hive2://localhost:1/default> CURRENT_DATE() - 10 )) a 0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id, 0: jdbc:hive2://localhost:1/default> item_site_id, 0: jdbc:hive2://localhost:1/default> auct_end_dt, 0: jdbc:hive2://localhost:1/default> leaf_categ_id 0: jdbc:hive2://localhost:1/default> FROM tbl5 0: jdbc:hive2://localhost:1/default> WHERE auct_end_dt >= ( '2020-01-01' )) itm 0: jdbc:hive2://localhost:1/default> ON a.item_id = itm.item_id 0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca 0: jdbc:hive2://localhost:1/default> ON itm.leaf_categ_id = ca.leaf_categ_id 0: jdbc:hive2://localhost:1/default> AND itm.item_site_id = ca.site_id; +-+--+ | result | +-+--+ | 1. Detect the most skew values for join | | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) | | table: tbl5 | | leaf_categ_id, item_site_id, count | | 171243, 0, 115412614 | | 176984, 3, 81003252
[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems
[ https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35837: Description: Teradata supports [Recommendations for Common Query Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg]. We can implement a similar feature. 1. Detect the most skew values for join. The user decides whether these are needed. 2. Detect the most skew values for window function. The user decides whether these are needed. 3. Detect the bucket read, for example, Analyzer add a cast to bucket column. 4. Recommend the user add a filter condition to the partition column of the partition table. 5. Check the condition of join, for example, the result of cast string to double may be incorrect. For example: {code:sql} 0: jdbc:hive2://localhost:1/default> EXPLAIN OPTIMIZE 0: jdbc:hive2://localhost:1/default> SELECT a.*, 0: jdbc:hive2://localhost:1/default>CASE 0: jdbc:hive2://localhost:1/default> WHEN ( NOT ( a.exclude = 1 0: jdbc:hive2://localhost:1/default> AND a.cobrand = 6 0: jdbc:hive2://localhost:1/default> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) 0: jdbc:hive2://localhost:1/default> AND ( a.valid_page_count = 1 ) THEN 1 0: jdbc:hive2://localhost:1/default> ELSE 0 0: jdbc:hive2://localhost:1/default>END AS is_singlepage, 0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name 0: jdbc:hive2://localhost:1/default> FROM (SELECT * 0: jdbc:hive2://localhost:1/default> FROM (SELECT *, 0: jdbc:hive2://localhost:1/default>'VI' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl1 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'SRP' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl2 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'My Garage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl3 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'Motors Homepage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl4) t 0: jdbc:hive2://localhost:1/default> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND ( 0: jdbc:hive2://localhost:1/default> CURRENT_DATE() - 10 )) a 0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id, 0: jdbc:hive2://localhost:1/default> item_site_id, 0: jdbc:hive2://localhost:1/default> auct_end_dt, 0: jdbc:hive2://localhost:1/default> leaf_categ_id 0: jdbc:hive2://localhost:1/default> FROM tbl5 0: jdbc:hive2://localhost:1/default> WHERE auct_end_dt >= ( '2020-01-01' )) itm 0: jdbc:hive2://localhost:1/default> ON a.item_id = itm.item_id 0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca 0: jdbc:hive2://localhost:1/default> ON itm.leaf_categ_id = ca.leaf_categ_id 0: jdbc:hive2://localhost:1/default> AND itm.item_site_id = ca.site_id; +-+--+ | result | +-+--+ | 1. Detect the most skew values for join | | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) | | table: tbl5 | | leaf_categ_id, item_site_id, count | | 171243, 0, 115412614 | | 176984, 3, 81003252
[jira] [Created] (SPARK-35837) Recommendations for Common Query Problems
Yuming Wang created SPARK-35837: --- Summary: Recommendations for Common Query Problems Key: SPARK-35837 URL: https://issues.apache.org/jira/browse/SPARK-35837 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang Teradata supports [Recommendations for Common Query Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg]. We can implement a similar feature. 1. Detect the most skew values for join. The user decides whether these are needed. 2. Detect the most skew values for window function. The user decides whether these are needed. 3. Detect it can be optimized to bucket read, for example, Analyzer add a cast to bucket column. 4.Recommend the user add a filter condition to the partition column of the partition table. 5. Check the condition of join, for example, the result of cast string to double may be incorrect. For example: {code:sql} 0: jdbc:hive2://localhost:1/default> EXPLAIN OPTIMIZE 0: jdbc:hive2://localhost:1/default> SELECT a.*, 0: jdbc:hive2://localhost:1/default>CASE 0: jdbc:hive2://localhost:1/default> WHEN ( NOT ( a.exclude = 1 0: jdbc:hive2://localhost:1/default> AND a.cobrand = 6 0: jdbc:hive2://localhost:1/default> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) 0: jdbc:hive2://localhost:1/default> AND ( a.valid_page_count = 1 ) THEN 1 0: jdbc:hive2://localhost:1/default> ELSE 0 0: jdbc:hive2://localhost:1/default>END AS is_singlepage, 0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name 0: jdbc:hive2://localhost:1/default> FROM (SELECT * 0: jdbc:hive2://localhost:1/default> FROM (SELECT *, 0: jdbc:hive2://localhost:1/default>'VI' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl1 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'SRP' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl2 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'My Garage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl3 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'Motors Homepage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl4) t 0: jdbc:hive2://localhost:1/default> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND ( 0: jdbc:hive2://localhost:1/default> CURRENT_DATE() - 10 )) a 0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id, 0: jdbc:hive2://localhost:1/default> item_site_id, 0: jdbc:hive2://localhost:1/default> auct_end_dt, 0: jdbc:hive2://localhost:1/default> leaf_categ_id 0: jdbc:hive2://localhost:1/default> FROM tbl5 0: jdbc:hive2://localhost:1/default> WHERE auct_end_dt >= ( '2020-01-01' )) itm 0: jdbc:hive2://localhost:1/default> ON a.item_id = itm.item_id 0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca 0: jdbc:hive2://localhost:1/default> ON itm.leaf_categ_id = ca.leaf_categ_id 0: jdbc:hive2://localhost:1/default> AND itm.item_site_id = ca.site_id; +-+--+ | result | +-+--+ | 1. Detect the most skew values for join | | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) | | table: tbl5 | | leaf_categ_id, item_site_id, count | | 171243, 0, 115412614
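Item 1 (skew detection) can be approximated by hand today; a sketch of the probe the feature would automate, using the table and join-key columns from the example output above:

{code:scala}
// Surface the heaviest join-key values in tbl5; strongly skewed keys stand
// out as outsized counts, like the 171243/0 row in the sample report.
spark.sql(
  """
    |SELECT leaf_categ_id, item_site_id, COUNT(*) AS cnt
    |FROM tbl5
    |GROUP BY leaf_categ_id, item_site_id
    |ORDER BY cnt DESC
    |LIMIT 10
  """.stripMargin).show()
{code}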
[jira] [Commented] (SPARK-35797) patternToRegex failed when pattern is star
[ https://issues.apache.org/jira/browse/SPARK-35797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17366080#comment-17366080 ] Yuming Wang commented on SPARK-35797: - How to reproduce this issue? > patternToRegex failed when pattern is star > -- > > Key: SPARK-35797 > URL: https://issues.apache.org/jira/browse/SPARK-35797 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.1.2 > Environment: Spark 3.1.1 >Reporter: YuanGuanhu >Priority: Major > Attachments: patternFailed.txt > > > we are using Hue to connect to the Spark JDBCServer with Spark 3.1.1; when we > query `select * from tb1` we get the exception in the attached file -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34120) Improve the statistics estimation
[ https://issues.apache.org/jira/browse/SPARK-34120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-34120: --- Assignee: Yuming Wang > Improve the statistics estimation > - > > Key: SPARK-34120 > URL: https://issues.apache.org/jira/browse/SPARK-34120 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > This umbrella ticket aims to track improvements in statistics estimates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34120) Improve the statistics estimation
[ https://issues.apache.org/jira/browse/SPARK-34120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-34120. - Fix Version/s: 3.2.0 Resolution: Fixed > Improve the statistics estimation > - > > Key: SPARK-34120 > URL: https://issues.apache.org/jira/browse/SPARK-34120 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > This umbrella ticket aims to track improvements in statistics estimates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35185) Improve Distinct statistics estimation
[ https://issues.apache.org/jira/browse/SPARK-35185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-35185: --- Assignee: Yuming Wang > Improve Distinct statistics estimation > -- > > Key: SPARK-35185 > URL: https://issues.apache.org/jira/browse/SPARK-35185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35185) Improve Distinct statistics estimation
[ https://issues.apache.org/jira/browse/SPARK-35185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-35185. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32291 [https://github.com/apache/spark/pull/32291] > Improve Distinct statistics estimation > -- > > Key: SPARK-35185 > URL: https://issues.apache.org/jira/browse/SPARK-35185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35786) Support optimizing repartition by expression in AQE
[ https://issues.apache.org/jira/browse/SPARK-35786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35786: Parent: SPARK-35793 Issue Type: Sub-task (was: Improvement) > Support optimizing repartition by expression in AQE > - > > Key: SPARK-35786 > URL: https://issues.apache.org/jira/browse/SPARK-35786 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Priority: Major > > Currently, we only support using LocalShuffleReader for `REPARTITION`. In some > cases, users also want to do this optimization with `REPARTITION_BY_COL`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30538) A not very elegant way to control small output files
[ https://issues.apache.org/jira/browse/SPARK-30538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30538: Parent: SPARK-35793 Issue Type: Sub-task (was: Improvement) > A not very elegant way to control small output files > > > Key: SPARK-30538 > URL: https://issues.apache.org/jira/browse/SPARK-30538 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35335) Improve CoalesceShufflePartitions to avoid generating small files
[ https://issues.apache.org/jira/browse/SPARK-35335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35335: Parent: SPARK-35793 Issue Type: Sub-task (was: Improvement) > Improve CoalesceShufflePartitions to avoid generating small files > -- > > Key: SPARK-35335 > URL: https://issues.apache.org/jira/browse/SPARK-35335 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35650) Coalesce small output files through AQE
[ https://issues.apache.org/jira/browse/SPARK-35650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35650: Parent: SPARK-35793 Issue Type: Sub-task (was: New Feature) > Coalesce small output files through AQE > --- > > Key: SPARK-35650 > URL: https://issues.apache.org/jira/browse/SPARK-35650 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > Add a new API to support coalesce small output files through AQE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
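Spark 3.x already exposes the building blocks this ticket works with (the configs below are standard AQE settings; that the new API layers on them is my assumption):

{code:scala}
// Let AQE merge small post-shuffle partitions before the final write.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// Advisory size per coalesced partition: larger values mean fewer, bigger files.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
{code}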
[jira] [Updated] (SPARK-35725) Support expanding partitions for repartition in AQE
[ https://issues.apache.org/jira/browse/SPARK-35725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35725: Parent Issue: SPARK-35793 (was: SPARK-33828) > Support expanding partitions for repartition in AQE > > > Key: SPARK-35725 > URL: https://issues.apache.org/jira/browse/SPARK-35725 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Priority: Major > > Currently, we don't support expanding partitions dynamically in AQE, which is > not friendly for some data-skew jobs. > Let's say we have a simple query: > {code:java} > SELECT * FROM table DISTRIBUTE BY col > {code} > If the column `col` is skewed, some shuffle partitions will handle much more > data than others. > Since no extra shuffle needs to be introduced, we can optimize this case by > expanding partitions in AQE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
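For reference, this line of work surfaced in Spark 3.2 as the REBALANCE hint (tying this ticket to that feature is my reading, not a statement from the ticket); a usage sketch against the query above:

{code:scala}
// Ask AQE to rebalance the output partitions: merge small ones and split
// skewed ones produced by the distribution column.
spark.sql("SELECT /*+ REBALANCE(col) */ * FROM table")
{code}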
[jira] [Updated] (SPARK-31264) Repartition by dynamic partition columns before insert table
[ https://issues.apache.org/jira/browse/SPARK-31264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31264: Parent: SPARK-35793 Issue Type: Sub-task (was: Improvement) > Repartition by dynamic partition columns before insert table > > > Key: SPARK-31264 > URL: https://issues.apache.org/jira/browse/SPARK-31264 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35793) Repartition before writing data source tables
Yuming Wang created SPARK-35793: --- Summary: Repartition before writing data source tables Key: SPARK-35793 URL: https://issues.apache.org/jira/browse/SPARK-35793 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang This umbrella ticket aims to track repartitioning before writing data source tables. It contains: # Repartition by the dynamic partition columns before writing dynamic partition tables. # Repartition before writing normal tables to avoid generating too many small files. # Improve the local shuffle reader. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
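Item 1 of the list, written out as user code (a sketch; the source table, target table, and dt column are placeholders): repartitioning by the dynamic partition column aligns the shuffle with the output directory layout, so each partition directory is written by as few tasks as possible.

{code:scala}
import org.apache.spark.sql.functions.col

// Manual form of the proposed optimization: shuffle by the dynamic partition
// column before writing, so files per partition directory stay few and large.
val df = spark.table("source_tbl") // placeholder source
df.repartition(col("dt"))
  .write
  .partitionBy("dt")
  .mode("overwrite")
  .saveAsTable("target_tbl")
{code}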
[jira] [Resolved] (SPARK-35556) Remove the close HiveClient's SessionState
[ https://issues.apache.org/jira/browse/SPARK-35556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-35556. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32693 [https://github.com/apache/spark/pull/32693] > Remove the close HiveClient's SessionState > -- > > Key: SPARK-35556 > URL: https://issues.apache.org/jira/browse/SPARK-35556 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.2.0 > > > There are some logs related to `java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;` > when run sql/hive module UTs > For example, when execute > {code:java} > mvn test -pl sql/hive > -DwildcardSuites=org.apache.spark.sql.hive.client.VersionsSuite -Dtest=none > {code} > the all tests passed but there will be some error log as follow: > {code:java} > Run completed in 17 minutes, 18 seconds. > Total number of tests run: 867 > Suites: completed 2, aborted 0 > Tests: succeeded 867, failed 0, canceled 0, ignored 1, pending 0 > All tests passed. > 15:04:02.407 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version > information not found in metastore. hive.metastore.schema.verification is not > enabled so recording the schema version 2.3.0 > 15:04:02.408 WARN org.apache.hadoop.hive.metastore.ObjectStore: > setMetaStoreSchemaVersion called but recording version is disabled: version = > 2.3.0, comment = Set by MetaStore yangjie01@0.2.30.21 > 15:04:02.441 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get > database default, returning NoSuchObjectException > 15:04:03.140 ERROR org.apache.spark.util.Utils: Uncaught exception in thread > shutdown-hook-0 > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$closeState$1(HiveClientImpl.scala:168) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:312) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:243) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:242) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:292) > at > org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:158) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) > at > org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994) > at > org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 15:04:03.141 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook > '$anon$2' failed, java.util.concurrent.ExecutionException: > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:206) > at > org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95) > Caused by: java.lang.NoSuchMethodError: >
[jira] [Assigned] (SPARK-35556) Remove the close HiveClient's SessionState
[ https://issues.apache.org/jira/browse/SPARK-35556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-35556: --- Assignee: Yang Jie > Remove the close HiveClient's SessionState > -- > > Key: SPARK-35556 > URL: https://issues.apache.org/jira/browse/SPARK-35556 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > There are some logs related to `java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;` > when run sql/hive module UTs > For example, when execute > {code:java} > mvn test -pl sql/hive > -DwildcardSuites=org.apache.spark.sql.hive.client.VersionsSuite -Dtest=none > {code} > the all tests passed but there will be some error log as follow: > {code:java} > Run completed in 17 minutes, 18 seconds. > Total number of tests run: 867 > Suites: completed 2, aborted 0 > Tests: succeeded 867, failed 0, canceled 0, ignored 1, pending 0 > All tests passed. > 15:04:02.407 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version > information not found in metastore. hive.metastore.schema.verification is not > enabled so recording the schema version 2.3.0 > 15:04:02.408 WARN org.apache.hadoop.hive.metastore.ObjectStore: > setMetaStoreSchemaVersion called but recording version is disabled: version = > 2.3.0, comment = Set by MetaStore yangjie01@0.2.30.21 > 15:04:02.441 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get > database default, returning NoSuchObjectException > 15:04:03.140 ERROR org.apache.spark.util.Utils: Uncaught exception in thread > shutdown-hook-0 > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$closeState$1(HiveClientImpl.scala:168) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:312) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:243) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:242) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:292) > at > org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:158) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) > at > org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994) > at > org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 15:04:03.141 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook > '$anon$2' failed, java.util.concurrent.ExecutionException: > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:206) > at > org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95) > Caused by: java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at >
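Editor's note: the issue was resolved by removing the close call entirely (per the retitled summary above). For readers who only want to silence the same log noise on an existing build, a minimal defensive sketch, assuming a caller that owns the Hive SessionState; the helper name is illustrative and this is not the actual Spark patch:
{code:scala}
import org.apache.hadoop.hive.ql.session.SessionState

// SessionState.close() internally calls getTmpErrOutputFile(), which does not
// exist in every supported Hive version, hence the NoSuchMethodError above.
def closeStateQuietly(state: SessionState): Unit =
  try {
    state.close()
  } catch {
    case e: NoSuchMethodError =>
      // Benign during shutdown: the temp-file accessor is absent in this Hive version.
      System.err.println(s"Ignoring incompatible SessionState.close(): $e")
  }
{code}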
[jira] [Updated] (SPARK-35556) Avoid log NoSuchMethodError when HiveClientImpl.state close
[ https://issues.apache.org/jira/browse/SPARK-35556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35556: Component/s: (was: Tests) > Avoid log NoSuchMethodError when HiveClientImpl.state close > > > Key: SPARK-35556 > URL: https://issues.apache.org/jira/browse/SPARK-35556 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > There are some logs related to `java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;` > when running the sql/hive module UTs > For example, when executing > {code:java} > mvn test -pl sql/hive > -DwildcardSuites=org.apache.spark.sql.hive.client.VersionsSuite -Dtest=none > {code} > all tests pass, but there are error logs like the following: > {code:java} > Run completed in 17 minutes, 18 seconds. > Total number of tests run: 867 > Suites: completed 2, aborted 0 > Tests: succeeded 867, failed 0, canceled 0, ignored 1, pending 0 > All tests passed. > 15:04:02.407 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version > information not found in metastore. hive.metastore.schema.verification is not > enabled so recording the schema version 2.3.0 > 15:04:02.408 WARN org.apache.hadoop.hive.metastore.ObjectStore: > setMetaStoreSchemaVersion called but recording version is disabled: version = > 2.3.0, comment = Set by MetaStore yangjie01@0.2.30.21 > 15:04:02.441 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get > database default, returning NoSuchObjectException > 15:04:03.140 ERROR org.apache.spark.util.Utils: Uncaught exception in thread > shutdown-hook-0 > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$closeState$1(HiveClientImpl.scala:168) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:312) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:243) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:242) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:292) > at > org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:158) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) > at > org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994) > at > org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 15:04:03.141 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook > '$anon$2' failed, java.util.concurrent.ExecutionException: > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:206) > at > org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95) > Caused by: java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at >
[jira] [Updated] (SPARK-35556) Remove the close HiveClient's SessionState
[ https://issues.apache.org/jira/browse/SPARK-35556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35556: Summary: Remove the close HiveClient's SessionState (was: Avoid log NoSuchMethodError when HiveClientImpl.state close) > Remove the close HiveClient's SessionState > -- > > Key: SPARK-35556 > URL: https://issues.apache.org/jira/browse/SPARK-35556 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > There are some logs related to `java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;` > when running the sql/hive module UTs > For example, when executing > {code:java} > mvn test -pl sql/hive > -DwildcardSuites=org.apache.spark.sql.hive.client.VersionsSuite -Dtest=none > {code} > all tests pass, but there are error logs like the following: > {code:java} > Run completed in 17 minutes, 18 seconds. > Total number of tests run: 867 > Suites: completed 2, aborted 0 > Tests: succeeded 867, failed 0, canceled 0, ignored 1, pending 0 > All tests passed. > 15:04:02.407 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version > information not found in metastore. hive.metastore.schema.verification is not > enabled so recording the schema version 2.3.0 > 15:04:02.408 WARN org.apache.hadoop.hive.metastore.ObjectStore: > setMetaStoreSchemaVersion called but recording version is disabled: version = > 2.3.0, comment = Set by MetaStore yangjie01@0.2.30.21 > 15:04:02.441 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get > database default, returning NoSuchObjectException > 15:04:03.140 ERROR org.apache.spark.util.Utils: Uncaught exception in thread > shutdown-hook-0 > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$closeState$1(HiveClientImpl.scala:168) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:312) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:243) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:242) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:292) > at > org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:158) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) > at > org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994) > at > org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 15:04:03.141 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook > '$anon$2' failed, java.util.concurrent.ExecutionException: > java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:206) > at > org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95) > Caused by: java.lang.NoSuchMethodError: > org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; > at >
[jira] [Updated] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28560: Attachment: localShuffleReader.png > Optimize shuffle reader to local shuffle reader when smj converted to bhj in > adaptive execution > --- > > Key: SPARK-28560 > URL: https://issues.apache.org/jira/browse/SPARK-28560 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > Attachments: localShuffleReader.png > > > Implement a rule in the new adaptive execution framework introduced in > SPARK-23128. This rule is used to optimize the shuffle reader to local > shuffle reader when smj is converted to bhj in adaptive execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364079#comment-17364079 ] Yuming Wang commented on SPARK-28560: - This is very useful if there is data skew at the probe side. !localShuffleReader.png! > Optimize shuffle reader to local shuffle reader when smj converted to bhj in > adaptive execution > --- > > Key: SPARK-28560 > URL: https://issues.apache.org/jira/browse/SPARK-28560 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > Attachments: localShuffleReader.png > > > Implement a rule in the new adaptive execution framework introduced in > SPARK-23128. This rule is used to optimize the shuffle reader to local > shuffle reader when smj is converted to bhj in adaptive execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
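Editor's note: to observe this feature, a small sketch under stated assumptions: the two configs below are real (the local-reader flag defaults to true in Spark 3.x), while the data and table names are illustrative.
{code:scala}
// With AQE on, a plan that starts as a sort-merge join can be converted to a
// broadcast hash join at runtime; the leftover shuffle is then read "locally"
// per mapper instead of being fetched over the network, which also mitigates
// probe-side skew as noted in the comment above.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")

spark.range(10000000L).createOrReplaceTempView("big")
spark.range(100L).createOrReplaceTempView("small")
spark.sql("SELECT * FROM big JOIN small USING (id)").collect()
// The final adaptive plan (Spark UI) shows BroadcastHashJoin plus a local
// shuffle reader on the probe side.
{code}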
[jira] [Updated] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
[ https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27714: Target Version/s: (was: 3.0.0) > Support Join Reorder based on Genetic Algorithm when the # of joined tables > > 12 > > > Key: SPARK-27714 > URL: https://issues.apache.org/jira/browse/SPARK-27714 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xianyin Xin >Assignee: Xianyin Xin >Priority: Major > > The current join reorder logic is based on dynamic programming, which can > theoretically find the most optimized plan, but the search cost grows rapidly > as the # of joined tables grows. It would be better to introduce a genetic > algorithm (GA) to overcome this problem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
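Editor's note: for context, the DP-based reorder this proposal would complement is gated by existing CBO configs; the configs below are real and the threshold default is 12, matching the cutoff in the summary. The sketch only shows the knobs, not the proposed GA.
{code:scala}
// Existing DP-based join reorder (cost-based optimizer). dp.threshold caps how
// many joined relations the dynamic-programming search will consider; above it
// the search is skipped today. A GA search, as in PostgreSQL's GEQO, would
// explore the join-order space heuristically above that threshold instead.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.dp.threshold", "12")
{code}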
[jira] [Assigned] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing
[ https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-35321: --- Assignee: Chao Sun > Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift > API missing > --- > > Key: SPARK-35321 > URL: https://issues.apache.org/jira/browse/SPARK-35321 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API > {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. > This is called when creating a new {{Hive}} object: > {code} > private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { > conf = c; > if (doRegisterAllFns) { > registerAllFunctionsOnce(); > } > } > {code} > {{registerAllFunctionsOnce}} will reload all the permanent functions by > calling the {{get_all_functions}} API from the metastore. In Spark, we always > pass {{doRegisterAllFns}} as true, and this will cause failure: > {code} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) > at > org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) > ... 96 more > Caused by: org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) > {code} > It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} > since it loads Hive permanent functions directly from the HMS API. The Hive > {{FunctionRegistry}} is only used for loading Hive built-in functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing
[ https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-35321. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32887 [https://github.com/apache/spark/pull/32887] > Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift > API missing > --- > > Key: SPARK-35321 > URL: https://issues.apache.org/jira/browse/SPARK-35321 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API > {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. > This is called when creating a new {{Hive}} object: > {code} > private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { > conf = c; > if (doRegisterAllFns) { > registerAllFunctionsOnce(); > } > } > {code} > {{registerAllFunctionsOnce}} will reload all the permanent functions by > calling the {{get_all_functions}} API from the metastore. In Spark, we always > pass {{doRegisterAllFns}} as true, and this will cause failure: > {code} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) > at > org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) > ... 96 more > Caused by: org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) > {code} > It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} > since it loads Hive permanent functions directly from the HMS API. The Hive > {{FunctionRegistry}} is only used for loading Hive built-in functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35650) Coalesce small output files through AQE
Yuming Wang created SPARK-35650: --- Summary: Coalesce small output files through AQE Key: SPARK-35650 URL: https://issues.apache.org/jira/browse/SPARK-35650 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang Add a new API to support coalescing small output files through AQE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
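Editor's note: until such an API exists, the closest built-in behavior is AQE's post-shuffle partition coalescing. The configs below are real; the sketch only helps when the final stage reads a shuffle, which is presumably the gap the proposed API targets.
{code:scala}
// Coalesce small post-shuffle partitions so the final stage, and hence the
// output files it writes, is not fragmented into many tiny tasks/files.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
{code}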
[jira] [Assigned] (SPARK-34808) Removes outer join if it only has distinct on streamed side
[ https://issues.apache.org/jira/browse/SPARK-34808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-34808: --- Assignee: Yuming Wang > Removes outer join if it only has distinct on streamed side > --- > > Key: SPARK-34808 > URL: https://issues.apache.org/jira/browse/SPARK-34808 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > For example: > {code:scala} > spark.range(200L).selectExpr("id AS a").createTempView("t1") > spark.range(300L).selectExpr("id AS b").createTempView("t2") > spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true) > {code} > Current optimized plan: > {noformat} > == Optimized Logical Plan == > Aggregate [a#2L], [a#2L] > +- Project [a#2L] >+- Join LeftOuter, (a#2L = b#6L) > :- Project [id#0L AS a#2L] > : +- Range (0, 200, step=1, splits=Some(2)) > +- Project [id#4L AS b#6L] > +- Range (0, 300, step=1, splits=Some(2)) > {noformat} > Expected optimized plan: > {noformat} > == Optimized Logical Plan == > Aggregate [a#2L], [a#2L] > +- Project [id#0L AS a#2L] >+- Range (0, 200, step=1, splits=Some(2)) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34808) Removes outer join if it only has distinct on streamed side
[ https://issues.apache.org/jira/browse/SPARK-34808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-34808. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31908 [https://github.com/apache/spark/pull/31908] > Removes outer join if it only has distinct on streamed side > --- > > Key: SPARK-34808 > URL: https://issues.apache.org/jira/browse/SPARK-34808 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > For example: > {code:scala} > spark.range(200L).selectExpr("id AS a").createTempView("t1") > spark.range(300L).selectExpr("id AS b").createTempView("t2") > spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true) > {code} > Current optimized plan: > {noformat} > == Optimized Logical Plan == > Aggregate [a#2L], [a#2L] > +- Project [a#2L] >+- Join LeftOuter, (a#2L = b#6L) > :- Project [id#0L AS a#2L] > : +- Range (0, 200, step=1, splits=Some(2)) > +- Project [id#4L AS b#6L] > +- Range (0, 300, step=1, splits=Some(2)) > {noformat} > Expected optimized plan: > {noformat} > == Optimized Logical Plan == > Aggregate [a#2L], [a#2L] > +- Project [id#0L AS a#2L] >+- Range (0, 200, step=1, splits=Some(2)) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35571) tag v3.0.0 org.apache.spark.sql.catalyst.parser.AstBuilder import error
[ https://issues.apache.org/jira/browse/SPARK-35571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35571: Target Version/s: (was: 3.0.0) > tag v3.0.0 org.apache.spark.sql.catalyst.parser.AstBuilder import error > --- > > Key: SPARK-35571 > URL: https://issues.apache.org/jira/browse/SPARK-35571 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: geekyouth >Priority: Major > > org.apache.spark.sql.catalyst.parser.AstBuilder: > https://github.com/apache/spark/blob/v3.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala > line 36: > `import org.apache.spark.sql.catalyst.parser.SqlBaseParser._` > SqlBaseParser does not exist in pkg `org.apache.spark.sql.catalyst.parser` > https://github.com/apache/spark/tree/v3.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser > also line 54: > SqlBaseBaseVisitor is not imported and the file cannot compile -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35568) UnsupportedOperationException: WholeStageCodegen (3) does not implement doExecuteBroadcast
[ https://issues.apache.org/jira/browse/SPARK-35568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354174#comment-17354174 ] Yuming Wang commented on SPARK-35568: - Thank you [~dongjoon]. > UnsupportedOperationException: WholeStageCodegen (3) does not implement > doExecuteBroadcast > -- > > Key: SPARK-35568 > URL: https://issues.apache.org/jira/browse/SPARK-35568 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:scala} > sql( > """ > |SELECT s.store_id, f.product_id > |FROM (SELECT DISTINCT * FROM fact_sk) f > | JOIN (SELECT > | *, > | ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY > state_province DESC) AS rn > |FROM dim_store) s > | ON f.store_id = s.store_id > |WHERE s.country = 'DE' AND s.rn = 1 > |""".stripMargin).show > {code} > {noformat} > Caused by: java.lang.UnsupportedOperationException: WholeStageCodegen (3) > does not implement doExecuteBroadcast > at > org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:297) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:323) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:192) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:217) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:214) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:188) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35568) UnsupportedOperationException: WholeStageCodegen (3) does not implement doExecuteBroadcast
[ https://issues.apache.org/jira/browse/SPARK-35568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35568: Description: How to reproduce: {code:scala} sql( """ |SELECT s.store_id, f.product_id |FROM (SELECT DISTINCT * FROM fact_sk) f | JOIN (SELECT | *, | ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY state_province DESC) AS rn |FROM dim_store) s | ON f.store_id = s.store_id |WHERE s.country = 'DE' AND s.rn = 1 |""".stripMargin).show {code} {noformat} Caused by: java.lang.UnsupportedOperationException: WholeStageCodegen (3) does not implement doExecuteBroadcast at org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:297) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:323) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:192) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:217) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:214) at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:188) {noformat} was: {code:sql} sql( """ |SELECT s.store_id, f.product_id |FROM (SELECT DISTINCT * FROM fact_sk) f | JOIN (SELECT | *, | ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY state_province DESC) AS rn |FROM dim_store) s | ON f.store_id = s.store_id |WHERE s.country = 'DE' AND s.rn = 1 |""".stripMargin).show {code} {noformat} Caused by: java.lang.UnsupportedOperationException: WholeStageCodegen (3) does not implement doExecuteBroadcast at org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:297) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:323) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:192) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:217) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:214) at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:188) {noformat} > UnsupportedOperationException: WholeStageCodegen (3) does not implement > doExecuteBroadcast > -- > > Key: SPARK-35568 > URL: https://issues.apache.org/jira/browse/SPARK-35568 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:scala} > sql( > """ > |SELECT s.store_id, f.product_id > |FROM (SELECT DISTINCT * FROM fact_sk) f > | JOIN (SELECT > | *, > | ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY > state_province DESC) AS rn > |FROM dim_store) s > | ON f.store_id = s.store_id > |WHERE s.country = 'DE' AND s.rn = 1 > |""".stripMargin).show > {code} > {noformat} > Caused by: java.lang.UnsupportedOperationException: WholeStageCodegen (3) > does not implement doExecuteBroadcast > at > org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:297) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:323) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:192) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:217) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:214) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:188) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35568) UnsupportedOperationException: WholeStageCodegen (3) does not implement doExecuteBroadcast
Yuming Wang created SPARK-35568: --- Summary: UnsupportedOperationException: WholeStageCodegen (3) does not implement doExecuteBroadcast Key: SPARK-35568 URL: https://issues.apache.org/jira/browse/SPARK-35568 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang {code:sql} sql( """ |SELECT s.store_id, f.product_id |FROM (SELECT DISTINCT * FROM fact_sk) f | JOIN (SELECT | *, | ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY state_province DESC) AS rn |FROM dim_store) s | ON f.store_id = s.store_id |WHERE s.country = 'DE' AND s.rn = 1 |""".stripMargin).show {code} {noformat} Caused by: java.lang.UnsupportedOperationException: WholeStageCodegen (3) does not implement doExecuteBroadcast at org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:297) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:323) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:192) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:217) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:214) at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:188) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35441) InMemoryFileIndex loads all files into memory
[ https://issues.apache.org/jira/browse/SPARK-35441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350242#comment-17350242 ] Yuming Wang commented on SPARK-35441: - Please increase your driver memory, or merge the small files if these 20,000,000 files contain many small ones. > InMemoryFileIndex loads all files into memory > > > Key: SPARK-35441 > URL: https://issues.apache.org/jira/browse/SPARK-35441 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Zhang Jianguo >Priority: Minor > > When InMemoryFileIndex is initialized, it loads all files under ${rootPath}. > If there are thousands of files, this can lead to OOM. > > https://github.com/apache/spark/blob/0b3758e8cdb3eaa9d55ce3b41ecad5fa01567343/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L66 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
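Editor's note: on the "merge small files" suggestion, a minimal compaction sketch; the paths and target partition count are illustrative, not from the issue.
{code:scala}
// Compact a directory of many small Parquet files into fewer, larger ones.
// Reducing the file count directly shrinks the listing that InMemoryFileIndex
// must hold on the driver.
val src = "/data/events"          // hypothetical input path
val dst = "/data/events_compact"  // hypothetical output path
spark.read.parquet(src)
  .repartition(200)               // target file count; size files to ~128MB-1GB
  .write.mode("overwrite").parquet(dst)
{code}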
[jira] [Updated] (SPARK-35494) Timestamp casting performance issue when invoked with timezone
[ https://issues.apache.org/jira/browse/SPARK-35494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35494: Target Version/s: (was: 2.4.9) > Timestamp casting performance issue when invoked with timezone > -- > > Key: SPARK-35494 > URL: https://issues.apache.org/jira/browse/SPARK-35494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 2.4.8 >Reporter: Raphael Luta >Priority: Major > > In Spark SQL 2.4, when converting a datetime string column to timestamp with > cast or to_timestamp, we have noticed a major performance issue when the > source string contains timezone information (for example > 2021-05-24T00:00:00+02:00) > This simple benchmark illustrates the difference: > {noformat} > OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 10.16 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Timestamp Conversion: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative > -- > no tz 1368 / 1379 3,1 326,1 1,0X > with UTC tz 5940 / 5947 0,7 1416,2 0,2X > with hours tz 5940 / 5962 0,7 1416,2 0,2X > {noformat} > With the provided patch, the benchmark results are: > {noformat} > OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 10.16 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Timestamp Conversion: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative > > no tz 1383 / 1392 3,0 329,7 1,0X > with UTC tz 1550 / 1587 2,7 369,6 0,9X > with hours tz 1570 / 1574 2,7 374,2 0,9X > {noformat} > This does not occur on the 3.x branches -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
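Editor's note: for readers who want to check their own 2.4 build, a rough repro sketch of the comparison; the row count is illustrative and this is not the benchmark harness that produced the tables above.
{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

val noTz   = spark.range(4000000L).select(lit("2021-05-24 00:00:00").as("s"))
val withTz = spark.range(4000000L).select(lit("2021-05-24T00:00:00+02:00").as("s"))

// count($"t") forces the string-to-timestamp cast to run for every row
// instead of being pruned away by the optimizer.
spark.time(noTz.select(to_timestamp($"s").as("t")).agg(count($"t")).show())
spark.time(withTz.select(to_timestamp($"s").as("t")).agg(count($"t")).show())
{code}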
[jira] [Commented] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join
[ https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349930#comment-17349930 ] Yuming Wang commented on SPARK-32291: - We can use localCheckpoint to work around this issue: {code:scala} spark.range(100).createTempView("t1") spark.range(200).createTempView("t2") spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") spark.sql("select t1.* from t1 join t2 on (t1.id = t2.id)").localCheckpoint().coalesce(1).show() {code} > COALESCE should not reduce the child parallelism if it is Join > -- > > Key: SPARK-32291 > URL: https://issues.apache.org/jira/browse/SPARK-32291 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: COALESCE.png, coalesce.png, repartition.png > > > How to reproduce this issue: > {code:scala} > spark.range(100).createTempView("t1") > spark.range(200).createTempView("t2") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = > t2.id)").show > {code} > The dag is: > !COALESCE.png! > A real case: > !coalesce.png! > !repartition.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35244) invoke should throw the original exception
[ https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-35244: --- Assignee: Wenchen Fan (was: Apache Spark) > invoke should throw the original exception > -- > > Key: SPARK-35244 > URL: https://issues.apache.org/jira/browse/SPARK-35244 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35441) InMemoryFileIndex loads all files into memory
[ https://issues.apache.org/jira/browse/SPARK-35441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347404#comment-17347404 ] Yuming Wang commented on SPARK-35441: - What is your driver memory? > InMemoryFileIndex loads all files into memory > > > Key: SPARK-35441 > URL: https://issues.apache.org/jira/browse/SPARK-35441 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Zhang Jianguo >Priority: Minor > > When InMemoryFileIndex is initialized, it loads all files under ${rootPath}. > If there are thousands of files, this can lead to OOM. > > https://github.com/apache/spark/blob/0b3758e8cdb3eaa9d55ce3b41ecad5fa01567343/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L66 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35415) Change information to map type for SHOW TABLE EXTENDED command
Yuming Wang created SPARK-35415: --- Summary: Change information to map type for SHOW TABLE EXTENDED command Key: SPARK-35415 URL: https://issues.apache.org/jira/browse/SPARK-35415 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35286) Replace SessionState.start with SessionState.setCurrentSessionState
[ https://issues.apache.org/jira/browse/SPARK-35286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-35286. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32410 [https://github.com/apache/spark/pull/32410] > Replace SessionState.start with SessionState.setCurrentSessionState > --- > > Key: SPARK-35286 > URL: https://issues.apache.org/jira/browse/SPARK-35286 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > To avoid SessionState.createSessionDirs creating too many directories: > https://user-images.githubusercontent.com/5399861/116766834-28ea7080-aa5f-11eb-85ff-07bcaee444e5.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
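Editor's note: for context, a sketch of the distinction behind this change. Both calls are real Hive `SessionState` APIs; the behavioral claim about directory creation is taken from the issue description and linked PR rather than verified here.
{code:scala}
import org.apache.hadoop.hive.ql.session.SessionState

// SessionState.start(state) attaches the state to the current thread AND
// creates per-session scratch/tmp directories (createSessionDirs), which is
// what produced the directory explosion in the screenshot above.
// SessionState.setCurrentSessionState(state) only swaps the thread-local,
// so repeated calls do not touch the filesystem.
def attach(state: SessionState): Unit =
  SessionState.setCurrentSessionState(state)  // no createSessionDirs
{code}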
[jira] [Assigned] (SPARK-35286) Replace SessionState.start with SessionState.setCurrentSessionState
[ https://issues.apache.org/jira/browse/SPARK-35286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-35286: --- Assignee: Yuming Wang > Replace SessionState.start with SessionState.setCurrentSessionState > --- > > Key: SPARK-35286 > URL: https://issues.apache.org/jira/browse/SPARK-35286 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > To avoid SessionState.createSessionDirs creating too many directories: > https://user-images.githubusercontent.com/5399861/116766834-28ea7080-aa5f-11eb-85ff-07bcaee444e5.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35365) spark3.1.1 takes too long to analyze table fields
[ https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342981#comment-17342981 ] Yuming Wang commented on SPARK-35365: - {noformat} -- 2.4 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences 46101271113 / 47150195238 1415 / 2368 -- 3.1 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences 113500482546 / 115500482546 2300 / 4124 {noformat} > spark3.1.1 takes too long to analyze table fields > > > Key: SPARK-35365 > URL: https://issues.apache.org/jira/browse/SPARK-35365 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: yao >Priority: Major > Attachments: spark2.4report, spark3.1.1_report_originalsql, > spark3.11report > > > I have a big SQL query with a few wide-table joins and complex logic. When I > run it in Spark 2.4, it takes 20 minutes in the analyze phase; with Spark > 3.1.1, it takes about 40 minutes. > I need to set spark.sql.analyzer.maxIterations=1000 in spark3.1.1 > or spark.sql.optimizer.maxIterations=1000 in spark2.4; > there is no other special setting. > When I check the Spark UI, I find that no job is generated and all executors > have no active tasks. When I set the log level to debug, I find that the job > is in the analyze phase, analyzing field references. > This phase takes too long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35365) spark3.1.1 takes too long to analyze table fields
[ https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342354#comment-17342354 ] Yuming Wang commented on SPARK-35365: - [~xiaohua] Could you check which rule affects the performance, for example: {noformat} === Metrics of Analyzer/Optimizer Rules === Total number of runs: 3022 Total time: 7.941302436 seconds Rule Effective Time / Total Time Effective Runs / Total Runs org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations 3350202022 / 3357847817 7 / 39 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions 11946476 / 588567543 6 / 39 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences 516175887 / 577794974 15 / 39 org.apache.spark.sql.catalyst.analysis.TimeWindowing 0 / 519817133 0 / 39 org.apache.spark.sql.catalyst.analysis.DecimalPrecision 226306881 / 271650752 11 / 39 org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin 9838775 / 202214973 1 / 6 org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings 141138907 / 188596520 3 / 39 org.apache.spark.sql.catalyst.analysis.TypeCoercion$InConversion 107365436 / 185270852 3 / 39 org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts 58358334 / 140943690 3 / 39 org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator 0 / 119236169 0 / 39 org.apache.spark.sql.catalyst.optimizer.ColumnPruning 41291489 / 76464261 2 / 8 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowOrder 0 / 64775042 0 / 39 org.apache.spark.sql.catalyst.analysis.AlignViewOutput 0 / 61796761 0 / 39 org.apache.spark.sql.catalyst.analysis.TypeCoercion$BooleanEquality 0 / 58143331 0 / 39 {noformat} > spark3.1.1 takes too long to analyze table fields > --- > > Key: SPARK-35365 > URL: https://issues.apache.org/jira/browse/SPARK-35365 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: yao >Priority: Major > > I have a big SQL query with a few wide-table joins and complex logic. When I > run it in Spark 2.4, it takes 20 minutes in the analyze phase; with Spark > 3.1.1, it takes about 40 minutes. > I need to set spark.sql.analyzer.maxIterations=1000 in spark3.1.1 > or spark.sql.optimizer.maxIterations=1000 in spark2.4; > there is no other special setting. > When I check the Spark UI, I find that no job is generated and all executors > have no active tasks. When I set the log level to debug, I find that the job > is in the analyze phase, analyzing field references. > This phase takes too long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
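Editor's note: one way to collect those per-rule metrics yourself; the `RuleExecutor` API below exists in recent Spark versions, and the SQL string is a placeholder for the slow query.
{code:scala}
// Dump the "Metrics of Analyzer/Optimizer Rules" table for a single query.
import org.apache.spark.sql.catalyst.rules.RuleExecutor

RuleExecutor.resetMetrics()                       // clear accumulated timings
val slowSql = "SELECT ..."                        // placeholder for the big query
spark.sql(slowSql).queryExecution.optimizedPlan   // triggers analysis + optimization only
println(RuleExecutor.dumpTimeSpent())             // per-rule effective/total time
{code}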
[jira] [Created] (SPARK-35335) Improve CoalesceShufflePartitions to avoid generating small files
Yuming Wang created SPARK-35335: --- Summary: Improve CoalesceShufflePartitions to avoid generating small files Key: SPARK-35335 URL: https://issues.apache.org/jira/browse/SPARK-35335 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35273) CombineFilters support non-deterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35273: Description: For example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where id = 7 and rand() <= 0.01").explain("cost") {code} Current: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B) +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Expected: {noformat} == Optimized Logical Plan == Filter (rand(-639771619343876662) <= 0.01), Statistics(sizeInBytes=1.0 B) +- Filter ((NOT id#0 IN (1,3,6) AND isnotnull(id#0)) AND (id#0 = 7)), Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Another example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)") spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") {code} was: For example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where id = 7 and rand() <= 0.01").explain("cost") {code} Current: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B) +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Expected: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Another example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)") spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") {code} > CombineFilters support non-deterministic expressions > > > Key: SPARK-35273 > URL: https://issues.apache.org/jira/browse/SPARK-35273 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > For example: > {code:scala} > spark.sql("create table t1(id int) using parquet") > spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where > id = 7 and rand() <= 0.01").explain("cost") > {code} > Current: > {noformat} > == Optimized Logical Plan == > Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= > 0.01))), Statistics(sizeInBytes=1.0 B) > +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) >+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Filter (rand(-639771619343876662) <= 0.01), Statistics(sizeInBytes=1.0 B) > +- Filter ((NOT id#0 IN (1,3,6) AND isnotnull(id#0)) AND (id#0 = 7)), > Statistics(sizeInBytes=1.0 B) >+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) > {noformat} > Another example: > {code:scala} > spark.sql("create table t1(id int) using parquet") > spark.sql("create view v1 as select * from t1 where id not in (1, 3,
6)") > spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing
[ https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339944#comment-17339944 ] Yuming Wang commented on SPARK-35321: - Could we add a parameter to disable registerAllFunctionsOnce? https://issues.apache.org/jira/browse/HIVE-21563 > Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift > API missing > --- > > Key: SPARK-35321 > URL: https://issues.apache.org/jira/browse/SPARK-35321 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Chao Sun >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API > {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. > This is called when creating a new {{Hive}} object: > {code} > private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { > conf = c; > if (doRegisterAllFns) { > registerAllFunctionsOnce(); > } > } > {code} > {{registerAllFunctionsOnce}} will reload all the permanent functions by > calling the {{get_all_functions}} API from the metastore. In Spark, we always > pass {{doRegisterAllFns}} as true, and this will cause failure: > {code} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) > at > org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) > ... 96 more > Caused by: org.apache.thrift.TApplicationException: Invalid method name: > 'get_all_functions' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) > {code} > It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} > since it loads Hive permanent functions directly from the HMS API. The Hive > {{FunctionRegistry}} is only used for loading Hive built-in functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
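Editor's note: HIVE-21563 exposes exactly that switch as {{Hive.getWithoutRegisterFns}}. Since the method is absent from older Hive releases, a client would have to probe for it; a hedged sketch (the fallback logic is illustrative, not the Spark patch itself):
{code:scala}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.metadata.Hive

// Prefer getWithoutRegisterFns, which skips registerAllFunctionsOnce and hence
// the get_all_functions Thrift call; fall back on Hive versions that lack it.
def hiveClient(conf: HiveConf): Hive =
  try {
    classOf[Hive].getMethod("getWithoutRegisterFns", classOf[HiveConf])
      .invoke(null, conf).asInstanceOf[Hive]
  } catch {
    case _: NoSuchMethodException => Hive.get(conf)
  }
{code}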
[jira] [Resolved] (SPARK-35315) Keep benchmark result consistent between spark-submit and SBT
[ https://issues.apache.org/jira/browse/SPARK-35315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-35315. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32440 [https://github.com/apache/spark/pull/32440] > Keep benchmark result consistent between spark-submit and SBT > - > > Key: SPARK-35315 > URL: https://issues.apache.org/jira/browse/SPARK-35315 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Minor > Fix For: 3.2.0 > > > Currently benchmarks can be run in two ways: via {{spark-submit}} or an SBT > command. However, in the former, Spark misses some properties such as > {{IS_TESTING}}, which is used to turn on/off behavior like codegen. > Therefore, the results can differ between the two methods. In addition, the > benchmark GitHub workflow uses the {{spark-submit}} approach. > This proposes to set {{IS_TESTING}} to true in {{BenchmarkBase}} so that it is > always on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35315) Keep benchmark result consistent between spark-submit and SBT
[ https://issues.apache.org/jira/browse/SPARK-35315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-35315: --- Assignee: Chao Sun > Keep benchmark result consistent between spark-submit and SBT > - > > Key: SPARK-35315 > URL: https://issues.apache.org/jira/browse/SPARK-35315 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Minor > > Currently benchmarks can be run in two ways: via {{spark-submit}} or an SBT > command. However, in the former, Spark misses some properties such as > {{IS_TESTING}}, which is used to turn on/off behavior like codegen. > Therefore, the results can differ between the two methods. In addition, the > benchmark GitHub workflow uses the {{spark-submit}} approach. > This proposes to set {{IS_TESTING}} to true in {{BenchmarkBase}} so that it is > always on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35316) UnwrapCastInBinaryComparison support In/InSet predicate
[ https://issues.apache.org/jira/browse/SPARK-35316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35316: Description: It will not push down filters for In/InSet predicates: {code:scala} spark.range(50).selectExpr("cast(id as int) as id").write.mode("overwrite").parquet("/tmp/parquet/t1") spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain spark.read.parquet("/tmp/parquet/t1").where("id = 1L or id = 2L or id = 4L").explain {code} {noformat} == Physical Plan == *(1) Filter cast(id#5 as bigint) IN (1,2,4) +- *(1) ColumnarToRow +- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct == Physical Plan == *(1) Filter (((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4)) +- *(1) ColumnarToRow +- FileScan parquet [id#7] Batched: true, DataFilters: [(((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4))], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [Or(Or(EqualTo(id,1),EqualTo(id,2)),EqualTo(id,4))], ReadSchema: struct {noformat} was: In/InSet missing this optimization: {code:scala} spark.range(50).selectExpr("cast(id as int) as id").write.mode("overwrite").parquet("/tmp/parquet/t1") spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain spark.read.parquet("/tmp/parquet/t1").where("id = 1L or id = 2L or id = 4L").explain {code} {noformat} == Physical Plan == *(1) Filter cast(id#5 as bigint) IN (1,2,4) +- *(1) ColumnarToRow +- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct == Physical Plan == *(1) Filter (((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4)) +- *(1) ColumnarToRow +- FileScan parquet [id#7] Batched: true, DataFilters: [(((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4))], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [Or(Or(EqualTo(id,1),EqualTo(id,2)),EqualTo(id,4))], ReadSchema: struct {noformat} > UnwrapCastInBinaryComparison support In/InSet predicate > --- > > Key: SPARK-35316 > URL: https://issues.apache.org/jira/browse/SPARK-35316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > It will not push down filters for In/InSet predicates: > {code:scala} > spark.range(50).selectExpr("cast(id as int) as > id").write.mode("overwrite").parquet("/tmp/parquet/t1") > spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain > spark.read.parquet("/tmp/parquet/t1").where("id = 1L or id = 2L or id = > 4L").explain > {code} > {noformat} > == Physical Plan == > *(1) Filter cast(id#5 as bigint) IN (1,2,4) > +- *(1) ColumnarToRow >+- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as > bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > == Physical Plan == > *(1) Filter (((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4)) > +- *(1) ColumnarToRow >+- FileScan parquet [id#7] Batched: true, DataFilters: [(((id#7 = 1) OR > (id#7 = 2)) OR (id#7 = 4))], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: > 
[Or(Or(EqualTo(id,1),EqualTo(id,2)),EqualTo(id,4))], ReadSchema: > struct > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35316) UnwrapCastInBinaryComparison support In/InSet predicate
Yuming Wang created SPARK-35316: --- Summary: UnwrapCastInBinaryComparison support In/InSet predicate Key: SPARK-35316 URL: https://issues.apache.org/jira/browse/SPARK-35316 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang In/InSet is missing this optimization: {code:scala} spark.range(50).selectExpr("cast(id as int) as id").write.mode("overwrite").parquet("/tmp/parquet/t1") spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain spark.read.parquet("/tmp/parquet/t1").where("id = 1L or id = 2L or id = 4L").explain {code} {noformat} == Physical Plan == *(1) Filter cast(id#5 as bigint) IN (1,2,4) +- *(1) ColumnarToRow +- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct == Physical Plan == *(1) Filter (((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4)) +- *(1) ColumnarToRow +- FileScan parquet [id#7] Batched: true, DataFilters: [(((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4))], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [Or(Or(EqualTo(id,1),EqualTo(id,2)),EqualTo(id,4))], ReadSchema: struct {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
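Until the optimizer handles this, a hedged workaround sketch: write the IN-list literals in the column's own type so no cast is introduced and the filter can be pushed to Parquet. This assumes the same /tmp/parquet/t1 data as above; the exact PushedFilters rendering may vary by version.
{code:scala}
// Workaround sketch: int literals match the column type, so Spark does not
// wrap the column in a cast and the In filter can be pushed down.
spark.range(50).selectExpr("cast(id as int) as id")
  .write.mode("overwrite").parquet("/tmp/parquet/t1")

// "id in (1, 2, 4)" instead of "id in (1L, 2L, 4L)"
spark.read.parquet("/tmp/parquet/t1").where("id in (1, 2, 4)").explain
// Expected to show something like PushedFilters: [In(id, [1,2,4])]
{code}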
[jira] [Created] (SPARK-35286) Replace SessionState.start with SessionState.setCurrentSessionState
Yuming Wang created SPARK-35286: --- Summary: Replace SessionState.start with SessionState.setCurrentSessionState Key: SPARK-35286 URL: https://issues.apache.org/jira/browse/SPARK-35286 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang To avoid SessionState.createSessionDirs creating too many directories: https://user-images.githubusercontent.com/5399861/116766834-28ea7080-aa5f-11eb-85ff-07bcaee444e5.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
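For context, a minimal sketch of the difference, assuming Hive's org.apache.hadoop.hive.ql.session.SessionState API (these class and method names come from Hive, not Spark):
{code:scala}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

val state = new SessionState(new HiveConf())
// SessionState.start(state) attaches the state to the current thread AND
// creates per-session scratch/tmp directories, which is what piles up when
// sessions are created repeatedly.
// setCurrentSessionState only does the thread binding, so no directories
// are created:
SessionState.setCurrentSessionState(state)
{code}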
[jira] [Commented] (SPARK-35245) DynamicFilter pushdown not working
[ https://issues.apache.org/jira/browse/SPARK-35245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337807#comment-17337807 ] Yuming Wang commented on SPARK-35245: - This is because the filtering side does not have a selective predicate: https://github.com/apache/spark/blob/19c7d2f3d8cda8d9bc5dfc1a0bf5d46845b1bc2f/sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala#L193-L201 > DynamicFilter pushdown not working > -- > > Key: SPARK-35245 > URL: https://issues.apache.org/jira/browse/SPARK-35245 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: jean-claude >Priority: Minor > > > The pushed filters are always empty. `PushedFilters: []` > I was expecting the filters to be pushed down on the probe side of the join. > I am not sure how to configure this properly. For example, how do I set > fallbackFilterRatio? > spark = ( SparkSession.builder > .master('local') > .config("spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio", > 100) > .getOrCreate() > ) > > df = > spark.read.parquet('abfss://warehouse@/iceberg/opensource//data/timeperiod=2021-04-25/0-0-929b48ef-7ec3-47bd-b0a1-e9172c2dca6a-1.parquet') > df.createOrReplaceTempView('TMP') > > spark.sql(''' > explain cost > select > timeperiod, rrname > from > TMP > where > timeperiod in ( > select > TO_DATE(d, 'MM-dd-') AS timeperiod > from > values > ('01-01-2021'), > ('01-01-2021'), > ('01-01-2021') tbl(d) > ) > group by > timeperiod, rrname > ''').show(truncate=False) > == Optimized Logical Plan == > Aggregate [timeperiod#597, rrname#594], [timeperiod#597, rrname#594], > Statistics(sizeInBytes=69.0 MiB) > +- Join LeftSemi, (timeperiod#597 = timeperiod#669), > Statistics(sizeInBytes=69.0 MiB) > :- Project [rrname#594, timeperiod#597], Statistics(sizeInBytes=69.0 MiB) > : +- > Relation[count#591,time_first#592L,time_last#593L,rrname#594,rrtype#595,rdata#596,timeperiod#597] > parquet, Statistics(sizeInBytes=198.5 MiB) > +- LocalRelation [timeperiod#669], Statistics(sizeInBytes=36.0 B) > == Physical Plan == > *(2) HashAggregate(keys=[timeperiod#597, rrname#594], functions=[], > output=[timeperiod#597, rrname#594]) > +- Exchange hashpartitioning(timeperiod#597, rrname#594, 200), > ENSURE_REQUIREMENTS, [id=#839] > +- *(1) HashAggregate(keys=[timeperiod#597, rrname#594], functions=[], > output=[timeperiod#597, rrname#594]) > +- *(1) BroadcastHashJoin [timeperiod#597], [timeperiod#669], LeftSemi, > BuildRight, false > :- *(1) ColumnarToRow > : +- FileScan parquet [rrname#594,timeperiod#597] Batched: true, > DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[abfss://warehouse@/iceberg/opensource/..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, date, > true]),false), [id=#822] > +- LocalTableScan [timeperiod#669] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
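For context, a hedged illustration of the linked check: PartitionPruning only inserts a dynamic pruning subquery when the filtering (build) side carries a selective predicate, so a bare VALUES/LocalRelation build side does not qualify. The table and column names below are hypothetical.
{code:scala}
// Hypothetical shape the rule looks for: the dimension (filtering) side
// carries a selective filter, so pruning the fact side is judged beneficial.
spark.sql("""
  SELECT f.timeperiod, f.rrname
  FROM fact_table f
  JOIN dim_table d
    ON f.timeperiod = d.timeperiod
  WHERE d.category = 'dns'  -- selective predicate on the filtering side
""").explain(true)
{code}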
[jira] [Updated] (SPARK-35273) CombineFilters support non-deterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35273: Description: For example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where id = 7 and rand() <= 0.01").explain("cost") {code} Current: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B) +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Expected: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Another example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)") spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") {code} was: For example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where id = 7 and rand() <= 0.01").explain("cost") {code} Current: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B) +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Expected: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Another example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)") spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") {code} > CombineFilters support non-deterministic expressions > > > Key: SPARK-35273 > URL: https://issues.apache.org/jira/browse/SPARK-35273 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > For example: > {code:scala} > spark.sql("create table t1(id int) using parquet") > spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where > id = 7 and rand() <= 0.01").explain("cost") > {code} > Current: > {noformat} > == Optimized Logical Plan == > Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= > 0.01))), Statistics(sizeInBytes=1.0 B) > +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) >+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND > (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) > +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) > {noformat} > Another example: > {code:scala} > spark.sql("create table t1(id int) using parquet") > spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)") > spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") > {code} -- This message was sent by 
Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35273) CombineFilters support non-deterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35273: Description: For example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where id = 7 and rand() <= 0.01").explain("cost") {code} Current: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B) +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Expected: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Another example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)") spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") {code} was: For example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where id = 7 and rand() <= 0.01").explain("cost") {code} Current: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B) +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Expected: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Another example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)") spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") {code} > CombineFilters support non-deterministic expressions > > > Key: SPARK-35273 > URL: https://issues.apache.org/jira/browse/SPARK-35273 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > For example: > {code:scala} > spark.sql("create table t1(id int) using parquet") > spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where > id = 7 and rand() <= 0.01").explain("cost") > {code} > Current: > {noformat} > == Optimized Logical Plan == > Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= > 0.01))), Statistics(sizeInBytes=1.0 B) > +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) >+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND > (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) > +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) > {noformat} > Another example: > {code:scala} > spark.sql("create table t1(id int) using parquet") > spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)") > spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") > {code} -- This message was sent
by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35273) CombineFilters support non-deterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35273: Description: For example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where id = 7 and rand() <= 0.01").explain("cost") {code} Current: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B) +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Expected: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Another example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)") spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") {code} was: For example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where id = 7 and rand() <= 0.01").explain("cost") {code} Current: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B) +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Expected: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} > CombineFilters support non-deterministic expressions > > > Key: SPARK-35273 > URL: https://issues.apache.org/jira/browse/SPARK-35273 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > For example: > {code:scala} > spark.sql("create table t1(id int) using parquet") > spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where > id = 7 and rand() <= 0.01").explain("cost") > {code} > Current: > {noformat} > == Optimized Logical Plan == > Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= > 0.01))), Statistics(sizeInBytes=1.0 B) > +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) >+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND > (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) > +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) > {noformat} > Another example: > {code:scala} > spark.sql("create table t1(id int) using parquet") > spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)") > spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost") > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35273) CombineFilters support non-deterministic expressions
Yuming Wang created SPARK-35273: --- Summary: CombineFilters support non-deterministic expressions Key: SPARK-35273 URL: https://issues.apache.org/jira/browse/SPARK-35273 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang For example: {code:scala} spark.sql("create table t1(id int) using parquet") spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where id = 7 and rand() <= 0.01").explain("cost") {code} Current: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B) +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} Expected: {noformat} == Optimized Logical Plan == Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B) +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
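A simplified sketch of the idea (not the actual Spark rule): two adjacent filters can be merged even when the outer condition is non-deterministic, provided the inner condition is deterministic and stays on the left of the AND so the original evaluation order is preserved.
{code:scala}
import org.apache.spark.sql.catalyst.expressions.And
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object CombineFiltersSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Keep the inner (deterministic) condition first so rand() and friends
    // still see only rows that already passed the lower predicate.
    case Filter(outer, Filter(inner, child)) if inner.deterministic =>
      Filter(And(inner, outer), child)
  }
}
{code}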
[jira] [Created] (SPARK-35251) Improve LiveEntityHelpers.newAccumulatorInfos performance
Yuming Wang created SPARK-35251: --- Summary: Improve LiveEntityHelpers.newAccumulatorInfos performance Key: SPARK-35251 URL: https://issues.apache.org/jira/browse/SPARK-35251 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Yuming Wang [It|https://github.com/apache/spark/blob/3854ad87c78f2a331f9c9c1a34f9ec281900f8fe/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L646-L661] will impact performance if there are a lot of {{AccumulableInfo}} instances. {noformat} num #instances #bytes class name -- 1: 33189785050708371592 [C 2:79535824591381896 [J 3: 33083623410586759488 java.lang.String 4: 139726297 8942483008 org.apache.spark.sql.execution.metric.SQLMetric 5: 249040156 5976963744 scala.Some 6: 146278719 5851148760 org.apache.spark.util.AccumulatorMetadata 7: 38448113 5536528272 java.net.URI 8: 37540162 360382 org.apache.hadoop.fs.FileStatus 9: 69724130 3346758240 java.util.Hashtable$Entry 10: 61521559 2953034832 java.util.concurrent.ConcurrentHashMap$Node 11: 50421974 2823630544 scala.collection.mutable.LinkedEntry 12: 43349222 2774350208 org.apache.spark.scheduler.AccumulableInfo {noformat} {noformat} --- 15430388364 ns (2.03%), 1543 samples [ 0] scala.collection.TraversableLike.noneIn$1 [ 1] scala.collection.TraversableLike.filterImpl [ 2] scala.collection.TraversableLike.filterImpl$ [ 3] scala.collection.AbstractTraversable.filterImpl [ 4] scala.collection.TraversableLike.filter [ 5] scala.collection.TraversableLike.filter$ [ 6] scala.collection.AbstractTraversable.filter [ 7] org.apache.spark.status.LiveEntityHelpers$.newAccumulatorInfos [ 8] org.apache.spark.status.LiveTask.doUpdate [ 9] org.apache.spark.status.LiveEntity.write {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
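A hedged sketch of the kind of change this calls for (names and filter logic are illustrative, not the actual Spark patch): replace a chained filter, which allocates an intermediate collection on every task update, with a single pass that builds the result directly.
{code:scala}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.scheduler.AccumulableInfo

// Illustrative single-pass version: avoids the temporary collection that a
// chained filter produces when tasks carry many AccumulableInfo entries.
def newAccumulatorInfosSketch(accums: Iterable[AccumulableInfo]): Seq[AccumulableInfo] = {
  val out = new ArrayBuffer[AccumulableInfo]()
  accums.foreach { acc =>
    // keep only named, non-internal accumulators in one traversal
    if (acc.name.isDefined && !acc.internal) {
      out += acc
    }
  }
  out.toSeq
}
{code}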
[jira] [Commented] (SPARK-34897) Support reconcile schemas based on index after nested column pruning
[ https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331225#comment-17331225 ] Yuming Wang commented on SPARK-34897: - Issue resolved by pull request 31993 https://github.com/apache/spark/pull/31993 > Support reconcile schemas based on index after nested column pruning > > > Key: SPARK-34897 > URL: https://issues.apache.org/jira/browse/SPARK-34897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > How to reproduce this issue: > {code:scala} > spark.sql( > """ > |CREATE TABLE `t1` ( > | `_col0` INT, > | `_col1` STRING, > | `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>, > | `_col3` STRING) > |USING orc > |PARTITIONED BY (_col3) > |""".stripMargin) > spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')") > spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show > {code} > Error message: > {noformat} > java.lang.AssertionError: assertion failed: The given data schema > struct<_col0:int,_col2:struct> has less fields than the actual ORC > physical schema, no idea which columns were dropped, fail to read. Try to > disable > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34897) Support reconcile schemas based on index after nested column pruning
[ https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-34897. - Fix Version/s: 3.2.0 3.1.2 3.0.3 Assignee: Yuming Wang Resolution: Fixed > Support reconcile schemas based on index after nested column pruning > > > Key: SPARK-34897 > URL: https://issues.apache.org/jira/browse/SPARK-34897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > How to reproduce this issue: > {code:scala} > spark.sql( > """ > |CREATE TABLE `t1` ( > | `_col0` INT, > | `_col1` STRING, > | `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>, > | `_col3` STRING) > |USING orc > |PARTITIONED BY (_col3) > |""".stripMargin) > spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')") > spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show > {code} > Error message: > {noformat} > java.lang.AssertionError: assertion failed: The given data schema > struct<_col0:int,_col2:struct> has less fields than the actual ORC > physical schema, no idea which columns were dropped, fail to read. Try to > disable > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
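Until a release with the fix is picked up, a hedged workaround noted on this ticket is to disable nested schema pruning for such tables, at the cost of losing nested-column pruning for the session:
{code:scala}
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
{code}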
[jira] [Created] (SPARK-35203) Improve Repartition statistics estimation
Yuming Wang created SPARK-35203: --- Summary: Improve Repartition statistics estimation Key: SPARK-35203 URL: https://issues.apache.org/jira/browse/SPARK-35203 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35191) all columns are read even if column pruning applies when spark3.0 reads a table written by spark2.2
[ https://issues.apache.org/jira/browse/SPARK-35191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329932#comment-17329932 ] Yuming Wang commented on SPARK-35191: - Could you check if it works after https://github.com/apache/spark/pull/31993? > all columns are read even if column pruning applies when spark3.0 reads a > table written by spark2.2 > > > Key: SPARK-35191 > URL: https://issues.apache.org/jira/browse/SPARK-35191 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.0.0 > Environment: spark3.0 > spark.sql.hive.convertMetastoreOrc=true(default value in spark3.0) > spark.sql.orc.impl=native(default value in spark3.0) >Reporter: xiaoli >Priority: Major > > Before I address this issue, some background: the current spark version we > use is 2.2, and we plan to migrate to spark3.0 in the near future. Before > migrating, we tested some queries in both spark2.2 and spark3.0 to check for > potential issues. The data source tables of these queries are in ORC format, > written by spark2.2. > > I find that even if column pruning is applied, spark3.0's native reader reads > all columns. > > Then I did some remote debugging. In OrcUtils.scala's requestedColumnIds > method, it checks whether the field names start with "_col". In my case, the > field names do start with "_col", like "_col1", "_col2", so pruneCols is not > done. The code is below: > > if (orcFieldNames.forall(_.startsWith("_col"))) { > // This is a ORC file written by Hive, no field names in the physical > schema, assume the > // physical schema maps to the data scheme by index. > assert(orcFieldNames.length <= dataSchema.length, "The given data schema > " + > s"${dataSchema.catalogString} has less fields than the actual ORC > physical schema, " + > "no idea which columns were dropped, fail to read.") > // for ORC file written by Hive, no field names > // in the physical schema, there is a need to send the > // entire dataSchema instead of required schema. > // So pruneCols is not done in this case > Some(requiredSchema.fieldNames.map { name => > val index = dataSchema.fieldIndex(name) > if (index < orcFieldNames.length) { > index > } else { > -1 > } > }, false) > > Although this code comment explains the reason, I still do not understand it. > This issue only happens in this case: spark3.0 uses the native reader to read > a table written by spark2.2. > > In other cases, there is no such issue. I did another two tests: > Test1: use spark3.0's hive reader (running with > spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read > the same table: it only reads the pruned columns. > Test2: use spark3.0 to write a table, then use spark3.0's native reader to > read this new table: it only reads the pruned columns. > > This issue blocks us from using the native reader in spark3.0. Does anyone > know the underlying reason or have a solution? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
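For reference, the reporter's "Test1" workaround as a snippet: fall back to the Hive ORC reader, which does not go through the index-based _col* mapping in OrcUtils.requestedColumnIds. These configs usually need to be set before the table is first read in the session; the column name below is hypothetical.
{code:scala}
// Fall back to the Hive ORC reader (the reporter's Test1 configuration):
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.conf.set("spark.sql.orc.impl", "hive")
spark.table("testtable").select("some_column").explain
{code}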
[jira] [Created] (SPARK-35185) Improve Distinct statistics estimation
Yuming Wang created SPARK-35185: --- Summary: Improve Distinct statistics estimation Key: SPARK-35185 URL: https://issues.apache.org/jira/browse/SPARK-35185 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35121) Improve JoinSelection when join condition is not defined
[ https://issues.apache.org/jira/browse/SPARK-35121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35121: Summary: Improve JoinSelection when join condition is not defined (was: Improve JoinSelection when join condition is empty) > Improve JoinSelection when join condition is not defined > > > Key: SPARK-35121 > URL: https://issues.apache.org/jira/browse/SPARK-35121 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35121) Improve JoinSelection when join condition is empty
Yuming Wang created SPARK-35121: --- Summary: Improve JoinSelection when join condition is empty Key: SPARK-35121 URL: https://issues.apache.org/jira/browse/SPARK-35121 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35118) Propagate empty relation through Join if join condition is empty
Yuming Wang created SPARK-35118: --- Summary: Propagate empty relation through Join if join condition is empty Key: SPARK-35118 URL: https://issues.apache.org/jira/browse/SPARK-35118 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang {code:scala} spark.sql("create table t1 using parquet as select id as a, id as b from range(10)") spark.sql("create table t2 using parquet as select id as a, id as b from range(20)") spark.sql("select t1.a from t1 left semi join t2").explain(true) {code} Current plan: {noformat} == Optimized Logical Plan == Join LeftSemi :- Project [a#6L] : +- Relation default.t1[a#6L,b#7L] parquet +- Project +- Relation default.t2[a#8L,b#9L] parquet {noformat} Expected plan: {noformat} == Optimized Logical Plan == Project [a#6L] +- Relation default.t1[a#6L,b#7L] parquet {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
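A simplified sketch of the proposed rewrite (not the actual Spark rule): a LEFT SEMI join with no condition keeps every left row as long as the right side is provably non-empty, so the join can be dropped. Shown here only for a LocalRelation build side, where non-emptiness is easy to prove.
{code:scala}
import org.apache.spark.sql.catalyst.plans.LeftSemi
import org.apache.spark.sql.catalyst.plans.logical.{Join, LocalRelation, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object EliminateConditionlessLeftSemiSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // No join condition and a provably non-empty right side: every left row
    // has a match, so the semi join is a no-op.
    case Join(left, LocalRelation(_, data, _), LeftSemi, None, _) if data.nonEmpty =>
      left
  }
}
{code}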
[jira] [Updated] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-34087: Attachment: screenshot-1.png > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Assignee: wuyi >Priority: Major > Fix For: 3.2.0 > > Attachments: 1610451044690.jpg > > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session, > because a new ExecutionListenerBus instance is added to the AsyncEventQueue > every time we clone a session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-34087: Attachment: (was: screenshot-1.png) > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Assignee: wuyi >Priority: Major > Fix For: 3.2.0 > > Attachments: 1610451044690.jpg > > > In Spark-3.0.1, the memory leak occurs when we keep cloning the spark session, > because a new ExecutionListenerBus instance is added to the AsyncEventQueue > every time we clone a session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34897) Support reconcile schemas based on index after nested column pruning
[ https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-34897: Summary: Support reconcile schemas based on index after nested column pruning (was: The given data schema has less fields than the actual ORC physical schema) > Support reconcile schemas based on index after nested column pruning > > > Key: SPARK-34897 > URL: https://issues.apache.org/jira/browse/SPARK-34897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:scala} > spark.sql( > """ > |CREATE TABLE `t1` ( > | `_col0` INT, > | `_col1` STRING, > | `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>, > | `_col3` STRING) > |USING orc > |PARTITIONED BY (_col3) > |""".stripMargin) > spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')") > spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show > {code} > Error message: > {noformat} > java.lang.AssertionError: assertion failed: The given data schema > struct<_col0:int,_col2:struct> has less fields than the actual ORC > physical schema, no idea which columns were dropped, fail to read. Try to > disable > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-34897) Support reconcile schemas based on index after nested column pruning
[ https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-34897: Comment: was deleted (was: We can workaround this issue by setting spark.sql.optimizer.nestedSchemaPruning.enabled to false.) > Support reconcile schemas based on index after nested column pruning > > > Key: SPARK-34897 > URL: https://issues.apache.org/jira/browse/SPARK-34897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:scala} > spark.sql( > """ > |CREATE TABLE `t1` ( > | `_col0` INT, > | `_col1` STRING, > | `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>, > | `_col3` STRING) > |USING orc > |PARTITIONED BY (_col3) > |""".stripMargin) > spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')") > spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show > {code} > Error message: > {noformat} > java.lang.AssertionError: assertion failed: The given data schema > struct<_col0:int,_col2:struct> has less fields than the actual ORC > physical schema, no idea which columns were dropped, fail to read. Try to > disable > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35007) Spark 2.4.x version does not support numeric
[ https://issues.apache.org/jira/browse/SPARK-35007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-35007. - Resolution: Won't Fix > Spark 2.4.x version does not support numeric > > > Key: SPARK-35007 > URL: https://issues.apache.org/jira/browse/SPARK-35007 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.7 >Reporter: Sun BiaoBiao >Priority: Major > Original Estimate: 48h > Remaining Estimate: 48h > > As said in this PR: [https://github.com/apache/spark/pull/31891], > > it seems the key difference between DECIMAL and NUMERIC is: > {quote}implementation-defined decimal precision equal to or greater than the > value of the specified precision. > {quote} > Decimal can have _at least_ the specified precision (which means it can have > more) whereas numeric should have exactly the specified precision. > Spark's decimal satisfies both, so I think NUMERIC as a synonym of DECIMAL > makes sense. > > My code needs to use the NUMERIC type in spark 2.4. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35007) Spark 2.4.x version does not support numeric
[ https://issues.apache.org/jira/browse/SPARK-35007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35007: Target Version/s: (was: 2.4.8) > Spark 2.4.x version does not support numeric > > > Key: SPARK-35007 > URL: https://issues.apache.org/jira/browse/SPARK-35007 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.7 >Reporter: Sun BiaoBiao >Priority: Major > Original Estimate: 48h > Remaining Estimate: 48h > > As said in this PR: [https://github.com/apache/spark/pull/31891], > > it seems the key difference between DECIMAL and NUMERIC is: > {quote}implementation-defined decimal precision equal to or greater than the > value of the specified precision. > {quote} > Decimal can have _at least_ the specified precision (which means it can have > more) whereas numeric should have exactly the specified precision. > Spark's decimal satisfies both, so I think NUMERIC as a synonym of DECIMAL > makes sense. > > My code needs to use the NUMERIC type in spark 2.4. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
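For anyone stuck on 2.4.x, a hedged workaround: DECIMAL(p, s) carries the same semantics in Spark SQL, so the DDL can be rewritten without waiting for NUMERIC support. The table and column names below are hypothetical.
{code:scala}
// NUMERIC is not accepted by the 2.4.x parser; DECIMAL is equivalent:
spark.sql("CREATE TABLE amounts (price DECIMAL(18, 2)) USING parquet")
{code}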
[jira] [Commented] (SPARK-35010) nestedSchemaPruning causes issues when reading hive-generated ORC files
[ https://issues.apache.org/jira/browse/SPARK-35010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318347#comment-17318347 ] Yuming Wang commented on SPARK-35010: - Yes. It is an issue: https://github.com/apache/spark/pull/31993 > nestedSchemaPruning causes issues when reading hive-generated ORC files > -- > > Key: SPARK-35010 > URL: https://issues.apache.org/jira/browse/SPARK-35010 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Baohe Zhang >Priority: Critical > > In spark3, we have spark.sql.orc.impl=native and > spark.sql.optimizer.nestedSchemaPruning.enabled=true as the default settings. > And these would cause issues when querying struct fields of hive-generated orc > files. > For example, we got an error when running this query in spark3: > {code:java} > spark.table("testtable").filter(col("utc_date") === > "20210122").select(col("open_count.d35")).show(false) > {code} > The error is > {code:java} > Caused by: java.lang.AssertionError: assertion failed: The given data schema > struct>> has less fields than the > actual ORC physical schema, no idea which columns were dropped, fail to read. > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:153) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > > I think the reason is that we apply the
nestedSchemaPruning to the > dataSchema. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L75] > This nestedSchemaPruning not only prunes the unused fields of the struct, it > also prunes the unused columns. In my test, the dataSchema originally has 48 > columns, but after nested schema pruning, the dataSchema is pruned to 1 > column. This pruning results in an assertion error in > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L159] > because column pruning in hive-generated orc files is not supported. > This issue also seems related to the Hive version: we use Hive 1.2, which > doesn't store field names in the physical schema. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35002) Fix the java.net.BindException when testing with Github Action
[ https://issues.apache.org/jira/browse/SPARK-35002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35002: Summary: Fix the java.net.BindException when testing with Github Action (was: Try to fix the java.net.BindException when testing with Github Action) > Fix the java.net.BindException when testing with Github Action > -- > > Key: SPARK-35002 > URL: https://issues.apache.org/jira/browse/SPARK-35002 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > [info] org.apache.spark.sql.kafka010.producer.InternalKafkaProducerPoolSuite > *** ABORTED *** (282 milliseconds) > [info] java.net.BindException: Cannot assign requested address: Service > 'sparkDriver' failed after 100 retries (on a random free port)! Consider > explicitly setting the appropriate binding address for the service > 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the > correct binding address. > [info] at sun.nio.ch.Net.bind0(Native Method) > [info] at sun.nio.ch.Net.bind(Net.java:461) > [info] at sun.nio.ch.Net.bind(Net.java:453) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35002) Try to fix the java.net.BindException when testing with Github Action
Yuming Wang created SPARK-35002: --- Summary: Try to fix the java.net.BindException when testing with Github Action Key: SPARK-35002 URL: https://issues.apache.org/jira/browse/SPARK-35002 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.2.0 Reporter: Yuming Wang {noformat} [info] org.apache.spark.sql.kafka010.producer.InternalKafkaProducerPoolSuite *** ABORTED *** (282 milliseconds) [info] java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address. [info] at sun.nio.ch.Net.bind0(Native Method) [info] at sun.nio.ch.Net.bind(Net.java:461) [info] at sun.nio.ch.Net.bind(Net.java:453) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
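A hedged mitigation, following the advice embedded in the error message itself: pin the driver bind address so the service does not retry against an unassignable address. The app name below is hypothetical.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("bind-address-example")
  .config("spark.driver.bindAddress", "127.0.0.1")  // suggested by the error text
  .getOrCreate()
{code}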
[jira] [Resolved] (SPARK-34966) Avoid shuffle if join types do not match
[ https://issues.apache.org/jira/browse/SPARK-34966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-34966. - Resolution: Invalid https://github.com/apache/spark/blob/69aa727ff495f6698fe9b37e952dfaf36f1dd5eb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L504-L507 These produce the same result: import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, Literal, Murmur3Hash, Pmod} import org.apache.spark.sql.types.{IntegerType, LongType} println(Pmod(new Murmur3Hash(Seq(Literal(100.toShort))), Literal(10)).eval(EmptyRow)) println(Pmod(new Murmur3Hash(Seq(new Cast(Literal(100.toShort), IntegerType))), Literal(10)).eval(EmptyRow)) But these do not: println(Pmod(new Murmur3Hash(Seq(Literal(100))), Literal(10)).eval(EmptyRow)) println(Pmod(new Murmur3Hash(Seq(new Cast(Literal(100), LongType))), Literal(10)).eval(EmptyRow)) > Avoid shuffle if join types do not match > --- > > Key: SPARK-34966 > URL: https://issues.apache.org/jira/browse/SPARK-34966 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:scala} > spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1") > spark.sql("CREATE TABLE t1 using parquet clustered by (id) into 200 > buckets AS SELECT cast(id as bigint) FROM range(1000)") > spark.sql("CREATE TABLE t2 using parquet clustered by (id) into 200 > buckets AS SELECT cast(id as int) FROM range(500)") > spark.sql("select * from t1 join t2 on (t1.id = t2.id)").explain > {code} > Current: > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- SortMergeJoin [id#14L], [cast(id#15 as bigint)], Inner >:- Sort [id#14L ASC NULLS FIRST], false, 0 >: +- Filter isnotnull(id#14L) >: +- FileScan parquet default.t1[id#14L] Batched: true, DataFilters: > [isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark, > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct, SelectedBucketsCount: 200 out of 200 >+- Sort [cast(id#15 as bigint) ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(cast(id#15 as bigint), 200), > ENSURE_REQUIREMENTS, [id=#58] > +- Filter isnotnull(id#15) > +- FileScan parquet default.t2[id#15] Batched: true, DataFilters: > [isnotnull(id#15)], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark, > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct > {noformat} > Expected: > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- SortMergeJoin [id#14L], [cast(id#15 as bigint)], Inner >:- Sort [id#14L ASC NULLS FIRST], false, 0 >: +- Filter isnotnull(id#14L) >: +- FileScan parquet default.t1[id#14L] Batched: true, DataFilters: > [isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark, > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct, SelectedBucketsCount: 200 out of 200 >+- Sort [cast(id#15 as bigint) ASC NULLS FIRST], false, 0 > +- Filter isnotnull(id#15) > +- FileScan parquet default.t2[id#15] Batched: true, DataFilters: > [isnotnull(id#15)], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark, > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct, SelectedBucketsCount: 200 out of 200 >
{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34967) Regression in spark 3.1.1 for window function and struct binding resolution
[ https://issues.apache.org/jira/browse/SPARK-34967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315564#comment-17315564 ] Yuming Wang commented on SPARK-34967: - How to reproduce this issue? > Regression in spark 3.1.1 for window function and struct binding resolution > --- > > Key: SPARK-34967 > URL: https://issues.apache.org/jira/browse/SPARK-34967 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Lukasz Z >Priority: Major > > https://github.com/zulk666/SparkTestProject/blob/main/src/main/scala/FailStructWithWindow.scala > When I execute this example code in spark 3.0.2 or 2.4.7 I receive a normal > result, but in spark 3.1.1 I get an exception. > This is struct selection after window function aggregation. > Exception in thread "main" > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: _gen_alias_32#32 at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306) at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at > scala.collection.TraversableLike.map(TraversableLike.scala:286) at > scala.collection.TraversableLike.map$(TraversableLike.scala:279) at > scala.collection.AbstractTraversable.map(Traversable.scala:108) at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35) > at > org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589) > at > org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586) > at >
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405) > at >
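For readers who do not want to open the linked repository, the following is a minimal sketch of the pattern the report describes: selecting a field out of a struct column alongside a window aggregation. It is an assumed reconstruction, not the contents of FailStructWithWindow.scala, and it may not reproduce the failure verbatim.

{code:scala}
// Hypothetical sketch of the reported pattern; the real reproducer is
// FailStructWithWindow.scala in the linked repository.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object StructWithWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (1, "b"), (2, "c"))
      .toDF("id", "value")
      .withColumn("s", struct($"id", $"value")) // build a struct column

    val w = Window.partitionBy($"id")
    // Select a struct field alongside a window aggregate; binding the
    // generated _gen_alias_ attribute is where 3.1.1 reportedly fails.
    df.select($"s".getField("value").as("v"), count(lit(1)).over(w).as("cnt"))
      .show()

    spark.stop()
  }
}
{code}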
[jira] [Created] (SPARK-34966) Avoid shuffle if join key types do not match
Yuming Wang created SPARK-34966:
-----------------------------------

             Summary: Avoid shuffle if join key types do not match
                 Key: SPARK-34966
                 URL: https://issues.apache.org/jira/browse/SPARK-34966
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Yuming Wang

How to reproduce this issue:
{code:scala}
spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1")
spark.sql("CREATE TABLE t1 using parquet clustered by (id) into 200 buckets AS SELECT cast(id as bigint) FROM range(1000)")
spark.sql("CREATE TABLE t2 using parquet clustered by (id) into 200 buckets AS SELECT cast(id as int) FROM range(500)")
spark.sql("select * from t1 join t2 on (t1.id = t2.id)").explain
{code}

Current:
{noformat}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#14L], [cast(id#15 as bigint)], Inner
   :- Sort [id#14L ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(id#14L)
   :     +- FileScan parquet default.t1[id#14L] Batched: true, DataFilters: [isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 200 out of 200
   +- Sort [cast(id#15 as bigint) ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(cast(id#15 as bigint), 200), ENSURE_REQUIREMENTS, [id=#58]
         +- Filter isnotnull(id#15)
            +- FileScan parquet default.t2[id#15] Batched: true, DataFilters: [isnotnull(id#15)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int>
{noformat}

Expected:
{noformat}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#14L], [cast(id#15 as bigint)], Inner
   :- Sort [id#14L ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(id#14L)
   :     +- FileScan parquet default.t1[id#14L] Batched: true, DataFilters: [isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 200 out of 200
   +- Sort [cast(id#15 as bigint) ASC NULLS FIRST], false, 0
      +- Filter isnotnull(id#15)
         +- FileScan parquet default.t2[id#15] Batched: true, DataFilters: [isnotnull(id#15)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int>, SelectedBucketsCount: 200 out of 200
{noformat}
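Until such an optimization lands, one workaround is to write the second bucketed table with the same key type (bigint), so bucket hashing matches on both sides and no Exchange is required. The sketch below assumes the t1 from the reproduction above already exists; the table name t2b is illustrative.

{code:scala}
// Workaround sketch (t2b is an illustrative name): bucket the second
// table by a bigint key so it matches t1's key type exactly.
spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1")
spark.sql("CREATE TABLE t2b using parquet clustered by (id) into 200 buckets " +
  "AS SELECT cast(id as bigint) AS id FROM range(500)")
// With matching key types, both scans should report SelectedBucketsCount
// and the plan should contain no Exchange node.
spark.sql("select * from t1 join t2b on (t1.id = t2b.id)").explain
{code}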
[jira] [Updated] (SPARK-33979) Filter predicate reorder
[ https://issues.apache.org/jira/browse/SPARK-33979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-33979:
--------------------------------
Description:
Reorder filter predicates to improve query performance:
{noformat}
others < In < Like < UDF/CaseWhen/If < InSet < LikeAny/LikeAll
{noformat}
[https://www.ibm.com/support/knowledgecenter/SSSHTQ_8.1.0/com.ibm.netcool_OMNIbus.doc_8.1.0/omnibus/wip/admin/reference/omn_adm_per_optimizationrules.html#omn_adm_per_optimizationrules__reorder]
[https://docs.oracle.com/en/database/oracle/oracle-database/21/addci/extensible-optimizer-interface.html#GUID-28A4EDA6-19DD-4773-B3B8-1802C3B01E21]
[https://docs.oracle.com/cd/B10501_01/server.920/a96533/hintsref.htm#13676]
https://issues.apache.org/jira/browse/HIVE-21857
https://issues.apache.org/jira/browse/IMPALA-2805

was:
Reorder filter predicates to improve query performance:
{noformat}
others < In < Like < UDF/CaseWhen/If < InSet < LikeAny/LikeAll
{noformat}
[https://www.ibm.com/support/knowledgecenter/SSSHTQ_8.1.0/com.ibm.netcool_OMNIbus.doc_8.1.0/omnibus/wip/admin/reference/omn_adm_per_optimizationrules.html#omn_adm_per_optimizationrules__reorder]
[https://docs.oracle.com/en/database/oracle/oracle-database/21/addci/extensible-optimizer-interface.html#GUID-28A4EDA6-19DD-4773-B3B8-1802C3B01E21]
[https://docs.oracle.com/cd/B10501_01/server.920/a96533/hintsref.htm#13676]
https://issues.apache.org/jira/browse/HIVE-21857

> Filter predicate reorder
> ------------------------
>
>                 Key: SPARK-33979
>                 URL: https://issues.apache.org/jira/browse/SPARK-33979
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Reorder filter predicates to improve query performance:
> {noformat}
> others < In < Like < UDF/CaseWhen/If < InSet < LikeAny/LikeAll
> {noformat}
> [https://www.ibm.com/support/knowledgecenter/SSSHTQ_8.1.0/com.ibm.netcool_OMNIbus.doc_8.1.0/omnibus/wip/admin/reference/omn_adm_per_optimizationrules.html#omn_adm_per_optimizationrules__reorder]
> [https://docs.oracle.com/en/database/oracle/oracle-database/21/addci/extensible-optimizer-interface.html#GUID-28A4EDA6-19DD-4773-B3B8-1802C3B01E21]
> [https://docs.oracle.com/cd/B10501_01/server.920/a96533/hintsref.htm#13676]
> https://issues.apache.org/jira/browse/HIVE-21857
> https://issues.apache.org/jira/browse/IMPALA-2805
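To make the proposed ranking concrete, here is a toy model of cost-based reordering of a conjunction, pasteable into spark-shell. The names below are illustrative stand-ins, not Catalyst's real Expression classes or the eventual optimizer rule.

{code:scala}
// Toy model of the proposed ordering; not Spark's optimizer rule.
sealed trait Pred
case object OtherPred      extends Pred  // e.g. plain comparisons
case object InPred         extends Pred
case object LikePred       extends Pred
case object UdfCaseIfPred  extends Pred  // UDF / CaseWhen / If
case object InSetPred      extends Pred
case object LikeAnyAllPred extends Pred  // LikeAny / LikeAll

// Lower rank = assumed cheaper to evaluate, so it runs first.
def rank(p: Pred): Int = p match {
  case OtherPred      => 0
  case InPred         => 1
  case LikePred       => 2
  case UdfCaseIfPred  => 3
  case InSetPred      => 4
  case LikeAnyAllPred => 5
}

// Reorder a conjunction so cheap predicates filter rows before
// expensive ones are evaluated.
def reorder(conjuncts: Seq[Pred]): Seq[Pred] = conjuncts.sortBy(rank)

// Example: reorder(Seq(LikeAnyAllPred, OtherPred, InPred))
//          returns Seq(OtherPred, InPred, LikeAnyAllPred)
{code}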
[jira] [Issue Comment Deleted] (SPARK-34931) CoarseGrainedExecutorBackend sends a wrong 'Reason' when an executor exits, leading to job failure
[ https://issues.apache.org/jira/browse/SPARK-34931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-34931:
--------------------------------
Comment: was deleted

(was: User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32028)

> CoarseGrainedExecutorBackend sends a wrong 'Reason' when an executor exits, leading to job failure
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34931
>                 URL: https://issues.apache.org/jira/browse/SPARK-34931
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: ice bai
>            Priority: Major
>
> When an executor is lost for some reason (e.g. it is unable to register with the external shuffle service), CoarseGrainedExecutorBackend sends a RemoveExecutor event with an 'ExecutorLossReason'. But this causes TaskSetManager to run handleFailedTask with exitCausedByApp=true, which is not correct.
[jira] [Issue Comment Deleted] (SPARK-34931) CoarseGrainedExecutorBackend sends a wrong 'Reason' when an executor exits, leading to job failure
[ https://issues.apache.org/jira/browse/SPARK-34931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-34931:
--------------------------------
Comment: was deleted

(was: User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32021)

> CoarseGrainedExecutorBackend sends a wrong 'Reason' when an executor exits, leading to job failure
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34931
>                 URL: https://issues.apache.org/jira/browse/SPARK-34931
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: ice bai
>            Priority: Major
>
> When an executor is lost for some reason (e.g. it is unable to register with the external shuffle service), CoarseGrainedExecutorBackend sends a RemoveExecutor event with an 'ExecutorLossReason'. But this causes TaskSetManager to run handleFailedTask with exitCausedByApp=true, which is not correct.
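For background, exitCausedByApp is the flag TaskSetManager uses to decide whether a failure counts against the application's task-failure limit. The toy sketch below uses illustrative names, not Spark's real ExecutorLossReason hierarchy, to show the distinction the report argues for: infrastructure-caused losses should carry exitCausedByApp=false.

{code:scala}
// Toy model only; Spark's actual loss-reason types live in
// org.apache.spark.scheduler and differ from these illustrative names.
sealed trait LossReason { def exitCausedByApp: Boolean }

// Infrastructure problem (e.g. cannot register with the external
// shuffle service): not the application's fault.
case object ShuffleServiceRegistrationFailed extends LossReason {
  val exitCausedByApp = false
}

// The user's code crashed the executor: rightly counted against the app.
case object UserCodeCrashed extends LossReason {
  val exitCausedByApp = true
}

// A scheduler honoring the flag would only increment task-failure
// counters for app-caused losses.
def countsTowardTaskFailures(r: LossReason): Boolean = r.exitCausedByApp
{code}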