[jira] [Updated] (SPARK-36093) The result is incorrect if the partition path case is inconsistent

2021-07-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-36093:

Labels: correctness  (was: )

> The result is incorrect if the partition path case is inconsistent
> ---
>
> Key: SPARK-36093
> URL: https://issues.apache.org/jira/browse/SPARK-36093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
>
> Please reproduce this issue using HDFS; it cannot be reproduced locally.
> {code:scala}
> sql("create table t1(cal_dt date) using parquet")
> sql("insert into t1 values (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
> sql("create view t1_v as select * from t1")
> sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
> sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'")
> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
> sql("SELECT * FROM t2 ").show
> {code}
> {noformat}
> // It should not be empty.
> scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
> +----+------+
> |FLAG|CAL_DT|
> +----+------+
> +----+------+
> scala> sql("SELECT * FROM t2 ").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   1|2021-06-27|
> |   1|2021-06-28|
> +----+----------+
> scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   2|2021-06-29|
> |   2|2021-06-30|
> +----+----------+
> {noformat}






[jira] [Created] (SPARK-36093) The result is incorrect if the partition path case is inconsistent

2021-07-12 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-36093:
---

 Summary: The result is incorrect if the partition path case is inconsistent
 Key: SPARK-36093
 URL: https://issues.apache.org/jira/browse/SPARK-36093
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


Please reproduce this issue using HDFS; it cannot be reproduced locally.
{code:scala}
sql("create table t1(cal_dt date) using parquet")
sql("insert into t1 values (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
sql("create view t1_v as select * from t1")
sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'")
sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
sql("SELECT * FROM t2 ").show
{code}


{noformat}
// It should not be empty.
scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
+----+------+
|FLAG|CAL_DT|
+----+------+
+----+------+


scala> sql("SELECT * FROM t2 ").show
+----+----------+
|FLAG|    CAL_DT|
+----+----------+
|   1|2021-06-27|
|   1|2021-06-28|
+----+----------+

scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
+----+----------+
|FLAG|    CAL_DT|
+----+----------+
|   2|2021-06-29|
|   2|2021-06-30|
+----+----------+
{noformat}
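
The summary suggests the two writes lay out the partition directories with different cases on HDFS. A minimal sketch (not part of the original report; the table location below is a placeholder) for inspecting the layout directly:

{code:scala}
// Sketch only: list the partition directories of t2 on the file system.
// Replace the hypothetical path below with the table's actual location.
import org.apache.hadoop.fs.Path

val tableLocation = new Path("hdfs:///user/hive/warehouse/t2")
val fs = tableLocation.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(tableLocation).foreach(s => println(s.getPath.getName))
// A case-inconsistent layout would look something like:
//   CAL_DT=2021-06-27
//   CAL_DT=2021-06-28
//   cal_dt=2021-06-29
//   cal_dt=2021-06-30
{code}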







[jira] [Created] (SPARK-36086) The case of the delta table is inconsistent with parquet

2021-07-12 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-36086:
---

 Summary: The case of the delta table is inconsistent with parquet
 Key: SPARK-36086
 URL: https://issues.apache.org/jira/browse/SPARK-36086
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Yuming Wang


How to reproduce this issue:

{noformat}
1. Add delta-core_2.12-1.0.0-SNAPSHOT.jar to ${SPARK_HOME}/jars.
2. bin/spark-shell --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
{noformat}

{code:scala}
spark.sql("create table t1 using parquet as select id, id as lower_id from range(5)")
spark.sql("CREATE VIEW v1 as SELECT * FROM t1")
spark.sql("CREATE TABLE t2 USING DELTA PARTITIONED BY (LOWER_ID) SELECT LOWER_ID, ID FROM v1")
spark.sql("CREATE TABLE t3 USING PARQUET PARTITIONED BY (LOWER_ID) SELECT LOWER_ID, ID FROM v1")

spark.sql("desc extended t2").show(false)
spark.sql("desc extended t3").show(false)
{code}
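
As a quick cross-check of the reported difference, a minimal sketch (not part of the original report) that prints only the column names each table exposes:

{code:scala}
// The expected values in the comments are taken from the DESC EXTENDED output below:
// the Delta table keeps the lower-case column names, the Parquet table the upper-case ones.
spark.table("t2").schema.fieldNames.foreach(println)  // expected: lower_id, id
spark.table("t3").schema.fieldNames.foreach(println)  // expected: ID, LOWER_ID
{code}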

{noformat}
scala> spark.sql("desc extended t2").show(false)
+----------------------------+---------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                  |comment|
+----------------------------+---------------------------------------------------------------------------+-------+
|lower_id                    |bigint                                                                     |       |
|id                          |bigint                                                                     |       |
|                            |                                                                           |       |
|# Partitioning              |                                                                           |       |
|Part 0                      |lower_id                                                                   |       |
|                            |                                                                           |       |
|# Detailed Table Information|                                                                           |       |
|Name                        |default.t2                                                                 |       |
|Location                    |file:/Users/yumwang/Downloads/spark-3.1.1-bin-hadoop2.7/spark-warehouse/t2|       |
|Provider                    |delta                                                                      |       |
|Table Properties            |[Type=MANAGED,delta.minReaderVersion=1,delta.minWriterVersion=2]           |       |
+----------------------------+---------------------------------------------------------------------------+-------+


scala> spark.sql("desc extended t3").show(false)
+----------------------------+----------------------------+-------+
|col_name                    |data_type                   |comment|
+----------------------------+----------------------------+-------+
|ID                          |bigint                      |null   |
|LOWER_ID                    |bigint                      |null   |
|# Partition Information     |                            |       |
|# col_name                  |data_type                   |comment|
|LOWER_ID                    |bigint                      |null   |
|                            |                            |       |
|# Detailed Table Information|                            |       |
|Database                    |default                     |       |
|Table                       |t3                          |       |
|Owner                       |yumwang                     |       |
|Created Time                |Mon Jul 12 14:07:16 CST 2021|       |
|Last Access                 |UNKNOWN                     |       |
|Created By                  |Spark 3.1.1                 |       |
|Type                        |MANAGED                     |       |
|Provider                    |PARQUET

[jira] [Created] (SPARK-36080) Broadcast join outer join stream side

2021-07-09 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-36080:
---

 Summary: Broadcast join outer join stream side
 Key: SPARK-36080
 URL: https://issues.apache.org/jira/browse/SPARK-36080
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang









[jira] [Updated] (SPARK-35991) Add PlanStability suite for TPCH

2021-07-06 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35991:

Affects Version/s: (was: 3.1.2)
   (was: 3.2.0)
   3.3.0

> Add PlanStability suite for TPCH
> 
>
> Key: SPARK-35991
> URL: https://issues.apache.org/jira/browse/SPARK-35991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Updated] (SPARK-35991) Add PlanStability suite for TPCH

2021-07-06 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35991:

Issue Type: Improvement  (was: Bug)

> Add PlanStability suite for TPCH
> 
>
> Key: SPARK-35991
> URL: https://issues.apache.org/jira/browse/SPARK-35991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Resolved] (SPARK-35908) Remove repartition if the child maximum number of rows less than or equal to 1

2021-07-01 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-35908.
-
Resolution: Not A Problem

> Remove repartition if the child maximum number of rows less than or equal to 1
> --
>
> Key: SPARK-35908
> URL: https://issues.apache.org/jira/browse/SPARK-35908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("select count(*) from range(1, 10, 2, 
> 2)").repartition(10).explain("cost")
> {code}
> Current optimized logical plan:
> {noformat}
> == Optimized Logical Plan ==
> Repartition 10, true, Statistics(sizeInBytes=16.0 B, rowCount=1)
> +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, 
> rowCount=1)
>+- Project, Statistics(sizeInBytes=20.0 B)
>   +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 
> B, rowCount=5)
> {noformat}
> Expected optimized logical plan:
> {noformat}
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, 
> rowCount=1)
> +- Project, Statistics(sizeInBytes=20.0 B)
>+- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, 
> rowCount=5)
> {noformat}






[jira] [Updated] (SPARK-35967) Update nullability based on column statistics

2021-07-01 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35967:

Summary: Update nullability based on column statistics  (was: Update 
nullability base on column statistics)

> Update nullability based on column statistics
> -
>
> Key: SPARK-35967
> URL: https://issues.apache.org/jira/browse/SPARK-35967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-35967) Update nullability base on column statistics

2021-07-01 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35967:
---

 Summary: Update nullability base on column statistics
 Key: SPARK-35967
 URL: https://issues.apache.org/jira/browse/SPARK-35967
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang









[jira] [Updated] (SPARK-35904) Collapse above RebalancePartitions

2021-06-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35904:

Fix Version/s: (was: 3.2.0)

> Collapse above RebalancePartitions
> --
>
> Key: SPARK-35904
> URL: https://issues.apache.org/jira/browse/SPARK-35904
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Make RebalancePartitions extend RepartitionOperation.






[jira] [Reopened] (SPARK-35904) Collapse above RebalancePartitions

2021-06-28 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reopened SPARK-35904:
-

Reverted at 
https://github.com/apache/spark/commit/108635af1708173a72bec0e36bf3f2cea5b088c4

> Collapse above RebalancePartitions
> --
>
> Key: SPARK-35904
> URL: https://issues.apache.org/jira/browse/SPARK-35904
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Make RebalancePartitions extend RepartitionOperation.






[jira] [Updated] (SPARK-35908) Remove repartition if the child maximum number of rows less than or equal to 1

2021-06-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35908:

Description: 
{code:scala}
spark.sql("select count(*) from range(1, 10, 2, 2)").repartition(10).explain("cost")
{code}

Current optimized logical plan:
{noformat}
== Optimized Logical Plan ==
Repartition 10, true, Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
   +- Project, Statistics(sizeInBytes=20.0 B)
      +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
{noformat}

Expected optimized logical plan:
{noformat}
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project, Statistics(sizeInBytes=20.0 B)
   +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
{noformat}


  was:
{code:scala}
spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost")
{code}

Current optimized logical plan:
{noformat}
== Optimized Logical Plan ==
Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B)
+- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
   +- Project, Statistics(sizeInBytes=20.0 B)
      +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
{noformat}

Expected optimized logical plan:
{noformat}
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project, Statistics(sizeInBytes=20.0 B)
   +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
{noformat}





> Remove repartition if the child maximum number of rows less than or equal to 1
> --
>
> Key: SPARK-35908
> URL: https://issues.apache.org/jira/browse/SPARK-35908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("select count(*) from range(1, 10, 2, 
> 2)").repartition(10).explain("cost")
> {code}
> Current optimized logical plan:
> {noformat}
> == Optimized Logical Plan ==
> Repartition 10, true, Statistics(sizeInBytes=16.0 B, rowCount=1)
> +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, 
> rowCount=1)
>+- Project, Statistics(sizeInBytes=20.0 B)
>   +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 
> B, rowCount=5)
> {noformat}
> Expected optimized logical plan:
> {noformat}
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, 
> rowCount=1)
> +- Project, Statistics(sizeInBytes=20.0 B)
>+- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, 
> rowCount=5)
> {noformat}






[jira] [Created] (SPARK-35908) Remove repartition if the child maximum number of rows less than or equal to 1

2021-06-26 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35908:
---

 Summary: Remove repartition if the child maximum number of rows 
less than or equal to 1
 Key: SPARK-35908
 URL: https://issues.apache.org/jira/browse/SPARK-35908
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


{code:scala}
spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost")
{code}

Current optimized logical plan:
{noformat}
== Optimized Logical Plan ==
Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B)
+- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
   +- Project, Statistics(sizeInBytes=20.0 B)
      +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
{noformat}

Expected optimized logical plan:
{noformat}
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project, Statistics(sizeInBytes=20.0 B)
   +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
{noformat}
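
As a rough illustration of the idea in the summary, a minimal sketch (not the actual Spark change) of an optimizer rule that drops a Repartition whose child can produce at most one row, based on LogicalPlan.maxRows:

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Repartition}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch only: remove Repartition when the child reports maxRows <= 1,
// since repartitioning at most one row cannot help.
object RemoveRepartitionOnSingleRowChild extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
    case Repartition(_, _, child) if child.maxRows.exists(_ <= 1) => child
  }
}
{code}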









[jira] [Created] (SPARK-35906) Remove order by if the maximum number of rows less than or equal to 1

2021-06-26 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35906:
---

 Summary: Remove order by if the maximum number of rows less than 
or equal to 1
 Key: SPARK-35906
 URL: https://issues.apache.org/jira/browse/SPARK-35906
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


{code:scala}
spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost")
{code}

Current optimized logical plan:
{noformat}
== Optimized Logical Plan ==
Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B)
+- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
   +- Project, Statistics(sizeInBytes=20.0 B)
      +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
{noformat}

Expected optimized logical plan:
{noformat}
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project, Statistics(sizeInBytes=20.0 B)
   +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
{noformat}
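
A quick way (not from the ticket) to check whether the optimizer currently keeps the Sort is to print the optimized logical plan of the single-row query directly:

{code:scala}
// Sketch only: with this improvement in place, no Sort node should remain,
// since the aggregate can produce at most one row.
val df = spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10")
println(df.queryExecution.optimizedPlan.numberedTreeString)
{code}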









[jira] [Updated] (SPARK-35904) Collapse above RebalancePartitions

2021-06-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35904:

Description: Make RebalancePartitions extend RepartitionOperation.

> Collapse above RebalancePartitions
> --
>
> Key: SPARK-35904
> URL: https://issues.apache.org/jira/browse/SPARK-35904
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Make RebalancePartitions extend RepartitionOperation.






[jira] [Created] (SPARK-35904) Collapse above RebalancePartitions

2021-06-25 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35904:
---

 Summary: Collapse above RebalancePartitions
 Key: SPARK-35904
 URL: https://issues.apache.org/jira/browse/SPARK-35904
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang









[jira] [Created] (SPARK-35886) Codegen issue for decimal type

2021-06-24 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35886:
---

 Summary: Codegen issue for decimal type
 Key: SPARK-35886
 URL: https://issues.apache.org/jira/browse/SPARK-35886
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


How to reproduce this issue:
{code:scala}
spark.sql(
  """
|CREATE TABLE t1 (
|  c1 DECIMAL(18,6),
|  c2 DECIMAL(18,6),
|  c3 DECIMAL(18,6))
|USING parquet;
|""".stripMargin)

spark.sql("SELECT sum(c1 * c3) + sum(c2 * c3) FROM t1").show
{code}


{noformat}
20:23:36.272 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 6: Expression "agg_exprIsNull_2_0" is not an rvalue
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 6: Expression "agg_exprIsNull_2_0" is not an rvalue
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12675)
    at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7676)
{noformat}
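
One way (not part of the original report) to confirm that the failure is specific to whole-stage code generation is to rerun the query with it disabled:

{code:scala}
// Sketch only: with whole-stage codegen turned off, the operators run without the
// fused generated code path, which helps isolate the codegen bug above.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql("SELECT sum(c1 * c3) + sum(c2 * c3) FROM t1").show
{code}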







[jira] [Assigned] (SPARK-34807) Push down filter through window after TransposeWindow

2021-06-23 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-34807:
---

Assignee: Tanel Kiis

> Push down filter through window after TransposeWindow
> -
>
> Key: SPARK-34807
> URL: https://issues.apache.org/jira/browse/SPARK-34807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Tanel Kiis
>Priority: Major
>
> {code:scala}
>   spark.range(10).selectExpr("id AS a", "id AS b", "id AS c", "id AS 
> d").createTempView("t1")
>   val df = spark.sql(
> """
>   |SELECT *
>   |  FROM (
>   |SELECT b,
>   |  sum(d) OVER (PARTITION BY a, b),
>   |  rank() OVER (PARTITION BY a ORDER BY c)
>   |FROM t1
>   |  ) v1
>   |WHERE b = 2
>   |""".stripMargin)
> {code}
> Current optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY 
> c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
> +- Filter (b#221L = 2)
>+- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST ROWS BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC NULLS FIRST]
>   +- Project [b#221L, a#220L, c#222L, sum(d) OVER (PARTITION BY a, b ROWS 
> BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#231L]
>  +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L]
> +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS 
> a#220L, id#218L AS c#222L]
>+- Range (0, 10, step=1, splits=Some(2))
> {noformat}
> Expected optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY 
> c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
> +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L]
>+- Project [b#221L, d#223L, a#220L, RANK() OVER (PARTITION BY a ORDER BY c 
> ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
>   +- Filter (b#221L = 2)
>  +- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC 
> NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST 
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC 
> NULLS FIRST]
> +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS 
> a#220L, id#218L AS c#222L]
>+- Range (0, 10, step=1, splits=Some(2))
> {noformat}






[jira] [Resolved] (SPARK-34807) Push down filter through window after TransposeWindow

2021-06-23 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34807.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31980
[https://github.com/apache/spark/pull/31980]

> Push down filter through window after TransposeWindow
> -
>
> Key: SPARK-34807
> URL: https://issues.apache.org/jira/browse/SPARK-34807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
>   spark.range(10).selectExpr("id AS a", "id AS b", "id AS c", "id AS 
> d").createTempView("t1")
>   val df = spark.sql(
> """
>   |SELECT *
>   |  FROM (
>   |SELECT b,
>   |  sum(d) OVER (PARTITION BY a, b),
>   |  rank() OVER (PARTITION BY a ORDER BY c)
>   |FROM t1
>   |  ) v1
>   |WHERE b = 2
>   |""".stripMargin)
> {code}
> Current optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY 
> c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
> +- Filter (b#221L = 2)
>+- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST ROWS BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC NULLS FIRST]
>   +- Project [b#221L, a#220L, c#222L, sum(d) OVER (PARTITION BY a, b ROWS 
> BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#231L]
>  +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L]
> +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS 
> a#220L, id#218L AS c#222L]
>+- Range (0, 10, step=1, splits=Some(2))
> {noformat}
> Expected optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY 
> c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
> +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L]
>+- Project [b#221L, d#223L, a#220L, RANK() OVER (PARTITION BY a ORDER BY c 
> ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
>   +- Filter (b#221L = 2)
>  +- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC 
> NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST 
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC 
> NULLS FIRST]
> +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS 
> a#220L, id#218L AS c#222L]
>+- Range (0, 10, step=1, splits=Some(2))
> {noformat}






[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems

2021-06-21 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35837:

Description: 
Teradata supports [Recommendations for Common Query 
Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].

We can implement a similar feature.
 1. Detect the most skewed values for a join. The user decides whether these are needed (a sketch follows this list).
 2. Detect the most skewed values for a window function. The user decides whether these are needed.
 3. Detect the bucketed read, for example when the Analyzer adds a cast to the bucket column.
 4. Recommend that the user add a filter condition on the partition column of the partitioned table.
 5. Check the join condition, for example, the result of casting a string to double may be incorrect.
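
For item 1, a minimal sketch (table and column names are taken from the example below; the query itself is not part of the proposal) of how the most skewed join keys could be computed:

{code:scala}
import org.apache.spark.sql.functions.desc

// Count rows per join key and surface the heaviest keys to the user.
spark.table("tbl5")
  .groupBy("leaf_categ_id", "item_site_id")
  .count()
  .orderBy(desc("count"))
  .show(5, truncate = false)
{code}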

For example:
{code:sql}
0: jdbc:hive2://localhost:1/default> EXPLAIN RECOMMENDATION
0: jdbc:hive2://localhost:1/default> SELECT a.*,
0: jdbc:hive2://localhost:1/default>CASE
0: jdbc:hive2://localhost:1/default>  WHEN ( NOT ( a.exclude = 1
0: jdbc:hive2://localhost:1/default>   AND a.cobrand = 6
0: jdbc:hive2://localhost:1/default>   AND 
a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
0: jdbc:hive2://localhost:1/default>   AND ( a.valid_page_count 
= 1 ) THEN 1
0: jdbc:hive2://localhost:1/default>  ELSE 0
0: jdbc:hive2://localhost:1/default>END AS is_singlepage,
0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *,
0: jdbc:hive2://localhost:1/default>'VI' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl1
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'SRP' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl2
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'My Garage' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl3
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'Motors 
Homepage' AS page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl4) t
0: jdbc:hive2://localhost:1/default> WHERE  session_start_dt 
BETWEEN ( '2020-01-01' ) AND (
0: jdbc:hive2://localhost:1/default>
 CURRENT_DATE() - 10 )) a
0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id,
0: jdbc:hive2://localhost:1/default>  item_site_id,
0: jdbc:hive2://localhost:1/default>  auct_end_dt,
0: jdbc:hive2://localhost:1/default>  leaf_categ_id
0: jdbc:hive2://localhost:1/default>   FROM   tbl5
0: jdbc:hive2://localhost:1/default>   WHERE  auct_end_dt 
>= ( '2020-01-01' )) itm
0: jdbc:hive2://localhost:1/default>   ON a.item_id = 
itm.item_id
0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca
0: jdbc:hive2://localhost:1/default>   ON itm.leaf_categ_id = 
ca.leaf_categ_id
0: jdbc:hive2://localhost:1/default>  AND itm.item_site_id 
= ca.site_id;
+-+--+
| result
  |
+-+--+
| 1. Detect the most skew values for join   
  |
|   Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND 
(cast(item_site_id#1444 as decimal(9,0)) = site_id#3022))  |
| table: tbl5   
  |
|   leaf_categ_id, item_site_id, count  
  |
|   171243, 0, 115412614
  |
|   176984, 3, 81003252

[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems

2021-06-21 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35837:

Description: 
Teradata supports [Recommendations for Common Query 
Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].

We can implement a similar feature.
 1. Detect the most skewed values for a join. The user decides whether these are needed.
 2. Detect the most skewed values for a window function. The user decides whether these are needed.
 3. Detect the bucketed read, for example when the Analyzer adds a cast to the bucket column.
 4. Recommend that the user add a filter condition on the partition column of the partitioned table.
 5. Check the join condition, for example, the result of casting a string to double may be incorrect.

For example:
{code:sql}
0: jdbc:hive2://localhost:1/default> EXPLAIN RECOMMENDATION
0: jdbc:hive2://localhost:1/default> SELECT a.*,
0: jdbc:hive2://localhost:1/default>CASE
0: jdbc:hive2://localhost:1/default>  WHEN ( NOT ( a.exclude = 1
0: jdbc:hive2://localhost:1/default>   AND a.cobrand = 6
0: jdbc:hive2://localhost:1/default>   AND 
a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
0: jdbc:hive2://localhost:1/default>   AND ( a.valid_page_count 
= 1 ) THEN 1
0: jdbc:hive2://localhost:1/default>  ELSE 0
0: jdbc:hive2://localhost:1/default>END AS is_singlepage,
0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *,
0: jdbc:hive2://localhost:1/default>'VI' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl1
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'SRP' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl2
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'My Garage' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl3
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'Motors 
Homepage' AS page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl4) t
0: jdbc:hive2://localhost:1/default> WHERE  session_start_dt 
BETWEEN ( '2020-01-01' ) AND (
0: jdbc:hive2://localhost:1/default>
 CURRENT_DATE() - 10 )) a
0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id,
0: jdbc:hive2://localhost:1/default>  item_site_id,
0: jdbc:hive2://localhost:1/default>  auct_end_dt,
0: jdbc:hive2://localhost:1/default>  leaf_categ_id
0: jdbc:hive2://localhost:1/default>   FROM   tbl5
0: jdbc:hive2://localhost:1/default>   WHERE  auct_end_dt 
>= ( '2020-01-01' )) itm
0: jdbc:hive2://localhost:1/default>   ON a.item_id = 
itm.item_id
0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca
0: jdbc:hive2://localhost:1/default>   ON itm.leaf_categ_id = 
ca.leaf_categ_id
0: jdbc:hive2://localhost:1/default>  AND itm.item_site_id 
= ca.site_id;
+-+--+
| result
  |
+-+--+
| 1. Detect the most skew values for join   
  |
|   Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND 
(cast(item_site_id#1444 as decimal(9,0)) = site_id#3022))  |
| table: tbl5   
  |
|   leaf_categ_id, item_site_id, count  
  |
|   171243, 0, 115412614
  |
|   176984, 3, 81003252

[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems

2021-06-21 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35837:

Description: 
Teradata supports [Recommendations for Common Query 
Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].

We can implement a similar feature.
 1. Detect the most skewed values for a join. The user decides whether these are needed.
 2. Detect the most skewed values for a window function. The user decides whether these are needed.
 3. Detect the bucketed read, for example when the Analyzer adds a cast to the bucket column.
 4. Recommend that the user add a filter condition on the partition column of the partitioned table.
 5. Check the join condition, for example, the result of casting a string to double may be incorrect.

For example:
{code:sql}
0: jdbc:hive2://localhost:1/default> EXPLAIN OPTIMIZE
0: jdbc:hive2://localhost:1/default> SELECT a.*,
0: jdbc:hive2://localhost:1/default>CASE
0: jdbc:hive2://localhost:1/default>  WHEN ( NOT ( a.exclude = 1
0: jdbc:hive2://localhost:1/default>   AND a.cobrand = 6
0: jdbc:hive2://localhost:1/default>   AND 
a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
0: jdbc:hive2://localhost:1/default>   AND ( a.valid_page_count 
= 1 ) THEN 1
0: jdbc:hive2://localhost:1/default>  ELSE 0
0: jdbc:hive2://localhost:1/default>END AS is_singlepage,
0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *,
0: jdbc:hive2://localhost:1/default>'VI' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl1
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'SRP' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl2
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'My Garage' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl3
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'Motors 
Homepage' AS page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl4) t
0: jdbc:hive2://localhost:1/default> WHERE  session_start_dt 
BETWEEN ( '2020-01-01' ) AND (
0: jdbc:hive2://localhost:1/default>
 CURRENT_DATE() - 10 )) a
0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id,
0: jdbc:hive2://localhost:1/default>  item_site_id,
0: jdbc:hive2://localhost:1/default>  auct_end_dt,
0: jdbc:hive2://localhost:1/default>  leaf_categ_id
0: jdbc:hive2://localhost:1/default>   FROM   tbl5
0: jdbc:hive2://localhost:1/default>   WHERE  auct_end_dt 
>= ( '2020-01-01' )) itm
0: jdbc:hive2://localhost:1/default>   ON a.item_id = 
itm.item_id
0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca
0: jdbc:hive2://localhost:1/default>   ON itm.leaf_categ_id = 
ca.leaf_categ_id
0: jdbc:hive2://localhost:1/default>  AND itm.item_site_id 
= ca.site_id;
+-+--+
| result
  |
+-+--+
| 1. Detect the most skew values for join   
  |
|   Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND 
(cast(item_site_id#1444 as decimal(9,0)) = site_id#3022))  |
| table: tbl5   
  |
|   leaf_categ_id, item_site_id, count  
  |
|   171243, 0, 115412614
  |
|   176984, 3, 81003252 

[jira] [Created] (SPARK-35837) Recommendations for Common Query Problems

2021-06-21 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35837:
---

 Summary: Recommendations for Common Query Problems
 Key: SPARK-35837
 URL: https://issues.apache.org/jira/browse/SPARK-35837
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


Teradata supports [Recommendations for Common Query 
Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].

We can implement a similar feature.
 1. Detect the most skewed values for a join. The user decides whether these are needed.
 2. Detect the most skewed values for a window function. The user decides whether these are needed.
 3. Detect whether the query can be optimized to a bucketed read, for example when the Analyzer adds a cast to the bucket column.
 4. Recommend that the user add a filter condition on the partition column of the partitioned table.
 5. Check the join condition, for example, the result of casting a string to double may be incorrect.

For example:
{code:sql}
0: jdbc:hive2://localhost:1/default> EXPLAIN OPTIMIZE
0: jdbc:hive2://localhost:1/default> SELECT a.*,
0: jdbc:hive2://localhost:1/default>CASE
0: jdbc:hive2://localhost:1/default>  WHEN ( NOT ( a.exclude = 1
0: jdbc:hive2://localhost:1/default>   AND a.cobrand = 6
0: jdbc:hive2://localhost:1/default>   AND 
a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
0: jdbc:hive2://localhost:1/default>   AND ( a.valid_page_count 
= 1 ) THEN 1
0: jdbc:hive2://localhost:1/default>  ELSE 0
0: jdbc:hive2://localhost:1/default>END AS is_singlepage,
0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *,
0: jdbc:hive2://localhost:1/default>'VI' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl1
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'SRP' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl2
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'My Garage' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl3
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'Motors 
Homepage' AS page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl4) t
0: jdbc:hive2://localhost:1/default> WHERE  session_start_dt 
BETWEEN ( '2020-01-01' ) AND (
0: jdbc:hive2://localhost:1/default>
 CURRENT_DATE() - 10 )) a
0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id,
0: jdbc:hive2://localhost:1/default>  item_site_id,
0: jdbc:hive2://localhost:1/default>  auct_end_dt,
0: jdbc:hive2://localhost:1/default>  leaf_categ_id
0: jdbc:hive2://localhost:1/default>   FROM   tbl5
0: jdbc:hive2://localhost:1/default>   WHERE  auct_end_dt 
>= ( '2020-01-01' )) itm
0: jdbc:hive2://localhost:1/default>   ON a.item_id = 
itm.item_id
0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca
0: jdbc:hive2://localhost:1/default>   ON itm.leaf_categ_id = 
ca.leaf_categ_id
0: jdbc:hive2://localhost:1/default>  AND itm.item_site_id 
= ca.site_id;
+-+--+
| result
  |
+-+--+
| 1. Detect the most skew values for join   
  |
|   Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND 
(cast(item_site_id#1444 as decimal(9,0)) = site_id#3022))  |
| table: tbl5   
  |
|   leaf_categ_id, item_site_id, count  
  |
|   171243, 0, 115412614 

[jira] [Commented] (SPARK-35797) patternToRegex failed when pattern is star

2021-06-19 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17366080#comment-17366080
 ] 

Yuming Wang commented on SPARK-35797:
-

How to reproduce this issue?

> patternToRegex failed when pattern is star
> --
>
> Key: SPARK-35797
> URL: https://issues.apache.org/jira/browse/SPARK-35797
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: Spark 3.1.1
>Reporter: YuanGuanhu
>Priority: Major
> Attachments: patternFailed.txt
>
>
> We are using Hue to connect to the Spark JDBCServer with Spark 3.1.1. When we query `select * from tb1`, we get the exception in the attached file.






[jira] [Assigned] (SPARK-34120) Improve the statistics estimation

2021-06-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-34120:
---

Assignee: Yuming Wang

> Improve the statistics estimation
> -
>
> Key: SPARK-34120
> URL: https://issues.apache.org/jira/browse/SPARK-34120
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> This umbrella ticket aims to track improvements in statistics estimation.






[jira] [Resolved] (SPARK-34120) Improve the statistics estimation

2021-06-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34120.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

> Improve the statistics estimation
> -
>
> Key: SPARK-34120
> URL: https://issues.apache.org/jira/browse/SPARK-34120
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> This umbrella ticket aims to track improvements in statistics estimation.






[jira] [Assigned] (SPARK-35185) Improve Distinct statistics estimation

2021-06-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-35185:
---

Assignee: Yuming Wang

> Improve Distinct statistics estimation
> --
>
> Key: SPARK-35185
> URL: https://issues.apache.org/jira/browse/SPARK-35185
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>







[jira] [Resolved] (SPARK-35185) Improve Distinct statistics estimation

2021-06-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-35185.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32291
[https://github.com/apache/spark/pull/32291]

> Improve Distinct statistics estimation
> --
>
> Key: SPARK-35185
> URL: https://issues.apache.org/jira/browse/SPARK-35185
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>







[jira] [Updated] (SPARK-35786) Support optimize repartition by expression in AQE

2021-06-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35786:

Parent: SPARK-35793
Issue Type: Sub-task  (was: Improvement)

> Support optimize repartition by expression in AQE
> -
>
> Key: SPARK-35786
> URL: https://issues.apache.org/jira/browse/SPARK-35786
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Major
>
> Currently, we only support using LocalShuffleReader for `REPARTITION`. In some cases, users also want to do this optimization with `REPARTITION_BY_COL`.






[jira] [Updated] (SPARK-30538) A not very elegant way to control output small file

2021-06-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30538:

Parent: SPARK-35793
Issue Type: Sub-task  (was: Improvement)

> A not very elegant way to control output small file
> 
>
> Key: SPARK-30538
> URL: https://issues.apache.org/jira/browse/SPARK-30538
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Updated] (SPARK-35335) Improve CoalesceShufflePartitions to avoid generating small files

2021-06-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35335:

Parent: SPARK-35793
Issue Type: Sub-task  (was: Improvement)

> Improve CoalesceShufflePartitions to avoid generating  small files
> --
>
> Key: SPARK-35335
> URL: https://issues.apache.org/jira/browse/SPARK-35335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Updated] (SPARK-35650) Coalesce small output files through AQE

2021-06-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35650:

Parent: SPARK-35793
Issue Type: Sub-task  (was: New Feature)

> Coalesce small output files through AQE
> ---
>
> Key: SPARK-35650
> URL: https://issues.apache.org/jira/browse/SPARK-35650
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Add a new API to support coalesce small output files through AQE.






[jira] [Updated] (SPARK-35725) Support repartition expand partitions in AQE

2021-06-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35725:

Parent Issue: SPARK-35793  (was: SPARK-33828)

> Support repartition expand partitions in AQE
> 
>
> Key: SPARK-35725
> URL: https://issues.apache.org/jira/browse/SPARK-35725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Major
>
> Currently, we don't support expanding partitions dynamically in AQE, which is not friendly for some data-skew jobs.
> Let's say we have a simple query:
> {code:java}
> SELECT * FROM table DISTRIBUTE BY col
> {code}
> The column `col` is skewed, so some shuffle partitions would handle much more data than others.
> If we haven't introduced an extra shuffle, we can optimize this case by expanding partitions in AQE.






[jira] [Updated] (SPARK-31264) Repartition by dynamic partition columns before insert table

2021-06-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31264:

Parent: SPARK-35793
Issue Type: Sub-task  (was: Improvement)

> Repartition by dynamic partition columns before insert table
> 
>
> Key: SPARK-31264
> URL: https://issues.apache.org/jira/browse/SPARK-31264
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-35793) Repartition before writing data source tables

2021-06-17 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35793:
---

 Summary: Repartition before writing data source tables
 Key: SPARK-35793
 URL: https://issues.apache.org/jira/browse/SPARK-35793
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


This umbrella ticket aims to track repartitioning before writing data source tables. It contains:
 # Repartition by the dynamic partition columns before writing dynamic partition tables (see the sketch below).
 # Repartition before writing normal tables to avoid generating too many small files.
 # Improve the local shuffle reader.
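
A minimal sketch (not from the ticket; `df`, the `dt` column, and the table names are placeholders) of the manual pattern that item 1 aims to automate:

{code:scala}
import org.apache.spark.sql.functions.col

// Repartition by the dynamic partition column before writing, so each dynamic
// partition is produced by as few tasks (and therefore as few files) as possible.
val df = spark.table("source_table")   // hypothetical source
df.repartition(col("dt"))
  .write
  .partitionBy("dt")
  .mode("overwrite")
  .saveAsTable("target_table")
{code}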






[jira] [Resolved] (SPARK-35556) Remove the close HiveClient's SessionState

2021-06-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-35556.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32693
[https://github.com/apache/spark/pull/32693]

> Remove the close HiveClient's SessionState
> --
>
> Key: SPARK-35556
> URL: https://issues.apache.org/jira/browse/SPARK-35556
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> There are some logs related to `java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;` when running the sql/hive module UTs.
> For example, when executing
> {code:java}
> mvn test -pl sql/hive -DwildcardSuites=org.apache.spark.sql.hive.client.VersionsSuite -Dtest=none
> {code}
> all the tests pass, but there will be some error logs like the following:
> {code:java}
> Run completed in 17 minutes, 18 seconds.
> Total number of tests run: 867
> Suites: completed 2, aborted 0
> Tests: succeeded 867, failed 0, canceled 0, ignored 1, pending 0
> All tests passed.
> 15:04:02.407 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version 
> information not found in metastore. hive.metastore.schema.verification is not 
> enabled so recording the schema version 2.3.0
> 15:04:02.408 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> setMetaStoreSchemaVersion called but recording version is disabled: version = 
> 2.3.0, comment = Set by MetaStore yangjie01@0.2.30.21
> 15:04:02.441 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
> database default, returning NoSuchObjectException
> 15:04:03.140 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
> shutdown-hook-0
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$closeState$1(HiveClientImpl.scala:168)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:312)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:243)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:242)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:292)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:158)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at scala.util.Try$.apply(Try.scala:213)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 15:04:03.141 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook 
> '$anon$2' failed, java.util.concurrent.ExecutionException: 
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> Caused by: java.lang.NoSuchMethodError: 
> 

[jira] [Assigned] (SPARK-35556) Remove the close HiveClient's SessionState

2021-06-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-35556:
---

Assignee: Yang Jie

> Remove the close HiveClient's SessionState
> --
>
> Key: SPARK-35556
> URL: https://issues.apache.org/jira/browse/SPARK-35556
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> There are some logs related to  `java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;`
>  when running the sql/hive module UTs.
> For example, when executing
> {code:java}
> mvn test -pl sql/hive 
> -DwildcardSuites=org.apache.spark.sql.hive.client.VersionsSuite -Dtest=none
> {code}
> all tests pass, but there are some error logs, as follows:
> {code:java}
> Run completed in 17 minutes, 18 seconds.
> Total number of tests run: 867
> Suites: completed 2, aborted 0
> Tests: succeeded 867, failed 0, canceled 0, ignored 1, pending 0
> All tests passed.
> 15:04:02.407 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version 
> information not found in metastore. hive.metastore.schema.verification is not 
> enabled so recording the schema version 2.3.0
> 15:04:02.408 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> setMetaStoreSchemaVersion called but recording version is disabled: version = 
> 2.3.0, comment = Set by MetaStore yangjie01@0.2.30.21
> 15:04:02.441 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
> database default, returning NoSuchObjectException
> 15:04:03.140 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
> shutdown-hook-0
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$closeState$1(HiveClientImpl.scala:168)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:312)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:243)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:242)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:292)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:158)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at scala.util.Try$.apply(Try.scala:213)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 15:04:03.141 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook 
> '$anon$2' failed, java.util.concurrent.ExecutionException: 
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at 
> 

[jira] [Updated] (SPARK-35556) Avoid log NoSuchMethodError when HiveClientImpl.state close

2021-06-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35556:

Component/s: (was: Tests)

>  Avoid log NoSuchMethodError when HiveClientImpl.state close
> 
>
> Key: SPARK-35556
> URL: https://issues.apache.org/jira/browse/SPARK-35556
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are some logs related to  `java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;`
>  when running the sql/hive module UTs.
> For example, when executing
> {code:java}
> mvn test -pl sql/hive 
> -DwildcardSuites=org.apache.spark.sql.hive.client.VersionsSuite -Dtest=none
> {code}
> all tests pass, but there are some error logs, as follows:
> {code:java}
> Run completed in 17 minutes, 18 seconds.
> Total number of tests run: 867
> Suites: completed 2, aborted 0
> Tests: succeeded 867, failed 0, canceled 0, ignored 1, pending 0
> All tests passed.
> 15:04:02.407 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version 
> information not found in metastore. hive.metastore.schema.verification is not 
> enabled so recording the schema version 2.3.0
> 15:04:02.408 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> setMetaStoreSchemaVersion called but recording version is disabled: version = 
> 2.3.0, comment = Set by MetaStore yangjie01@0.2.30.21
> 15:04:02.441 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
> database default, returning NoSuchObjectException
> 15:04:03.140 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
> shutdown-hook-0
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$closeState$1(HiveClientImpl.scala:168)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:312)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:243)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:242)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:292)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:158)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at scala.util.Try$.apply(Try.scala:213)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 15:04:03.141 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook 
> '$anon$2' failed, java.util.concurrent.ExecutionException: 
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at 
> 

[jira] [Updated] (SPARK-35556) Remove the close HiveClient's SessionState

2021-06-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35556:

Summary: Remove the close HiveClient's SessionState  (was:  Avoid log 
NoSuchMethodError when HiveClientImpl.state close)

> Remove the close HiveClient's SessionState
> --
>
> Key: SPARK-35556
> URL: https://issues.apache.org/jira/browse/SPARK-35556
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are some logs related to  `java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;`
>  when running the sql/hive module UTs.
> For example, when executing
> {code:java}
> mvn test -pl sql/hive 
> -DwildcardSuites=org.apache.spark.sql.hive.client.VersionsSuite -Dtest=none
> {code}
> all tests pass, but there are some error logs, as follows:
> {code:java}
> Run completed in 17 minutes, 18 seconds.
> Total number of tests run: 867
> Suites: completed 2, aborted 0
> Tests: succeeded 867, failed 0, canceled 0, ignored 1, pending 0
> All tests passed.
> 15:04:02.407 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version 
> information not found in metastore. hive.metastore.schema.verification is not 
> enabled so recording the schema version 2.3.0
> 15:04:02.408 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> setMetaStoreSchemaVersion called but recording version is disabled: version = 
> 2.3.0, comment = Set by MetaStore yangjie01@0.2.30.21
> 15:04:02.441 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
> database default, returning NoSuchObjectException
> 15:04:03.140 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
> shutdown-hook-0
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$closeState$1(HiveClientImpl.scala:168)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:312)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:243)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:242)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:292)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:158)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at scala.util.Try$.apply(Try.scala:213)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 15:04:03.141 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook 
> '$anon$2' failed, java.util.concurrent.ExecutionException: 
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File;
>   at 
> 

[jira] [Updated] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution

2021-06-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28560:

Attachment: localShuffleReader.png

> Optimize shuffle reader to local shuffle reader when smj converted to bhj in 
> adaptive execution
> ---
>
> Key: SPARK-28560
> URL: https://issues.apache.org/jira/browse/SPARK-28560
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: localShuffleReader.png
>
>
> Implement a rule in the new adaptive execution framework introduced in 
> SPARK-23128. This rule optimizes the shuffle reader to a local shuffle reader 
> when an SMJ (sort-merge join) is converted to a BHJ (broadcast hash join) in 
> adaptive execution.
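As a rough, hedged illustration (not part of the original description), the rule only applies when adaptive execution is enabled; the config names below are assumed from the Spark 3.0 settings:
{code:scala}
// Minimal sketch, assuming Spark 3.0+ config names: with AQE on, a sort-merge
// join whose build side turns out to be small enough is converted to a broadcast
// hash join at runtime, and this rule then lets the probe side read its shuffle
// output locally instead of fetching it over the network.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")
{code}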



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution

2021-06-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364079#comment-17364079
 ] 

Yuming Wang commented on SPARK-28560:
-

This is very useful if there is data skew at the probe side.
 !localShuffleReader.png! 

> Optimize shuffle reader to local shuffle reader when smj converted to bhj in 
> adaptive execution
> ---
>
> Key: SPARK-28560
> URL: https://issues.apache.org/jira/browse/SPARK-28560
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: localShuffleReader.png
>
>
> Implement a rule in the new adaptive execution framework introduced in 
> SPARK-23128. This rule optimizes the shuffle reader to a local shuffle reader 
> when an SMJ (sort-merge join) is converted to a BHJ (broadcast hash join) in 
> adaptive execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12

2021-06-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27714:

Target Version/s:   (was: 3.0.0)

> Support Join Reorder based on Genetic Algorithm when the # of joined tables > 
> 12
> 
>
> Key: SPARK-27714
> URL: https://issues.apache.org/jira/browse/SPARK-27714
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
>Priority: Major
>
> The current join reorder logic is based on dynamic programming, which can 
> theoretically find the optimal plan, but the search cost grows rapidly as the 
> # of joined tables grows. It would be better to introduce a genetic algorithm 
> (GA) to overcome this problem.
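For context, a hedged sketch of the existing DP-based reorder knobs (these are the current CBO settings; a GA-based search would target queries above the DP threshold):
{code:scala}
// Current CBO join-reorder settings; the DP-based search is only attempted when
// the number of joined tables does not exceed the threshold below (default 12),
// which is exactly the regime a GA-based reorder would extend.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.dp.threshold", "12")
{code}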



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing

2021-06-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-35321:
---

Assignee: Chao Sun

> Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift 
> API missing
> ---
>
> Key: SPARK-35321
> URL: https://issues.apache.org/jira/browse/SPARK-35321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API 
> {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. 
> This is called when creating a new {{Hive}} object:
> {code}
>   private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
> conf = c;
> if (doRegisterAllFns) {
>   registerAllFunctionsOnce();
> }
>   }
> {code}
> {{registerAllFunctionsOnce}} will reload all the permanent functions by 
> calling the {{get_all_functions}} API from the metastore. In Spark, we always 
> pass {{doRegisterAllFns}} as true, and this will cause failure:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
>   ... 96 more
> Caused by: org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
> {code}
> It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} 
> since it loads Hive permanent functions directly from the HMS API. The Hive 
> {{FunctionRegistry}} is only used for loading Hive built-in functions.
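For reference, a hedged sketch of a session configuration that hits this path (the metastore version and jars path are placeholders, not taken from the report):
{code:scala}
import org.apache.spark.sql.SparkSession

// Placeholder setup pointing Spark 3.x at an HMS 1.2.x service. Creating the
// Hive client in such a session triggers registerAllFunctionsOnce and therefore
// the unsupported get_all_functions Thrift call.
val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "1.2.2")
  .config("spark.sql.hive.metastore.jars", "/path/to/hive-1.2-client-jars/*") // hypothetical path
  .enableHiveSupport()
  .getOrCreate()
{code}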



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing

2021-06-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-35321.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32887
[https://github.com/apache/spark/pull/32887]

> Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift 
> API missing
> ---
>
> Key: SPARK-35321
> URL: https://issues.apache.org/jira/browse/SPARK-35321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API 
> {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. 
> This is called when creating a new {{Hive}} object:
> {code}
>   private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
> conf = c;
> if (doRegisterAllFns) {
>   registerAllFunctionsOnce();
> }
>   }
> {code}
> {{registerAllFunctionsOnce}} will reload all the permanent functions by 
> calling the {{get_all_functions}} API from the metastore. In Spark, we always 
> pass {{doRegisterAllFns}} as true, and this will cause failure:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
>   ... 96 more
> Caused by: org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
> {code}
> It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} 
> since it loads Hive permanent functions directly from the HMS API. The Hive 
> {{FunctionRegistry}} is only used for loading Hive built-in functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35650) Coalesce small output files through AQE

2021-06-04 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35650:
---

 Summary: Coalesce small output files through AQE
 Key: SPARK-35650
 URL: https://issues.apache.org/jira/browse/SPARK-35650
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


Add a new API to support coalescing small output files through AQE.
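As background, a hedged sketch of the existing AQE knobs that already coalesce small shuffle partitions (config names from Spark 3.0+); the proposed API would apply the same idea to the final output files:
{code:scala}
// Existing settings that merge small shuffle partitions at runtime; they do not
// control the size of files written out, which is what the new API would address.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
{code}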



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34808) Removes outer join if it only has distinct on streamed side

2021-05-31 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-34808:
---

Assignee: Yuming Wang

> Removes outer join if it only has distinct on streamed side
> ---
>
> Key: SPARK-34808
> URL: https://issues.apache.org/jira/browse/SPARK-34808
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> For example:
> {code:scala}
> spark.range(200L).selectExpr("id AS a").createTempView("t1")
> spark.range(300L).selectExpr("id AS b").createTempView("t2")
> spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true)
> {code}
> Current optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Aggregate [a#2L], [a#2L]
> +- Project [a#2L]
>+- Join LeftOuter, (a#2L = b#6L)
>   :- Project [id#0L AS a#2L]
>   :  +- Range (0, 200, step=1, splits=Some(2))
>   +- Project [id#4L AS b#6L]
>  +- Range (0, 300, step=1, splits=Some(2))
> {noformat}
> Expected optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Aggregate [a#2L], [a#2L]
> +- Project [id#0L AS a#2L]
>+- Range (0, 200, step=1, splits=Some(2))
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34808) Removes outer join if it only has distinct on streamed side

2021-05-31 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34808.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31908
[https://github.com/apache/spark/pull/31908]

> Removes outer join if it only has distinct on streamed side
> ---
>
> Key: SPARK-34808
> URL: https://issues.apache.org/jira/browse/SPARK-34808
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> For example:
> {code:scala}
> spark.range(200L).selectExpr("id AS a").createTempView("t1")
> spark.range(300L).selectExpr("id AS b").createTempView("t2")
> spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true)
> {code}
> Current optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Aggregate [a#2L], [a#2L]
> +- Project [a#2L]
>+- Join LeftOuter, (a#2L = b#6L)
>   :- Project [id#0L AS a#2L]
>   :  +- Range (0, 200, step=1, splits=Some(2))
>   +- Project [id#4L AS b#6L]
>  +- Range (0, 300, step=1, splits=Some(2))
> {noformat}
> Expected optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Aggregate [a#2L], [a#2L]
> +- Project [id#0L AS a#2L]
>+- Range (0, 200, step=1, splits=Some(2))
> {noformat}
>  
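A hedged way to check the rewrite from the user side, reusing the t1/t2 views from the example above (the string match on the plan text is an assumption about its rendered form):
{code:scala}
// After the rule applies, the optimized plan should no longer contain the left
// outer join, since only distinct values of the streamed (left) side are kept.
val optimized = spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b")
  .queryExecution.optimizedPlan
assert(!optimized.toString.contains("Join LeftOuter"))
{code}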



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35571) tag v3.0.0 org.apache.spark.sql.catalyst.parser.AstBuilder import error

2021-05-31 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35571:

Target Version/s:   (was: 3.0.0)

> tag v3.0.0 org.apache.spark.sql.catalyst.parser.AstBuilder import error
> ---
>
> Key: SPARK-35571
> URL: https://issues.apache.org/jira/browse/SPARK-35571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: geekyouth
>Priority: Major
>
> org.apache.spark.sql.catalyst.parser.AstBuilder:
> https://github.com/apache/spark/blob/v3.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
> line 36:
> `import org.apache.spark.sql.catalyst.parser.SqlBaseParser._`
> SqlBaseParser does not exist in the package `org.apache.spark.sql.catalyst.parser`
> https://github.com/apache/spark/tree/v3.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser
> Also, at line 54:
> SqlBaseBaseVisitor is not imported, so the file cannot compile.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35568) UnsupportedOperationException: WholeStageCodegen (3) does not implement doExecuteBroadcast

2021-05-30 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354174#comment-17354174
 ] 

Yuming Wang commented on SPARK-35568:
-

Thank you [~dongjoon].

> UnsupportedOperationException: WholeStageCodegen (3) does not implement 
> doExecuteBroadcast
> --
>
> Key: SPARK-35568
> URL: https://issues.apache.org/jira/browse/SPARK-35568
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:scala}
> sql(
>   """
> |SELECT s.store_id, f.product_id
> |FROM (SELECT DISTINCT * FROM fact_sk) f
> |  JOIN (SELECT
> |  *,
> |  ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY 
> state_province DESC) AS rn
> |FROM dim_store) s
> |   ON f.store_id = s.store_id
> |WHERE s.country = 'DE' AND s.rn = 1
> |""".stripMargin).show
> {code}
> {noformat}
> Caused by: java.lang.UnsupportedOperationException: WholeStageCodegen (3) 
> does not implement doExecuteBroadcast
>   at 
> org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:297)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:323)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:192)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:217)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:214)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:188)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35568) UnsupportedOperationException: WholeStageCodegen (3) does not implement doExecuteBroadcast

2021-05-30 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35568:

Description: 
How to reproduce:
{code:scala}
sql(
  """
|SELECT s.store_id, f.product_id
|FROM (SELECT DISTINCT * FROM fact_sk) f
|  JOIN (SELECT
|  *,
|  ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY 
state_province DESC) AS rn
|FROM dim_store) s
|   ON f.store_id = s.store_id
|WHERE s.country = 'DE' AND s.rn = 1
|""".stripMargin).show
{code}

{noformat}
Caused by: java.lang.UnsupportedOperationException: WholeStageCodegen (3) does 
not implement doExecuteBroadcast
at 
org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:297)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:323)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:192)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:217)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:214)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:188)
{noformat}


  was:
{code:sql}
sql(
  """
|SELECT s.store_id, f.product_id
|FROM (SELECT DISTINCT * FROM fact_sk) f
|  JOIN (SELECT
|  *,
|  ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY 
state_province DESC) AS rn
|FROM dim_store) s
|   ON f.store_id = s.store_id
|WHERE s.country = 'DE' AND s.rn = 1
|""".stripMargin).show
{code}

{noformat}
Caused by: java.lang.UnsupportedOperationException: WholeStageCodegen (3) does 
not implement doExecuteBroadcast
at 
org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:297)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:323)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:192)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:217)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:214)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:188)
{noformat}



> UnsupportedOperationException: WholeStageCodegen (3) does not implement 
> doExecuteBroadcast
> --
>
> Key: SPARK-35568
> URL: https://issues.apache.org/jira/browse/SPARK-35568
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:scala}
> sql(
>   """
> |SELECT s.store_id, f.product_id
> |FROM (SELECT DISTINCT * FROM fact_sk) f
> |  JOIN (SELECT
> |  *,
> |  ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY 
> state_province DESC) AS rn
> |FROM dim_store) s
> |   ON f.store_id = s.store_id
> |WHERE s.country = 'DE' AND s.rn = 1
> |""".stripMargin).show
> {code}
> {noformat}
> Caused by: java.lang.UnsupportedOperationException: WholeStageCodegen (3) 
> does not implement doExecuteBroadcast
>   at 
> org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:297)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:323)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:192)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:217)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:214)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:188)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35568) UnsupportedOperationException: WholeStageCodegen (3) does not implement doExecuteBroadcast

2021-05-30 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35568:
---

 Summary: UnsupportedOperationException: WholeStageCodegen (3) does 
not implement doExecuteBroadcast
 Key: SPARK-35568
 URL: https://issues.apache.org/jira/browse/SPARK-35568
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


{code:sql}
sql(
  """
|SELECT s.store_id, f.product_id
|FROM (SELECT DISTINCT * FROM fact_sk) f
|  JOIN (SELECT
|  *,
|  ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY 
state_province DESC) AS rn
|FROM dim_store) s
|   ON f.store_id = s.store_id
|WHERE s.country = 'DE' AND s.rn = 1
|""".stripMargin).show
{code}

{noformat}
Caused by: java.lang.UnsupportedOperationException: WholeStageCodegen (3) does 
not implement doExecuteBroadcast
at 
org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:297)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:323)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:192)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:217)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:214)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:188)
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35441) InMemoryFileIndex loads all files into memory

2021-05-23 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350242#comment-17350242
 ] 

Yuming Wang commented on SPARK-35441:
-

Please increase your driver memory, or merge small files if these 20,000,000 
files contain many small files.
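A hedged sketch of the second suggestion (the paths and partition count are placeholders): compact the input once so that a later InMemoryFileIndex has far fewer files to track.
{code:scala}
// Rewrite many small files into fewer, larger ones before scanning them again.
// Choose the repartition count from the total data size (e.g. ~128 MB per file).
spark.read.parquet("/data/many-small-files")          // placeholder input path
  .repartition(200)
  .write.mode("overwrite").parquet("/data/compacted") // placeholder output path
{code}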


> InMemoryFileIndex loads all files into memory
> 
>
> Key: SPARK-35441
> URL: https://issues.apache.org/jira/browse/SPARK-35441
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Zhang Jianguo
>Priority: Minor
>
> When InMemoryFileIndex is initialized, it loads all the files under ${rootPath}. 
> If there are thousands of files, this can lead to an OOM.
>  
> https://github.com/apache/spark/blob/0b3758e8cdb3eaa9d55ce3b41ecad5fa01567343/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35494) Timestamp casting performance issue when invoked with timezone

2021-05-23 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35494:

Target Version/s:   (was: 2.4.9)

> Timestamp casting performance issue when invoked with timezone
> --
>
> Key: SPARK-35494
> URL: https://issues.apache.org/jira/browse/SPARK-35494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 2.4.8
>Reporter: Raphael Luta
>Priority: Major
>
> In Spark SQL 2.4, when converting a datetime string column to timestamp with 
> cast or to_timestamp, we have noticed a major performance issue when the 
> source string contains timezone information (for example 
> 2021-05-24T00:00:00+02:00)
> This simple benchmark illustrates the difference:
> {noformat}
> OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 10.16
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Timestamp Conversion: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
> --
> no tz 1368 / 1379   3,1   326,1   1,0X
> with UTC tz   5940 / 5947   0,7   1416,2  0,2X
> with hours tz 5940 / 5962   0,7   1416,2  0,2X
> {noformat}
> With the provided patch, the benchmark results are:
> {noformat}
> OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 10.16
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Timestamp Conversion: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
> 
> no tz 1383 / 1392   3,0   329,7   1,0X
> with UTC tz   1550 / 1587   2,7   369,6   0,9X
> with hours tz 1570 / 1574   2,7   374,2   0,9X
> {noformat}
>  This does not occur on the 3.x branches 
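For illustration only (the column name and pattern are assumptions, not from the report), the kind of conversion the benchmark exercises looks like this in a spark-shell session:
{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

// Strings carrying an explicit offset; this is the "with tz" case that regresses
// on the 2.4 branch according to the benchmark above.
val df = Seq("2021-05-24T00:00:00+02:00").toDF("ts")
df.select(to_timestamp($"ts", "yyyy-MM-dd'T'HH:mm:ssXXX").as("parsed")).show(false)
{code}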



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join

2021-05-22 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349930#comment-17349930
 ] 

Yuming Wang commented on SPARK-32291:
-

We can use localCheckpoint to workaround this issue:
{code:scala}
spark.range(100).createTempView("t1")
spark.range(200).createTempView("t2")
spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
spark.sql("select t1.* from t1 join t2 on (t1.id = 
t2.id)").localCheckpoint().coalesce(1).show()
{code}

> COALESCE should not reduce the child parallelism if it is Join
> --
>
> Key: SPARK-32291
> URL: https://issues.apache.org/jira/browse/SPARK-32291
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: COALESCE.png, coalesce.png, repartition.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.range(100).createTempView("t1")
> spark.range(200).createTempView("t2")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
> t2.id)").show
> {code}
> The dag is:
>  !COALESCE.png! 
> A real case:
>  !coalesce.png! 
>  !repartition.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35244) invoke should throw the original exception

2021-05-21 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-35244:
---

Assignee: Wenchen Fan  (was: Apache Spark)

> invoke should throw the original exception
> --
>
> Key: SPARK-35244
> URL: https://issues.apache.org/jira/browse/SPARK-35244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35441) InMemoryFileIndex loads all files into memory

2021-05-19 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347404#comment-17347404
 ] 

Yuming Wang commented on SPARK-35441:
-

What is your driver memory?

> InMemoryFileIndex loads all files into memory
> 
>
> Key: SPARK-35441
> URL: https://issues.apache.org/jira/browse/SPARK-35441
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Zhang Jianguo
>Priority: Minor
>
> When InMemoryFileIndex is initialized, it loads all the files under ${rootPath}. 
> If there are thousands of files, this can lead to an OOM.
>  
> https://github.com/apache/spark/blob/0b3758e8cdb3eaa9d55ce3b41ecad5fa01567343/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35415) Change information to map type for SHOW TABLE EXTENDED command

2021-05-16 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35415:
---

 Summary: Change information to map type for SHOW TABLE EXTENDED 
command
 Key: SPARK-35415
 URL: https://issues.apache.org/jira/browse/SPARK-35415
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35286) Replace SessionState.start with SessionState.setCurrentSessionState

2021-05-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-35286.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32410
[https://github.com/apache/spark/pull/32410]

> Replace SessionState.start with SessionState.setCurrentSessionState
> ---
>
> Key: SPARK-35286
> URL: https://issues.apache.org/jira/browse/SPARK-35286
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> To avoid SessionState.createSessionDirs creating too many directories:
> https://user-images.githubusercontent.com/5399861/116766834-28ea7080-aa5f-11eb-85ff-07bcaee444e5.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35286) Replace SessionState.start with SessionState.setCurrentSessionState

2021-05-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-35286:
---

Assignee: Yuming Wang

> Replace SessionState.start with SessionState.setCurrentSessionState
> ---
>
> Key: SPARK-35286
> URL: https://issues.apache.org/jira/browse/SPARK-35286
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> To avoid SessionState.createSessionDirs creating too many directories:
> https://user-images.githubusercontent.com/5399861/116766834-28ea7080-aa5f-11eb-85ff-07bcaee444e5.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35365) spark3.1.1 use too long time to analyze table fields

2021-05-11 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342981#comment-17342981
 ] 

Yuming Wang commented on SPARK-35365:
-

{noformat}
-- 2.4
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences   
  46101271113 / 47150195238   1415 / 2368 
-- 3.1
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
113500482546 / 1155004825462300 / 4124
{noformat}

> spark3.1.1 use too long time to analyze table fields
> 
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
> Attachments: spark2.4report, spark3.1.1_report_originalsql, 
> spark3.11report
>
>
> I have a big SQL query that joins a few wide tables with complex logic. When I 
> run it in Spark 2.4, the analyze phase takes 20 minutes; when I use Spark 
> 3.1.1, it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1,
> or spark.sql.optimizer.maxIterations=1000 in Spark 2.4.
> There is no other special setting for this.
> When I check the Spark UI, I find that no job is generated and all executors 
> have no active tasks. When I set the log level to debug, I find that the job 
> is in the analyze phase, resolving field references.
> This phase takes too long.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35365) spark3.1.1 use too long to analyze table fields

2021-05-11 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342354#comment-17342354
 ] 

Yuming Wang commented on SPARK-35365:
-

[~xiaohua] Could you check which rule affects the performance? For example:
{noformat}
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 3022
Total time: 7.941302436 seconds

Rule
  Effective Time / Total Time Effective Runs / Total Runs   
 

org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
  3350202022 / 3357847817 7 / 39
 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions
  11946476 / 5885675436 / 39
 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences   
  516175887 / 577794974   15 / 39   
 
org.apache.spark.sql.catalyst.analysis.TimeWindowing
  0 / 519817133   0 / 39
 
org.apache.spark.sql.catalyst.analysis.DecimalPrecision 
  226306881 / 271650752   11 / 39   
 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin  
  9838775 / 202214973 1 / 6 
 
org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings  
  141138907 / 188596520   3 / 39
 
org.apache.spark.sql.catalyst.analysis.TypeCoercion$InConversion
  107365436 / 185270852   3 / 39
 
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts   
  58358334 / 140943690   3 / 39
 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator
  0 / 119236169   0 / 39
 
org.apache.spark.sql.catalyst.optimizer.ColumnPruning   
  41291489 / 76464261 2 / 8 
 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowOrder  
  0 / 64775042   0 / 39
 
org.apache.spark.sql.catalyst.analysis.AlignViewOutput  
  0 / 61796761   0 / 39
 
org.apache.spark.sql.catalyst.analysis.TypeCoercion$BooleanEquality 
  0 / 58143331   0 / 39
{noformat}
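One hedged way to produce such a report from a spark-shell session (RuleExecutor is a Catalyst-internal helper, so treat this as illustrative; substitute the real statement for the placeholder query):
{code:scala}
import org.apache.spark.sql.catalyst.rules.RuleExecutor

RuleExecutor.resetMetrics()
// Placeholder query: run the slow statement here so that its analysis and
// optimization time is captured, then dump the per-rule metrics.
spark.sql("SELECT 1").queryExecution.optimizedPlan
println(RuleExecutor.dumpTimeSpent())
{code}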


> spark3.1.1 use too long to analyze table fields
> ---
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
>
> I have a big SQL query that joins a few wide tables with complex logic. When I 
> run it in Spark 2.4, the analyze phase takes 20 minutes; when I use Spark 
> 3.1.1, it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1,
> or spark.sql.optimizer.maxIterations=1000 in Spark 2.4.
> There is no other special setting for this.
> When I check the Spark UI, I find that no job is generated and all executors 
> have no active tasks. When I set the log level to debug, I find that the job 
> is in the analyze phase, resolving field references.
> This phase takes too long.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35335) Improve CoalesceShufflePartitions to avoid generating small files

2021-05-07 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35335:
---

 Summary: Improve CoalesceShufflePartitions to avoid generating  
small files
 Key: SPARK-35335
 URL: https://issues.apache.org/jira/browse/SPARK-35335
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35273) CombineFilters support non-deterministic expressions

2021-05-06 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35273:

Description: 
For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter ((rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6) and isnotnull(id#0) AND ((id#0 = 7) , 
Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}


Another example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)")
spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
{code}


  was:
For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
(rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}


Another example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)")
spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
{code}



> CombineFilters support non-deterministic expressions
> 
>
> Key: SPARK-35273
> URL: https://issues.apache.org/jira/browse/SPARK-35273
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> For example:
> {code:scala}
> spark.sql("create table t1(id int) using parquet")
> spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
> id = 7 and rand() <= 0.01").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
> 0.01))), Statistics(sizeInBytes=1.0 B)
> +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
>+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Filter ((rand(-639771619343876662) <= 0.01))), Statistics(sizeInBytes=1.0 B)
> +- Filter NOT id#0 IN (1,3,6) and isnotnull(id#0) AND ((id#0 = 7) , 
> Statistics(sizeInBytes=1.0 B)
>+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
> {noformat}
> Another example:
> {code:scala}
> spark.sql("create table t1(id int) using parquet")
> spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)")
> spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35321) Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift API missing

2021-05-05 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339944#comment-17339944
 ] 

Yuming Wang commented on SPARK-35321:
-

Could we add a parameter to disable registerAllFunctionsOnce? 
https://issues.apache.org/jira/browse/HIVE-21563

> Spark 3.x can't talk to HMS 1.2.x and lower due to get_all_functions Thrift 
> API missing
> ---
>
> Key: SPARK-35321
> URL: https://issues.apache.org/jira/browse/SPARK-35321
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-10319 introduced a new API 
> {{get_all_functions}} which is only supported in Hive 1.3.0/2.0.0 and up. 
> This is called when creating a new {{Hive}} object:
> {code}
>   private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
> conf = c;
> if (doRegisterAllFns) {
>   registerAllFunctionsOnce();
> }
>   }
> {code}
> {{registerAllFunctionsOnce}} will reload all the permanent functions by 
> calling the {{get_all_functions}} API from the metastore. In Spark, we always 
> pass {{doRegisterAllFns}} as true, and this will cause failure:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
>   ... 96 more
> Caused by: org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_all_functions'
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
> {code}
> It looks like Spark doesn't really need to call {{registerAllFunctionsOnce}} 
> since it loads the Hive permanent function directly from HMS API. The Hive 
> {{FunctionRegistry}} is only used for loading Hive built-in functions.
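
A hedged sketch of the direction suggested in the comment above: HIVE-21563 added {{Hive.getWithoutRegisterFns}}, so a caller that does not need the Hive function registry could use it when present and fall back to the old entry point otherwise (the fallback strategy here is an assumption for illustration, not Spark's actual change):
{code:scala}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.metadata.Hive

// Obtain a Hive client without triggering registerAllFunctionsOnce, so the
// get_all_functions Thrift call never reaches an HMS 1.2.x server.
def hiveClientSkippingFunctionRegistry(conf: HiveConf): Hive =
  try {
    Hive.getWithoutRegisterFns(conf) // entry point added by HIVE-21563
  } catch {
    case _: NoSuchMethodError => Hive.get(conf) // older Hive client on the classpath
  }
{code}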



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35315) Keep benchmark result consistent between spark-submit and SBT

2021-05-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-35315.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32440
[https://github.com/apache/spark/pull/32440]

> Keep benchmark result consistent between spark-submit and SBT
> -
>
> Key: SPARK-35315
> URL: https://issues.apache.org/jira/browse/SPARK-35315
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.2.0
>
>
> Currently benchmarks can be run in two ways: via {{spark-submit}} or via an SBT 
> command. However, in the former case Spark misses some properties such as 
> {{IS_TESTING}}, which is useful to turn on/off some behavior like codegen. 
> Therefore, the results could differ between the two methods. In addition, the 
> benchmark GitHub workflow uses the {{spark-submit}} approach.
> This proposes to set {{IS_TESTING}} to true in {{BenchmarkBase}} so that it is 
> always on.
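
A minimal sketch of the proposal (assuming the flag is the {{spark.testing}} system property that Spark's test utilities check; exactly where it would be set inside {{BenchmarkBase}} is not spelled out here):
{code:scala}
// Set the testing flag before any SparkSession/SparkContext is created, so a
// benchmark started via spark-submit behaves like one started from SBT.
System.setProperty("spark.testing", "true")
{code}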



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35315) Keep benchmark result consistent between spark-submit and SBT

2021-05-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-35315:
---

Assignee: Chao Sun

> Keep benchmark result consistent between spark-submit and SBT
> -
>
> Key: SPARK-35315
> URL: https://issues.apache.org/jira/browse/SPARK-35315
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>
> Currently benchmarks can be run in two ways: via {{spark-submit}} or via an SBT 
> command. However, in the former case Spark misses some properties such as 
> {{IS_TESTING}}, which is useful to turn on/off some behavior like codegen. 
> Therefore, the results could differ between the two methods. In addition, the 
> benchmark GitHub workflow uses the {{spark-submit}} approach.
> This proposes to set {{IS_TESTING}} to true in {{BenchmarkBase}} so that it is 
> always on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35316) UnwrapCastInBinaryComparison support In/InSet predicate

2021-05-04 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35316:

Description: 
It will not push down filters for In/InSet predicates:
{code:scala}
spark.range(50).selectExpr("cast(id as int) as 
id").write.mode("overwrite").parquet("/tmp/parquet/t1")
spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain
spark.read.parquet("/tmp/parquet/t1").where("id = 1L or id = 2L or id = 
4L").explain
{code}


{noformat}
== Physical Plan ==
*(1) Filter cast(id#5 as bigint) IN (1,2,4)
+- *(1) ColumnarToRow
   +- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) 
IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct


== Physical Plan ==
*(1) Filter (((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4))
+- *(1) ColumnarToRow
   +- FileScan parquet [id#7] Batched: true, DataFilters: [(((id#7 = 1) OR 
(id#7 = 2)) OR (id#7 = 4))], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: 
[Or(Or(EqualTo(id,1),EqualTo(id,2)),EqualTo(id,4))], ReadSchema: struct
{noformat}


  was:
In/InSet is missing this optimization:
{code:scala}
spark.range(50).selectExpr("cast(id as int) as 
id").write.mode("overwrite").parquet("/tmp/parquet/t1")
spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain
spark.read.parquet("/tmp/parquet/t1").where("id = 1L or id = 2L or id = 
4L").explain
{code}


{noformat}
== Physical Plan ==
*(1) Filter cast(id#5 as bigint) IN (1,2,4)
+- *(1) ColumnarToRow
   +- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) 
IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct


== Physical Plan ==
*(1) Filter (((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4))
+- *(1) ColumnarToRow
   +- FileScan parquet [id#7] Batched: true, DataFilters: [(((id#7 = 1) OR 
(id#7 = 2)) OR (id#7 = 4))], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: 
[Or(Or(EqualTo(id,1),EqualTo(id,2)),EqualTo(id,4))], ReadSchema: struct
{noformat}



> UnwrapCastInBinaryComparison support In/InSet predicate
> ---
>
> Key: SPARK-35316
> URL: https://issues.apache.org/jira/browse/SPARK-35316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> It will not push down filters for In/InSet predicates:
> {code:scala}
> spark.range(50).selectExpr("cast(id as int) as 
> id").write.mode("overwrite").parquet("/tmp/parquet/t1")
> spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain
> spark.read.parquet("/tmp/parquet/t1").where("id = 1L or id = 2L or id = 
> 4L").explain
> {code}
> {noformat}
> == Physical Plan ==
> *(1) Filter cast(id#5 as bigint) IN (1,2,4)
> +- *(1) ColumnarToRow
>+- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as 
> bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> == Physical Plan ==
> *(1) Filter (((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4))
> +- *(1) ColumnarToRow
>+- FileScan parquet [id#7] Batched: true, DataFilters: [(((id#7 = 1) OR 
> (id#7 = 2)) OR (id#7 = 4))], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: 
> [Or(Or(EqualTo(id,1),EqualTo(id,2)),EqualTo(id,4))], ReadSchema: 
> struct
> {noformat}
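
For comparison (a workaround sketch, not the proposed rule): when the IN list uses literals of the column's own type, no cast is inserted and the predicate can be pushed to Parquet today:
{code:scala}
// id is an int column, so int literals keep the predicate cast-free and it
// shows up under PushedFilters in the FileScan node.
spark.read.parquet("/tmp/parquet/t1").where("id in (1, 2, 4)").explain
{code}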



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35316) UnwrapCastInBinaryComparison support In/InSet predicate

2021-05-04 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35316:
---

 Summary: UnwrapCastInBinaryComparison support In/InSet predicate
 Key: SPARK-35316
 URL: https://issues.apache.org/jira/browse/SPARK-35316
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


In/InSet is missing this optimization:
{code:scala}
spark.range(50).selectExpr("cast(id as int) as 
id").write.mode("overwrite").parquet("/tmp/parquet/t1")
spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain
spark.read.parquet("/tmp/parquet/t1").where("id = 1L or id = 2L or id = 
4L").explain
{code}


{noformat}
== Physical Plan ==
*(1) Filter cast(id#5 as bigint) IN (1,2,4)
+- *(1) ColumnarToRow
   +- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) 
IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct


== Physical Plan ==
*(1) Filter (((id#7 = 1) OR (id#7 = 2)) OR (id#7 = 4))
+- *(1) ColumnarToRow
   +- FileScan parquet [id#7] Batched: true, DataFilters: [(((id#7 = 1) OR 
(id#7 = 2)) OR (id#7 = 4))], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: 
[Or(Or(EqualTo(id,1),EqualTo(id,2)),EqualTo(id,4))], ReadSchema: struct
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35286) Replace SessionState.start with SessionState.setCurrentSessionState

2021-05-01 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35286:
---

 Summary: Replace SessionState.start with 
SessionState.setCurrentSessionState
 Key: SPARK-35286
 URL: https://issues.apache.org/jira/browse/SPARK-35286
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


To avoid SessionState.createSessionDirs creating too many directories:

https://user-images.githubusercontent.com/5399861/116766834-28ea7080-aa5f-11eb-85ff-07bcaee444e5.png
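
A sketch of what the replacement could look like (assuming {{SessionState.setCurrentSessionState}} is accessible in the Hive version on the classpath; this is an illustration, not the actual patch):
{code:scala}
import org.apache.hadoop.hive.ql.session.SessionState

// Make an already-built SessionState current without calling start(), which is
// the path that goes through createSessionDirs and creates per-session directories.
def activate(state: SessionState): Unit =
  SessionState.setCurrentSessionState(state) // instead of SessionState.start(state)
{code}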



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35245) DynamicFilter pushdown not working

2021-05-01 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337807#comment-17337807
 ] 

Yuming Wang commented on SPARK-35245:
-

This is because the filtering side does not have a selective predicate: 
https://github.com/apache/spark/blob/19c7d2f3d8cda8d9bc5dfc1a0bf5d46845b1bc2f/sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala#L193-L201
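
For illustration only (the table and column names below are made up): the linked check looks for a selective predicate on the filtering side, which a bare VALUES list does not have; a filtered dimension table does:
{code:scala}
// Hypothetical dim_dates table; the WHERE on the filtering side is the kind of
// selective predicate PartitionPruning requires before it inserts a dynamic
// pruning filter on the partitioned probe side.
spark.sql("""
  SELECT f.timeperiod, f.rrname
  FROM TMP f
  JOIN dim_dates d ON f.timeperiod = d.timeperiod
  WHERE d.is_holiday = true
  GROUP BY f.timeperiod, f.rrname
""").explain(true)
{code}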

> DynamicFilter pushdown not working
> --
>
> Key: SPARK-35245
> URL: https://issues.apache.org/jira/browse/SPARK-35245
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: jean-claude
>Priority: Minor
>
>  
> The pushed filters list is always empty: `PushedFilters: []`.
> I was expecting the filters to be pushed down on the probe side of the join.
> I am not sure how to properly configure this to work. For example, how do I set 
> fallbackFilterRatio?
> spark = ( SparkSession.builder
>  .master('local')
>  .config("spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio", 
> 100)
>  .getOrCreate()
>  )
>  
> df = 
> spark.read.parquet('abfss://warehouse@/iceberg/opensource//data/timeperiod=2021-04-25/0-0-929b48ef-7ec3-47bd-b0a1-e9172c2dca6a-1.parquet')
> df.createOrReplaceTempView('TMP')
>  
> spark.sql('''
> explain cost
> select 
>    timeperiod, rrname
> from
>    TMP
> where
>    timeperiod in (
>  select
>    TO_DATE(d, 'MM-dd-') AS timeperiod
>  from
>  values
>    ('01-01-2021'),
>    ('01-01-2021'),
>    ('01-01-2021') tbl(d)
>  )
> group by
>    timeperiod, rrname
> ''').show(truncate=False)
> |== Optimized Logical Plan ==
> Aggregate [timeperiod#597, rrname#594], [timeperiod#597, rrname#594], 
> Statistics(sizeInBytes=69.0 MiB)
> +- Join LeftSemi, (timeperiod#597 = timeperiod#669), 
> Statistics(sizeInBytes=69.0 MiB)
>  :- Project [rrname#594, timeperiod#597], Statistics(sizeInBytes=69.0 MiB)
>  : +- 
> Relation[count#591,time_first#592L,time_last#593L,rrname#594,rrtype#595,rdata#596,timeperiod#597]
>  parquet, Statistics(sizeInBytes=198.5 MiB)
>  +- LocalRelation [timeperiod#669], Statistics(sizeInBytes=36.0 B)
> == Physical Plan ==
> *(2) HashAggregate(keys=[timeperiod#597, rrname#594], functions=[], 
> output=[timeperiod#597, rrname#594])
> +- Exchange hashpartitioning(timeperiod#597, rrname#594, 200), 
> ENSURE_REQUIREMENTS, [id=#839]
>  +- *(1) HashAggregate(keys=[timeperiod#597, rrname#594], functions=[], 
> output=[timeperiod#597, rrname#594])
>  +- *(1) BroadcastHashJoin [timeperiod#597], [timeperiod#669], LeftSemi, 
> BuildRight, false
>  :- *(1) ColumnarToRow
>  : +- FileScan parquet [rrname#594,timeperiod#597] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[abfss://warehouse@/iceberg/opensource/..., 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, date, 
> true]),false), [id=#822]
>  +- LocalTableScan [timeperiod#669]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35273) CombineFilters support non-deterministic expressions

2021-04-29 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35273:

Description: 
For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
(rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}


Another example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)")
spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
{code}


  was:
For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
(rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}


Another example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)")
spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
{code}



> CombineFilters support non-deterministic expressions
> 
>
> Key: SPARK-35273
> URL: https://issues.apache.org/jira/browse/SPARK-35273
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:scala}
> spark.sql("create table t1(id int) using parquet")
> spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
> id = 7 and rand() <= 0.01").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
> 0.01))), Statistics(sizeInBytes=1.0 B)
> +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
>+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
> (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
> +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
> {noformat}
> Another example:
> {code:scala}
> spark.sql("create table t1(id int) using parquet")
> spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)")
> spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
> {code}
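
A small, self-contained illustration of the intent (plain Scala, not the optimizer rule itself): combining the two filters is sound as long as the deterministic predicates keep running before the non-deterministic one inside the single merged filter, which is the shape of the "Expected" plan above.
{code:scala}
import scala.util.Random

// One merged pass: deterministic conditions first, rand() last.
val rows = Seq(1, 3, 6, 7, 9)
val merged = rows.filter { id =>
  !Set(1, 3, 6).contains(id) && id == 7 && Random.nextDouble() <= 0.01
}
{code}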



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35273) CombineFilters support non-deterministic expressions

2021-04-29 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35273:

Description: 
For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
(rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}


Another example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)")
spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
{code}


  was:
For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
(rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}


Another example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)")
spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
{code}



> CombineFilters support non-deterministic expressions
> 
>
> Key: SPARK-35273
> URL: https://issues.apache.org/jira/browse/SPARK-35273
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:scala}
> spark.sql("create table t1(id int) using parquet")
> spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
> id = 7 and rand() <= 0.01").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
> 0.01))), Statistics(sizeInBytes=1.0 B)
> +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
>+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
> (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
> +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
> {noformat}
> Another example:
> {code:scala}
> spark.sql("create table t1(id int) using parquet")
> spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)")
> spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35273) CombineFilters support non-deterministic expressions

2021-04-29 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35273:

Description: 
For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
(rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}


Another example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)")
spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
{code}


  was:
For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
(rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}





> CombineFilters support non-deterministic expressions
> 
>
> Key: SPARK-35273
> URL: https://issues.apache.org/jira/browse/SPARK-35273
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:scala}
> spark.sql("create table t1(id int) using parquet")
> spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
> id = 7 and rand() <= 0.01").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
> 0.01))), Statistics(sizeInBytes=1.0 B)
> +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
>+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
> (rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
> +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
> {noformat}
> Another example:
> {code:scala}
> spark.sql("create table t1(id int) using parquet")
> spark.sql("create table v1 as select * from t1 where id not in (1, 3, 6)")
> spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35273) CombineFilters support non-deterministic expressions

2021-04-29 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35273:
---

 Summary: CombineFilters support non-deterministic expressions
 Key: SPARK-35273
 URL: https://issues.apache.org/jira/browse/SPARK-35273
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
(rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35251) Improve LiveEntityHelpers.newAccumulatorInfos performance

2021-04-27 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35251:
---

 Summary: Improve LiveEntityHelpers.newAccumulatorInfos performance
 Key: SPARK-35251
 URL: https://issues.apache.org/jira/browse/SPARK-35251
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Yuming Wang


[It|https://github.com/apache/spark/blob/3854ad87c78f2a331f9c9c1a34f9ec281900f8fe/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L646-L661]
 will impact performance if there are a lot of {{AccumulableInfo}} instances.

{noformat}
 num #instances #bytes  class name
--
   1: 33189785050708371592  [C
   2:79535824591381896  [J
   3: 33083623410586759488  java.lang.String
   4: 139726297 8942483008  
org.apache.spark.sql.execution.metric.SQLMetric
   5: 249040156 5976963744  scala.Some
   6: 146278719 5851148760  org.apache.spark.util.AccumulatorMetadata
   7:  38448113 5536528272  java.net.URI
   8:  37540162 360382  org.apache.hadoop.fs.FileStatus
   9:  69724130 3346758240  java.util.Hashtable$Entry
  10:  61521559 2953034832  java.util.concurrent.ConcurrentHashMap$Node
  11:  50421974 2823630544  scala.collection.mutable.LinkedEntry
  12:  43349222 2774350208  org.apache.spark.scheduler.AccumulableInfo
{noformat}


{noformat}
--- 15430388364 ns (2.03%), 1543 samples
  [ 0] scala.collection.TraversableLike.noneIn$1
  [ 1] scala.collection.TraversableLike.filterImpl
  [ 2] scala.collection.TraversableLike.filterImpl$
  [ 3] scala.collection.AbstractTraversable.filterImpl
  [ 4] scala.collection.TraversableLike.filter
  [ 5] scala.collection.TraversableLike.filter$
  [ 6] scala.collection.AbstractTraversable.filter
  [ 7] org.apache.spark.status.LiveEntityHelpers$.newAccumulatorInfos
  [ 8] org.apache.spark.status.LiveTask.doUpdate
  [ 9] org.apache.spark.status.LiveEntity.write
{noformat}
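
A self-contained sketch of the optimization idea (hypothetical types and keep-condition, not Spark's actual code): build the result in one pass instead of going through {{TraversableLike.filter}} and intermediate collections for every task update.
{code:scala}
// Illustrative stand-in for AccumulableInfo.
case class AccInfo(id: Long, name: Option[String], update: Option[Any])

def newAccumulatorInfos(accums: Seq[AccInfo]): IndexedSeq[AccInfo] = {
  val out = Vector.newBuilder[AccInfo]
  accums.foreach { acc =>
    // illustrative keep-condition; the real filtering lives in LiveEntityHelpers
    if (acc.name.isDefined && acc.update.isDefined) out += acc
  }
  out.result()
}
{code}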







--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34897) Support reconcile schemas based on index after nested column pruning

2021-04-24 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331225#comment-17331225
 ] 

Yuming Wang commented on SPARK-34897:
-

Issue resolved by pull request 31993
https://github.com/apache/spark/pull/31993

> Support reconcile schemas based on index after nested column pruning
> 
>
> Key: SPARK-34897
> URL: https://issues.apache.org/jira/browse/SPARK-34897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> How to reproduce this issue:
> {code:scala}
> spark.sql(
>   """
> |CREATE TABLE `t1` (
> |  `_col0` INT,
> |  `_col1` STRING,
> |  `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>,
> |  `_col3` STRING)
> |USING orc
> |PARTITIONED BY (_col3)
> |""".stripMargin)
> spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')")
> spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show
> {code}
> Error message:
> {noformat}
> java.lang.AssertionError: assertion failed: The given data schema 
> struct<_col0:int,_col2:struct> has less fields than the actual ORC 
> physical schema, no idea which columns were dropped, fail to read. Try to 
> disable 
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
> {noformat}
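
Until a build with the fix is available, the workaround mentioned elsewhere in this thread is to disable nested schema pruning for the affected query (sketch):
{code:scala}
// Send the full data schema to the ORC reader so requestedColumnIds sees all columns.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show
{code}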



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34897) Support reconcile schemas based on index after nested column pruning

2021-04-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34897.
-
Fix Version/s: 3.2.0
   3.1.2
   3.0.3
 Assignee: Yuming Wang
   Resolution: Fixed

> Support reconcile schemas based on index after nested column pruning
> 
>
> Key: SPARK-34897
> URL: https://issues.apache.org/jira/browse/SPARK-34897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> How to reproduce this issue:
> {code:scala}
> spark.sql(
>   """
> |CREATE TABLE `t1` (
> |  `_col0` INT,
> |  `_col1` STRING,
> |  `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>,
> |  `_col3` STRING)
> |USING orc
> |PARTITIONED BY (_col3)
> |""".stripMargin)
> spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')")
> spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show
> {code}
> Error message:
> {noformat}
> java.lang.AssertionError: assertion failed: The given data schema 
> struct<_col0:int,_col2:struct> has less fields than the actual ORC 
> physical schema, no idea which columns were dropped, fail to read. Try to 
> disable 
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35203) Improve Repartition statistics estimation

2021-04-23 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35203:
---

 Summary: Improve Repartition statistics estimation
 Key: SPARK-35203
 URL: https://issues.apache.org/jira/browse/SPARK-35203
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35191) all columns are read even if column pruning applies when spark3.0 read table written by spark2.2

2021-04-22 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329932#comment-17329932
 ] 

Yuming Wang commented on SPARK-35191:
-

Could you check if it works after https://github.com/apache/spark/pull/31993?

> all columns are read even if column pruning applies when spark3.0 read table 
> written by spark2.2
> 
>
> Key: SPARK-35191
> URL: https://issues.apache.org/jira/browse/SPARK-35191
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: spark3.0
> spark.sql.hive.convertMetastoreOrc=true(default value in spark3.0)
> spark.sql.orc.impl=native(default value in spark3.0)
>Reporter: xiaoli
>Priority: Major
>
> Before I address this issue, let me talk about the background: the 
> current Spark version we use is 2.2, and we plan to migrate to spark3.0 in 
> the near future. Before migrating, we tested some queries in both spark2.2 and 
> spark3.0 to check for potential issues. The data source tables of these queries are 
> in ORC format, written by spark2.2.
>  
> I find that even if column pruning is applied, spark3.0’s native reader will 
> read all columns.
>  
> Then I did some remote debugging. In OrcUtils.scala's requestedColumnIds method, it 
> checks whether the field names start with "_col". In my case, the field names 
> do start with "_col", like "_col1", "_col2", so pruneCols is not done. The 
> code is below:
>  
> if (orcFieldNames.forall(_.startsWith("_col"))) {
>   // This is a ORC file written by Hive, no field names in the physical 
> schema, assume the
>   // physical schema maps to the data scheme by index.
>   assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
>     s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
>     "no idea which columns were dropped, fail to read.")
>   // for ORC file written by Hive, no field names
>   // in the physical schema, there is a need to send the
>   // entire dataSchema instead of required schema.
>   // So pruneCols is not done in this case
>   Some(requiredSchema.fieldNames.map { name =>
>     val index = dataSchema.fieldIndex(name)
>     if (index < orcFieldNames.length) {
>       index
>     } else {
>       -1
>     }
>   }, false)
>  
>  Although this code comment explains the reason, I still do not understand it. This 
> issue only happens in one case: spark3.0 uses the native reader to read a table 
> written by spark2.2. 
>  
> In other cases, there is no such issue. I did another 2 tests:
> Test1: use spark3.0’s hive reader (running with 
> spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read 
> the same table, it only reads pruned columns.
> Test2: use spark3.0 to write a table, then use spark3.0’s native reader to 
> read this new table, it only reads pruned columns.
>  
>  This issue blocks us from using the native reader in spark3.0. Does 
> anyone know the underlying reason or have a solution?
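
As an aside, the settings from Test1 in the report above can be expressed per session while this is investigated (sketch; it falls back to the Hive ORC reader for these Hive-written files):
{code:scala}
// Matches Test1 above: the Hive reader honors column pruning for ORC files
// whose physical schema only has _colN field names.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.conf.set("spark.sql.orc.impl", "hive")
{code}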



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35185) Improve Distinct statistics estimation

2021-04-22 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35185:
---

 Summary: Improve Distinct statistics estimation
 Key: SPARK-35185
 URL: https://issues.apache.org/jira/browse/SPARK-35185
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35121) Improve JoinSelection when join condition is not defined

2021-04-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35121:

Summary: Improve JoinSelection when join condition is not defined  (was: 
Improve JoinSelection when join condition is empty)

> Improve JoinSelection when join condition is not defined
> 
>
> Key: SPARK-35121
> URL: https://issues.apache.org/jira/browse/SPARK-35121
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35121) Improve JoinSelection when join condition is empty

2021-04-17 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35121:
---

 Summary: Improve JoinSelection when join condition is empty
 Key: SPARK-35121
 URL: https://issues.apache.org/jira/browse/SPARK-35121
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35118) Propagate empty relation through Join if join condition is empty

2021-04-17 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35118:
---

 Summary: Propagate empty relation through Join if join condition 
is empty
 Key: SPARK-35118
 URL: https://issues.apache.org/jira/browse/SPARK-35118
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


{code:scala}
spark.sql("create table t1 using parquet as select id as a, id as b from 
range(10)")
spark.sql("create table t2 using parquet as select id as a, id as b from 
range(20)")
spark.sql("select t1.a from t1 left semi join t2").explain(true)
{code}

Current plan:
{noformat}
== Optimized Logical Plan ==
Join LeftSemi
:- Project [a#6L]
:  +- Relation default.t1[a#6L,b#7L] parquet
+- Project
   +- Relation default.t2[a#8L,b#9L] parquet
{noformat}


Expected plan:
{noformat}
== Optimized Logical Plan ==
Project [a#6L]
+- Relation default.t1[a#6L,b#7L] parquet
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-04-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34087:

Attachment: screenshot-1.png

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Assignee: wuyi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: 1610451044690.jpg
>
>
> In Spark-3.0.1, a memory leak occurs when we keep cloning the spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue every 
> time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-04-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34087:

Attachment: (was: screenshot-1.png)

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Assignee: wuyi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: 1610451044690.jpg
>
>
> In Spark-3.0.1, a memory leak occurs when we keep cloning the spark session, 
> because a new ExecutionListenerBus instance is added to the AsyncEventQueue every 
> time we clone a session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34897) Support reconcile schemas based on index after nested column pruning

2021-04-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34897:

Summary: Support reconcile schemas based on index after nested column 
pruning  (was: The given data schema has less fields than the actual ORC 
physical schema)

> Support reconcile schemas based on index after nested column pruning
> 
>
> Key: SPARK-34897
> URL: https://issues.apache.org/jira/browse/SPARK-34897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.sql(
>   """
> |CREATE TABLE `t1` (
> |  `_col0` INT,
> |  `_col1` STRING,
> |  `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>,
> |  `_col3` STRING)
> |USING orc
> |PARTITIONED BY (_col3)
> |""".stripMargin)
> spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')")
> spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show
> {code}
> Error message:
> {noformat}
> java.lang.AssertionError: assertion failed: The given data schema 
> struct<_col0:int,_col2:struct> has less fields than the actual ORC 
> physical schema, no idea which columns were dropped, fail to read. Try to 
> disable 
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-34897) Support reconcile schemas based on index after nested column pruning

2021-04-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34897:

Comment: was deleted

(was: We can work around this issue by setting 
spark.sql.optimizer.nestedSchemaPruning.enabled to false.)

> Support reconcile schemas based on index after nested column pruning
> 
>
> Key: SPARK-34897
> URL: https://issues.apache.org/jira/browse/SPARK-34897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.sql(
>   """
> |CREATE TABLE `t1` (
> |  `_col0` INT,
> |  `_col1` STRING,
> |  `_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT>,
> |  `_col3` STRING)
> |USING orc
> |PARTITIONED BY (_col3)
> |""".stripMargin)
> spark.sql("INSERT INTO `t1` values(1, '2', null, '2021-02-01')")
> spark.sql("SELECT _col2.c1, _col0 FROM `t1` WHERE _col3 = '2021-02-01'").show
> {code}
> Error message:
> {noformat}
> java.lang.AssertionError: assertion failed: The given data schema 
> struct<_col0:int,_col2:struct> has less fields than the actual ORC 
> physical schema, no idea which columns were dropped, fail to read. Try to 
> disable 
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2620)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:117)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:94)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35007) Spark 2.4.x version does not support numeric

2021-04-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-35007.
-
Resolution: Won't Fix

> Spark 2.4.x version does not support numeric
> 
>
> Key: SPARK-35007
> URL: https://issues.apache.org/jira/browse/SPARK-35007
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: Sun BiaoBiao
>Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> as said in this pr:  [https://github.com/apache/spark/pull/31891] , 
>  
> So it seems the key difference between DECIMAL and NUMERIC is:
> {quote}implementation-defined decimal precision equal to or greater than the 
> value of the specified precision.
> {quote}
> Decimal can have _at least_ specified precision (which means can have more) 
> whereas numeric should have exactly specified precision.
> Spark's decimal satisfies both, so I think NUMERIC as a synonym of DECIMAL makes 
> sense.
>  
> My code needs to use the NUMERIC type in the spark2.4 version. 
>  
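
For reference, a sketch of what works on 2.4 today, following the reasoning quoted above that DECIMAL covers the NUMERIC semantics:
{code:scala}
// Spell the type as DECIMAL(p, s) where the original DDL/query says NUMERIC(p, s).
spark.sql("SELECT CAST(12.345 AS DECIMAL(10, 2)) AS amount").show()
{code}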



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35007) Spark 2.4.x version does not support numeric

2021-04-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35007:

Target Version/s:   (was: 2.4.8)

> Spark 2.4.x version does not support numeric
> 
>
> Key: SPARK-35007
> URL: https://issues.apache.org/jira/browse/SPARK-35007
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: Sun BiaoBiao
>Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> as said in this pr:  [https://github.com/apache/spark/pull/31891] , 
>  
> So it seems the key difference between DECIMAL and NUMERIC is:
> {quote}implementation-defined decimal precision equal to or greater than the 
> value of the specified precision.
> {quote}
> Decimal can have _at least_ specified precision (which means can have more) 
> whereas numeric should have exactly specified precision.
> Spark's decimal satisfies both, so I think NUMERIC as a synonym of DECIMAL makes 
> sense.
>  
> My code needs to use the NUMERIC type in the spark2.4 version. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35010) nestedSchemaPruning causes issue when reading hive generated Orc files

2021-04-09 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318347#comment-17318347
 ] 

Yuming Wang commented on SPARK-35010:
-

Yes. It is an issue:
https://github.com/apache/spark/pull/31993

> nestedSchemaPruning causes issue when reading hive generated Orc files
> --
>
> Key: SPARK-35010
> URL: https://issues.apache.org/jira/browse/SPARK-35010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Baohe Zhang
>Priority: Critical
>
> In spark3, we have spark.sql.orc.imple=native and 
> spark.sql.optimizer.nestedSchemaPruning.enabled=true as the default settings. 
> And these would cause issues when query struct field of hive-generated orc 
> files.
> For example, we got an error when running this query in spark3
> {code:java}
> spark.table("testtable").filter(col("utc_date") === 
> "20210122").select(col("open_count.d35")).show(false)
> {code}
> The error is
> {code:java}
> Caused by: java.lang.AssertionError: assertion failed: The given data schema 
> struct>> has less fields than the 
> actual ORC physical schema, no idea which columns were dropped, fail to read.
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:153)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> I think the reason is that we apply nestedSchemaPruning to the dataSchema: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L75]
> nestedSchemaPruning not only prunes the unused fields of a struct, it also 
> prunes unused columns. In my test, the dataSchema originally had 48 columns, 
> but after nested schema pruning it was pruned to a single column. This pruning 
> results in an assertion error at 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L159]
>  because column pruning is not supported for Hive-generated ORC files.
> This issue also seems related to the Hive version: we use Hive 1.2, and its 
> ORC files do not contain field names in the physical schema.
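
A possible mitigation sketch for the behaviour described above (an assumption, not the actual SPARK-35010 fix): disable nested schema pruning, or fall back to the Hive ORC reader, before running the failing query.
{code:scala}
import org.apache.spark.sql.functions.col

// Assumes an active spark-shell session; "testtable" is the Hive-generated ORC table above.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
// alternatively: spark.conf.set("spark.sql.orc.impl", "hive")

spark.table("testtable")
  .filter(col("utc_date") === "20210122")
  .select(col("open_count.d35"))
  .show(false)
{code}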



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35002) Fix the java.net.BindException when testing with Github Action

2021-04-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35002:

Summary: Fix the java.net.BindException when testing with Github Action  
(was: Try to fix the java.net.BindException when testing with Github Action)

> Fix the java.net.BindException when testing with Github Action
> --
>
> Key: SPARK-35002
> URL: https://issues.apache.org/jira/browse/SPARK-35002
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> [info] org.apache.spark.sql.kafka010.producer.InternalKafkaProducerPoolSuite 
> *** ABORTED *** (282 milliseconds)
> [info]   java.net.BindException: Cannot assign requested address: Service 
> 'sparkDriver' failed after 100 retries (on a random free port)! Consider 
> explicitly setting the appropriate binding address for the service 
> 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the 
> correct binding address.
> [info]   at sun.nio.ch.Net.bind0(Native Method)
> [info]   at sun.nio.ch.Net.bind(Net.java:461)
> [info]   at sun.nio.ch.Net.bind(Net.java:453)
> {noformat}
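
A minimal sketch of the mitigation the error message itself suggests: pin the driver to the loopback address so hostname resolution on the CI host cannot yield an unbindable address. Whether this matches the change actually made for SPARK-35002 is an assumption.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("bind-address-sketch")
  .config("spark.driver.bindAddress", "127.0.0.1") // bind the 'sparkDriver' service to loopback
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()
{code}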



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35002) Try to fix the java.net.BindException when testing with Github Action

2021-04-08 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35002:
---

 Summary: Try to fix the java.net.BindException when testing with 
Github Action
 Key: SPARK-35002
 URL: https://issues.apache.org/jira/browse/SPARK-35002
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Yuming Wang



{noformat}
[info] org.apache.spark.sql.kafka010.producer.InternalKafkaProducerPoolSuite 
*** ABORTED *** (282 milliseconds)
[info]   java.net.BindException: Cannot assign requested address: Service 
'sparkDriver' failed after 100 retries (on a random free port)! Consider 
explicitly setting the appropriate binding address for the service 
'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the 
correct binding address.
[info]   at sun.nio.ch.Net.bind0(Native Method)
[info]   at sun.nio.ch.Net.bind(Net.java:461)
[info]   at sun.nio.ch.Net.bind(Net.java:453)
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34966) Avoid shuffle if join type do not match

2021-04-06 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34966.
-
Resolution: Invalid

https://github.com/apache/spark/blob/69aa727ff495f6698fe9b37e952dfaf36f1dd5eb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L504-L507

These are the same:
import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, Literal, Murmur3Hash, Pmod}
import org.apache.spark.sql.types.{IntegerType, LongType}

println(Pmod(new Murmur3Hash(Seq(Literal(100.toShort))), Literal(10)).eval(EmptyRow))
println(Pmod(new Murmur3Hash(Seq(new Cast(Literal(100.toShort), IntegerType))), Literal(10)).eval(EmptyRow))

But these are not the same:
println(Pmod(new Murmur3Hash(Seq(Literal(100))), Literal(10)).eval(EmptyRow))
println(Pmod(new Murmur3Hash(Seq(new Cast(Literal(100), LongType))), Literal(10)).eval(EmptyRow))

That is, casting the int join key to bigint changes its Murmur3 hash, so the hashpartitioning required on cast(id as bigint) cannot be satisfied by t2's int-based bucketing and the Exchange cannot be removed.

> Avoid shuffle if join type do not match
> ---
>
> Key: SPARK-34966
> URL: https://issues.apache.org/jira/browse/SPARK-34966
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> spark.sql("CREATE TABLE t1 using parquet clustered by (id) into 200 
> buckets AS SELECT cast(id as bigint) FROM range(1000)")
> spark.sql("CREATE TABLE t2 using parquet clustered by (id) into 200 
> buckets AS SELECT cast(id as int) FROM range(500)")
> spark.sql("select * from t1 join t2 on (t1.id = t2.id)").explain
> {code}
> Current:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SortMergeJoin [id#14L], [cast(id#15 as bigint)], Inner
>:- Sort [id#14L ASC NULLS FIRST], false, 0
>:  +- Filter isnotnull(id#14L)
>: +- FileScan parquet default.t1[id#14L] Batched: true, DataFilters: 
> [isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct<id:bigint>, SelectedBucketsCount: 200 out of 200
>+- Sort [cast(id#15 as bigint) ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(cast(id#15 as bigint), 200), 
> ENSURE_REQUIREMENTS, [id=#58]
>  +- Filter isnotnull(id#15)
> +- FileScan parquet default.t2[id#15] Batched: true, DataFilters: 
> [isnotnull(id#15)], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct<id:int>
> {noformat}
> Expected:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SortMergeJoin [id#14L], [cast(id#15 as bigint)], Inner
>:- Sort [id#14L ASC NULLS FIRST], false, 0
>:  +- Filter isnotnull(id#14L)
>: +- FileScan parquet default.t1[id#14L] Batched: true, DataFilters: 
> [isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct<id:bigint>, SelectedBucketsCount: 200 out of 200
>+- Sort [cast(id#15 as bigint) ASC NULLS FIRST], false, 0
>   +- Filter isnotnull(id#15)
>  +- FileScan parquet default.t2[id#15] Batched: true, DataFilters: 
> [isnotnull(id#15)], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct<id:int>, SelectedBucketsCount: 200 out of 200
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34967) Regression in spark 3.1.1 for window function and struct binding resolution

2021-04-06 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315564#comment-17315564
 ] 

Yuming Wang commented on SPARK-34967:
-

How to reproduce this issue?

> Regression in spark 3.1.1 for window function and struct binding resolution
> ---
>
> Key: SPARK-34967
> URL: https://issues.apache.org/jira/browse/SPARK-34967
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Lukasz Z
>Priority: Major
>
> https://github.com/zulk666/SparkTestProject/blob/main/src/main/scala/FailStructWithWindow.scala
> When I execute this example code in Spark 3.0.2 or 2.4.7 I get a normal 
> result, but in Spark 3.1.1 I get an exception. 
> This is a struct field selection after a window function aggregation.
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_32#32Exception in thread "main" 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_32#32 at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:286) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:279) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>  at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35)
>  at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>  at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405) 
> at 
> 
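
For illustration, a minimal sketch of the query shape the description refers to (selecting a struct field after a window aggregation); the data and column names are assumptions, and whether this exact form triggers the binding error on 3.1.1 is not verified against the linked reproducer.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[2]").appName("struct-window-sketch").getOrCreate()
import spark.implicits._

// A struct column, aggregated over a window, followed by a nested field selection.
val df = Seq((1, "a", 10), (1, "b", 20), (2, "c", 30))
  .toDF("id", "name", "value")
  .withColumn("payload", struct($"name", $"value"))

val w = Window.partitionBy($"id").orderBy($"value".desc)
df.withColumn("top", first($"payload").over(w))
  .select($"id", $"top.name") // struct field selection after the window aggregation
  .show()
{code}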

[jira] [Created] (SPARK-34966) Avoid shuffle if join type do not match

2021-04-06 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34966:
---

 Summary: Avoid shuffle if join type do not match
 Key: SPARK-34966
 URL: https://issues.apache.org/jira/browse/SPARK-34966
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


How to reproduce this issue:
{code:scala}
spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1")
spark.sql("CREATE TABLE t1 using parquet clustered by (id) into 200 buckets 
AS SELECT cast(id as bigint) FROM range(1000)")
spark.sql("CREATE TABLE t2 using parquet clustered by (id) into 200 buckets 
AS SELECT cast(id as int) FROM range(500)")
spark.sql("select * from t1 join t2 on (t1.id = t2.id)").explain
{code}

Current:
{noformat}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#14L], [cast(id#15 as bigint)], Inner
   :- Sort [id#14L ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(id#14L)
   : +- FileScan parquet default.t1[id#14L] Batched: true, DataFilters: 
[isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark,
 PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
struct<id:bigint>, SelectedBucketsCount: 200 out of 200
   +- Sort [cast(id#15 as bigint) ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(cast(id#15 as bigint), 200), 
ENSURE_REQUIREMENTS, [id=#58]
 +- Filter isnotnull(id#15)
+- FileScan parquet default.t2[id#15] Batched: true, DataFilters: 
[isnotnull(id#15)], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark,
 PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
struct<id:int>
{noformat}

Expected:
{noformat}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#14L], [cast(id#15 as bigint)], Inner
   :- Sort [id#14L ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(id#14L)
   : +- FileScan parquet default.t1[id#14L] Batched: true, DataFilters: 
[isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark,
 PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
struct<id:bigint>, SelectedBucketsCount: 200 out of 200
   +- Sort [cast(id#15 as bigint) ASC NULLS FIRST], false, 0
  +- Filter isnotnull(id#15)
 +- FileScan parquet default.t2[id#15] Batched: true, DataFilters: 
[isnotnull(id#15)], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark,
 PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
struct<id:int>, SelectedBucketsCount: 200 out of 200
{noformat}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33979) Filter predicate reorder

2021-04-02 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33979:

Description: 
Reorder filter predicate to improve query performance:
{noformat}
others < In < Like < UDF/CaseWhen/If < Inset < LikeAny/LikeAll
{noformat}
[https://www.ibm.com/support/knowledgecenter/SSSHTQ_8.1.0/com.ibm.netcool_OMNIbus.doc_8.1.0/omnibus/wip/admin/reference/omn_adm_per_optimizationrules.html#omn_adm_per_optimizationrules__reorder]
 
[https://docs.oracle.com/en/database/oracle/oracle-database/21/addci/extensible-optimizer-interface.html#GUID-28A4EDA6-19DD-4773-B3B8-1802C3B01E21]
 [https://docs.oracle.com/cd/B10501_01/server.920/a96533/hintsref.htm#13676]

https://issues.apache.org/jira/browse/HIVE-21857
https://issues.apache.org/jira/browse/IMPALA-2805

 

  was:
Reorder filter predicate to improve query performance:
{noformat}
others < In < Like < UDF/CaseWhen/If < Inset < LikeAny/LikeAll
{noformat}
[https://www.ibm.com/support/knowledgecenter/SSSHTQ_8.1.0/com.ibm.netcool_OMNIbus.doc_8.1.0/omnibus/wip/admin/reference/omn_adm_per_optimizationrules.html#omn_adm_per_optimizationrules__reorder]
 
[https://docs.oracle.com/en/database/oracle/oracle-database/21/addci/extensible-optimizer-interface.html#GUID-28A4EDA6-19DD-4773-B3B8-1802C3B01E21]
 [https://docs.oracle.com/cd/B10501_01/server.920/a96533/hintsref.htm#13676]

https://issues.apache.org/jira/browse/HIVE-21857

 


> Filter predicate reorder
> 
>
> Key: SPARK-33979
> URL: https://issues.apache.org/jira/browse/SPARK-33979
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Reorder filter predicate to improve query performance:
> {noformat}
> others < In < Like < UDF/CaseWhen/If < Inset < LikeAny/LikeAll
> {noformat}
> [https://www.ibm.com/support/knowledgecenter/SSSHTQ_8.1.0/com.ibm.netcool_OMNIbus.doc_8.1.0/omnibus/wip/admin/reference/omn_adm_per_optimizationrules.html#omn_adm_per_optimizationrules__reorder]
>  
> [https://docs.oracle.com/en/database/oracle/oracle-database/21/addci/extensible-optimizer-interface.html#GUID-28A4EDA6-19DD-4773-B3B8-1802C3B01E21]
>  [https://docs.oracle.com/cd/B10501_01/server.920/a96533/hintsref.htm#13676]
> https://issues.apache.org/jira/browse/HIVE-21857
> https://issues.apache.org/jira/browse/IMPALA-2805
>  
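
An illustrative sketch of the intended effect (hypothetical table and column names, not the actual rule implementation): under the proposed ordering, a cheap IN predicate is evaluated before a costly UDF in the same conjunction, so short-circuiting skips the UDF for most rows.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[2]").appName("predicate-reorder-sketch").getOrCreate()

// A deliberately slow predicate standing in for an expensive UDF.
val expensive = udf((s: String) => { Thread.sleep(1); s != null && s.startsWith("a") })

// As written, the costly UDF comes first in the conjunction ...
val original  = spark.table("events").where(expensive(col("name")) && col("id").isin(1, 2, 3))
// ... whereas the proposed reordering would effectively evaluate it as:
val reordered = spark.table("events").where(col("id").isin(1, 2, 3) && expensive(col("name")))
{code}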



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-34931) CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which leading to job failed.

2021-04-01 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34931:

Comment: was deleted

(was: User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32028)

> CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which 
> leading to job failed.
> -
>
> Key: SPARK-34931
> URL: https://issues.apache.org/jira/browse/SPARK-34931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> When an executor is lost for some reason (e.g. unable to register with the 
> external shuffle server), CoarseGrainedExecutorBackend will send a RemoveExecutor 
> event with 'ExecutorLossReason'. But this causes TaskSetManager to handle the 
> failed task (handleFailedTask) with exitCausedByApp=true, which is not correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-34931) CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which leading to job failed.

2021-04-01 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34931:

Comment: was deleted

(was: User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32021)

> CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which 
> leading to job failed.
> -
>
> Key: SPARK-34931
> URL: https://issues.apache.org/jira/browse/SPARK-34931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> When an executor is lost for some reason (e.g. unable to register with the 
> external shuffle server), CoarseGrainedExecutorBackend will send a RemoveExecutor 
> event with 'ExecutorLossReason'. But this causes TaskSetManager to handle the 
> failed task (handleFailedTask) with exitCausedByApp=true, which is not correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org





