[jira] [Updated] (SPARK-45954) Avoid generating redundant ShuffleExchangeExec node
[ https://issues.apache.org/jira/browse/SPARK-45954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45954:
--------------------------------
    Summary: Avoid generating redundant ShuffleExchangeExec node  (was: Remove redundant shuffles)

> Avoid generating redundant ShuffleExchangeExec node
> ---------------------------------------------------
>
>                 Key: SPARK-45954
>                 URL: https://issues.apache.org/jira/browse/SPARK-45954
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45954) Remove redundant shuffles
Yuming Wang created SPARK-45954:
-----------------------------------

             Summary: Remove redundant shuffles
                 Key: SPARK-45954
                 URL: https://issues.apache.org/jira/browse/SPARK-45954
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
--------------------------------
    Description:
We should set the view name to sparkSession.sparkContext.setJobDescription("xxx")

!screenshot-1.png!

  was:
Need to sparkSession.sparkContext.setJobDescription("xxx")

!screenshot-1.png!

> Set a human readable description for Dataset api
> ------------------------------------------------
>
>                 Key: SPARK-45947
>                 URL: https://issues.apache.org/jira/browse/SPARK-45947
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>         Attachments: screenshot-1.png
>
> We should set the view name to sparkSession.sparkContext.setJobDescription("xxx")
> !screenshot-1.png!
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
--------------------------------
    Description:
Need to sparkSession.sparkContext.setJobDescription("xxx")

!screenshot-1.png!

> Set a human readable description for Dataset api
> ------------------------------------------------
>
>                 Key: SPARK-45947
>                 URL: https://issues.apache.org/jira/browse/SPARK-45947
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>         Attachments: screenshot-1.png
>
> Need to sparkSession.sparkContext.setJobDescription("xxx")
> !screenshot-1.png!
[jira] [Created] (SPARK-45947) Set a human readable description for Dataset api
Yuming Wang created SPARK-45947:
-----------------------------------

             Summary: Set a human readable description for Dataset api
                 Key: SPARK-45947
                 URL: https://issues.apache.org/jira/browse/SPARK-45947
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
         Attachments: screenshot-1.png
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
--------------------------------
    Attachment: screenshot-1.png

> Set a human readable description for Dataset api
> ------------------------------------------------
>
>                 Key: SPARK-45947
>                 URL: https://issues.apache.org/jira/browse/SPARK-45947
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>         Attachments: screenshot-1.png
>
[jira] [Updated] (SPARK-45915) Treat decimal(x, 0) the same as IntegralType in PromoteStrings
[ https://issues.apache.org/jira/browse/SPARK-45915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45915:
--------------------------------
    Summary: Treat decimal(x, 0) the same as IntegralType in PromoteStrings  (was: Unwrap cast in predicate)

> Treat decimal(x, 0) the same as IntegralType in PromoteStrings
> --------------------------------------------------------------
>
>                 Key: SPARK-45915
>                 URL: https://issues.apache.org/jira/browse/SPARK-45915
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
>
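For illustration only (the table name and literals below are hypothetical, not taken from the ticket), a sketch of the kind of comparison PromoteStrings handles: when a string column is compared with a numeric literal, type promotion decides which side gets cast, and treating a scale-0 decimal like an integral type would make the two predicates below promote the same way.

```scala
// Hypothetical sketch; t45915 and the literal values are made up.
spark.sql("create table t45915(id string) using parquet")

// An integral literal and a scale-0 decimal literal (Spark's BD suffix)
// denote the same value, but PromoteStrings can currently promote the
// string column `id` differently for the two comparisons:
spark.sql("select * from t45915 where id = 123").explain(true)
spark.sql("select * from t45915 where id = 123BD").explain(true)
```

Comparing the two analyzed plans shows whether the casts inserted around `id` match, which is the behavior this ticket proposes to unify.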
[jira] [Created] (SPARK-45915) Unwrap cast in predicate
Yuming Wang created SPARK-45915:
-----------------------------------

             Summary: Unwrap cast in predicate
                 Key: SPARK-45915
                 URL: https://issues.apache.org/jira/browse/SPARK-45915
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
[jira] [Created] (SPARK-45909) Remove the cast if it can safely up-cast in IsNotNull
Yuming Wang created SPARK-45909:
-----------------------------------

             Summary: Remove the cast if it can safely up-cast in IsNotNull
                 Key: SPARK-45909
                 URL: https://issues.apache.org/jira/browse/SPARK-45909
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
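A sketch of the pattern the summary describes (hypothetical table and column names): a cast that can safely up-cast never turns a non-null value into null, so `IsNotNull(Cast(a))` is equivalent to `IsNotNull(a)`, and dropping the cast can let the null filter be pushed down to the data source.

```scala
// Hypothetical sketch; t45909 is made up for illustration.
spark.sql("create table t45909(a int) using parquet")

// cast(a as bigint) is a safe up-cast (int -> bigint), so
//   isnotnull(cast(a as bigint))  is equivalent to  isnotnull(a)
// and the simplified predicate can be pushed to the Parquet scan.
spark.sql("select * from t45909 where cast(a as bigint) is not null").explain(true)
```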
[jira] [Updated] (SPARK-45894) hive table level setting hadoop.mapred.max.split.size
[ https://issues.apache.org/jira/browse/SPARK-45894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45894:
--------------------------------
    Target Version/s:   (was: 3.5.0)

> hive table level setting hadoop.mapred.max.split.size
> -----------------------------------------------------
>
>                 Key: SPARK-45894
>                 URL: https://issues.apache.org/jira/browse/SPARK-45894
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: guihuawen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> In a Hive table scan, configuring the hadoop.mapred.max.split.size parameter can increase the parallelism of the scan stage and thereby reduce the running time.
> However, when a large table and a small table appear in the same query, a single global hadoop.mapred.max.split.size means some stages run very many tasks while others run very few. Allowing hadoop.mapred.max.split.size to be set separately for each Hive table keeps the parallelism balanced.
[jira] [Created] (SPARK-45895) Combine multiple like to like all
Yuming Wang created SPARK-45895:
-----------------------------------

             Summary: Combine multiple like to like all
                 Key: SPARK-45895
                 URL: https://issues.apache.org/jira/browse/SPARK-45895
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang

{code:scala}
spark.sql("create table t(a string, b string, c string) using parquet")
spark.sql(
  """
    |select * from t where
    |substr(a, 1, 5) like '%a%' and
    |substr(a, 1, 5) like '%b%'
    |""".stripMargin).explain(true)
{code}
We can optimize the query to:
{code:scala}
spark.sql(
  """
    |select * from t where
    |substr(a, 1, 5) like all('%a%', '%b%')
    |""".stripMargin).explain(true)
{code}
[jira] [Created] (SPARK-45853) Add Iceberg and Hudi to third party projects
Yuming Wang created SPARK-45853:
-----------------------------------

             Summary: Add Iceberg and Hudi to third party projects
                 Key: SPARK-45853
                 URL: https://issues.apache.org/jira/browse/SPARK-45853
             Project: Spark
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 4.0.0
            Reporter: Yuming Wang

{noformat}
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: java.util.concurrent.ExecutionException: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: iceberg. Please find packages at `https://spark.apache.org/third-party-projects.html`.
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:46)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
{noformat}
[jira] [Updated] (SPARK-45848) spark-build-info.ps1 missing the docroot property
[ https://issues.apache.org/jira/browse/SPARK-45848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45848:
--------------------------------
    Description:
https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44
https://github.com/apache/spark/blob/master/build/spark-build-info#L30-L36

  was: https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44

> spark-build-info.ps1 missing the docroot property
> -------------------------------------------------
>
>                 Key: SPARK-45848
>                 URL: https://issues.apache.org/jira/browse/SPARK-45848
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44
> https://github.com/apache/spark/blob/master/build/spark-build-info#L30-L36
[jira] [Created] (SPARK-45848) spark-build-info.ps1 missing the docroot property
Yuming Wang created SPARK-45848:
-----------------------------------

             Summary: spark-build-info.ps1 missing the docroot property
                 Key: SPARK-45848
                 URL: https://issues.apache.org/jira/browse/SPARK-45848
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 4.0.0
            Reporter: Yuming Wang

https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44
[jira] [Updated] (SPARK-45755) Push down limit through Dataset.isEmpty()
[ https://issues.apache.org/jira/browse/SPARK-45755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45755:
--------------------------------
    Description:
Pushing down LocalLimit cannot optimize the distinct case.
{code:scala}
def isEmpty: Boolean = withAction("isEmpty", withTypedPlan {
  LocalLimit(Literal(1), select().logicalPlan)
}.queryExecution) { plan =>
  plan.executeTake(1).isEmpty
}
{code}

> Push down limit through Dataset.isEmpty()
> -----------------------------------------
>
>                 Key: SPARK-45755
>                 URL: https://issues.apache.org/jira/browse/SPARK-45755
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Pushing down LocalLimit cannot optimize the distinct case.
> {code:scala}
> def isEmpty: Boolean = withAction("isEmpty", withTypedPlan {
>   LocalLimit(Literal(1), select().logicalPlan)
> }.queryExecution) { plan =>
>   plan.executeTake(1).isEmpty
> }
> {code}
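A sketch of the limitation described above (made-up data): `distinct()` becomes an Aggregate node in the plan, and the `LocalLimit(1)` that `isEmpty` places on top of it is not pushed below the aggregation, so the full distinct still runs even though a single output row would be enough to answer the question.

```scala
// Hypothetical sketch: isEmpty plans LocalLimit(1, Aggregate(...)),
// and the limit is not pushed below the Aggregate that implements
// distinct(), so the whole aggregation executes.
val ds = spark.range(0, 1000000).selectExpr("id % 10 as k").distinct()
ds.isEmpty
```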
[jira] [Created] (SPARK-45755) Push down limit through Dataset.isEmpty()
Yuming Wang created SPARK-45755:
-----------------------------------

             Summary: Push down limit through Dataset.isEmpty()
                 Key: SPARK-45755
                 URL: https://issues.apache.org/jira/browse/SPARK-45755
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45658:
--------------------------------
    Target Version/s:   (was: 3.5.1)

> Canonicalization of DynamicPruningSubquery is broken
> ----------------------------------------------------
>
>                 Key: SPARK-45658
>                 URL: https://issues.apache.org/jira/browse/SPARK-45658
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0, 3.5.1
>            Reporter: Asif
>            Priority: Major
>
> The canonicalization of buildKeys: Seq[Expression] in the class DynamicPruningSubquery is broken, because the buildKeys are canonicalized just by calling
> buildKeys.map(_.canonicalized)
> This results in incorrect canonicalization, because it does not normalize the exprIds relative to the buildQuery output.
> The fix is to use the buildQuery: LogicalPlan's output to normalize the buildKeys expressions, using the standard approach:
> buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))
> Will be filing a PR and bug test for the same.
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45658:
--------------------------------
    Affects Version/s:   (was: 3.5.1)

> Canonicalization of DynamicPruningSubquery is broken
> ----------------------------------------------------
>
>                 Key: SPARK-45658
>                 URL: https://issues.apache.org/jira/browse/SPARK-45658
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Asif
>            Priority: Major
>
> The canonicalization of buildKeys: Seq[Expression] in the class DynamicPruningSubquery is broken, because the buildKeys are canonicalized just by calling
> buildKeys.map(_.canonicalized)
> This results in incorrect canonicalization, because it does not normalize the exprIds relative to the buildQuery output.
> The fix is to use the buildQuery: LogicalPlan's output to normalize the buildKeys expressions, using the standard approach:
> buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))
> Will be filing a PR and bug test for the same.
[jira] [Commented] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1591#comment-1591 ]

Yuming Wang commented on SPARK-43851:
-------------------------------------

The resolution should be unresolved.

> Support LCA in grouping expressions
> -----------------------------------
>
>                 Key: SPARK-43851
>                 URL: https://issues.apache.org/jira/browse/SPARK-43851
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Teradata supports it:
> {code:sql}
> create table t1(a int) using parquet;
> select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2;
> {code}
> {noformat}
> [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not supported: Referencing a lateral column alias via GROUP BY alias/ALL is not supported yet.
> {noformat}
[jira] [Reopened] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang reopened SPARK-43851:
---------------------------------
    Assignee:   (was: Jia Fan)

> Support LCA in grouping expressions
> -----------------------------------
>
>                 Key: SPARK-43851
>                 URL: https://issues.apache.org/jira/browse/SPARK-43851
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yuming Wang
>            Priority: Major
>             Fix For: 3.5.0
>
> Teradata supports it:
> {code:sql}
> create table t1(a int) using parquet;
> select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2;
> {code}
> {noformat}
> [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not supported: Referencing a lateral column alias via GROUP BY alias/ALL is not supported yet.
> {noformat}
[jira] [Updated] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-43851:
--------------------------------
    Fix Version/s:   (was: 3.5.0)

> Support LCA in grouping expressions
> -----------------------------------
>
>                 Key: SPARK-43851
>                 URL: https://issues.apache.org/jira/browse/SPARK-43851
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Teradata supports it:
> {code:sql}
> create table t1(a int) using parquet;
> select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2;
> {code}
> {noformat}
> [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not supported: Referencing a lateral column alias via GROUP BY alias/ALL is not supported yet.
> {noformat}
[jira] [Updated] (SPARK-45454) Set the table's default owner to current_user
[ https://issues.apache.org/jira/browse/SPARK-45454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45454:
--------------------------------
        Parent:   (was: SPARK-30016)
    Issue Type: Improvement  (was: Sub-task)

> Set the table's default owner to current_user
> ---------------------------------------------
>
>                 Key: SPARK-45454
>                 URL: https://issues.apache.org/jira/browse/SPARK-45454
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
>
[jira] [Updated] (SPARK-45454) Set the table's default owner to current_user
[ https://issues.apache.org/jira/browse/SPARK-45454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45454:
--------------------------------
    Summary: Set the table's default owner to current_user  (was: Set owner of DS v2 table to CURRENT_USER if it is already set)

> Set the table's default owner to current_user
> ---------------------------------------------
>
>                 Key: SPARK-45454
>                 URL: https://issues.apache.org/jira/browse/SPARK-45454
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
>
[jira] [Created] (SPARK-45454) Set owner of DS v2 table to CURRENT_USER if it is already set
Yuming Wang created SPARK-45454:
-----------------------------------

             Summary: Set owner of DS v2 table to CURRENT_USER if it is already set
                 Key: SPARK-45454
                 URL: https://issues.apache.org/jira/browse/SPARK-45454
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
[jira] [Updated] (SPARK-45387) Partition key filter cannot be pushed down when using cast
[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45387:
--------------------------------
    Target Version/s:   (was: 3.1.1, 3.3.0)

> Partition key filter cannot be pushed down when using cast
> ----------------------------------------------------------
>
>                 Key: SPARK-45387
>                 URL: https://issues.apache.org/jira/browse/SPARK-45387
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0
>            Reporter: TianyiMa
>            Priority: Critical
>
> Suppose we have a partitioned table `table_pt` with a partition column `dt`, which is StringType, and the table metadata is managed by Hive Metastore. If we filter partitions by dt = '123', the filter can be pushed down to the data source. But if the filter value is a number, e.g. dt = 123, it cannot be pushed down, causing Spark to pull all of that table's partition metadata to the client. This performs poorly if the table has thousands of partitions, and it increases the risk of a Hive Metastore OOM.
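A sketch of the behavior the ticket describes (the table schema below is an assumption based on the description): with a STRING partition column, a string literal prunes partitions, while a numeric literal wraps the column in a cast and defeats pushdown to the metastore.

```scala
// Hypothetical sketch; table_pt mirrors the table described in the ticket.
spark.sql("create table table_pt(id bigint) partitioned by (dt string) stored as parquet")

// String literal: the partition filter is pushed to the Hive metastore
// and only matching partitions are listed.
spark.sql("select * from table_pt where dt = '123'").explain(true)

// Numeric literal: the comparison becomes cast(dt as int) = 123, the
// partition filter is not pushed down, and all partition metadata is
// fetched to the client before filtering.
spark.sql("select * from table_pt where dt = 123").explain(true)
```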
[jira] [Created] (SPARK-45369) Push down limit through generate
Yuming Wang created SPARK-45369:
-----------------------------------

             Summary: Push down limit through generate
                 Key: SPARK-45369
                 URL: https://issues.apache.org/jira/browse/SPARK-45369
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
[jira] [Commented] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768399#comment-17768399 ]

Yuming Wang commented on SPARK-45282:
-------------------------------------

cc [~ulysses] [~cloud_fan]

> Join loses records for cached datasets
> --------------------------------------
>
>                 Key: SPARK-45282
>                 URL: https://issues.apache.org/jira/browse/SPARK-45282
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1, 3.5.0
>         Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or databricks 13.3
>            Reporter: koert kuipers
>            Priority: Major
>              Labels: CorrectnessBug, correctness
>
> We observed this issue on Spark 3.4.1, and it is also present on 3.5.0. It is not present on Spark 3.3.1.
> It only shows up in a distributed environment; I cannot replicate it in a unit test. However, I did get it to show up on a Hadoop cluster, on Kubernetes, and on Databricks 13.3.
> The issue is that records are dropped when two cached dataframes are joined. It seems that in the Spark 3.4.1 query plan some Exchanges are dropped as an optimization, while in Spark 3.3.1 these Exchanges are still present. It seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true.
> To reproduce on a distributed cluster, these settings are needed:
> {code:java}
> spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432
> spark.sql.adaptive.coalescePartitions.parallelismFirst false
> spark.sql.adaptive.enabled true
> spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true
> {code}
> Scala code to reproduce:
> {code:java}
> import java.util.UUID
> import org.apache.spark.sql.functions.col
> import spark.implicits._
>
> val data = (1 to 1000000).toDS().map(i => UUID.randomUUID().toString).persist()
> val left = data.map(k => (k, 1))
> val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works!
> println("number of left " + left.count())
> println("number of right " + right.count())
> println("number of (left join right) " +
>   left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count())
>
> val left1 = left
>   .toDF("key", "value1")
>   .repartition(col("key")) // comment out this line to make it work
>   .persist()
> println("number of left1 " + left1.count())
>
> val right1 = right
>   .toDF("key", "value2")
>   .repartition(col("key")) // comment out this line to make it work
>   .persist()
> println("number of right1 " + right1.count())
>
> println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result
> {code}
> This produces the following output:
> {code:java}
> number of left 1000000
> number of right 1000000
> number of (left join right) 1000000
> number of left1 1000000
> number of right1 1000000
> number of (left1 join right1) 859531
> {code}
> Note that the last (incorrect) number varies depending on settings, cluster size, etc.
[jira] [Resolved] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-43406.
---------------------------------
    Resolution: Duplicate

> enable spark sql to drop multiple partitions in one call
> --------------------------------------------------------
>
>                 Key: SPARK-43406
>                 URL: https://issues.apache.org/jira/browse/SPARK-43406
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1, 3.3.2, 3.4.0
>            Reporter: chenruotao
>            Priority: Major
>
> Spark SQL currently cannot drop multiple partitions in one call. With this patch we can drop multiple partitions like this:
> alter table test.table_partition drop partition(dt<='2023-04-02', dt>='2023-03-31')
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-43406:
--------------------------------
    Target Version/s:   (was: 4.0.0)

> enable spark sql to drop multiple partitions in one call
> --------------------------------------------------------
>
>                 Key: SPARK-43406
>                 URL: https://issues.apache.org/jira/browse/SPARK-43406
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1, 3.3.2, 3.4.0
>            Reporter: chenruotao
>            Priority: Major
>
> Spark SQL currently cannot drop multiple partitions in one call. With this patch we can drop multiple partitions like this:
> alter table test.table_partition drop partition(dt<='2023-04-02', dt>='2023-03-31')
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-43406:
--------------------------------
    Fix Version/s:   (was: 3.5.0)

> enable spark sql to drop multiple partitions in one call
> --------------------------------------------------------
>
>                 Key: SPARK-43406
>                 URL: https://issues.apache.org/jira/browse/SPARK-43406
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1, 3.3.2, 3.4.0
>            Reporter: chenruotao
>            Priority: Major
>
> Spark SQL currently cannot drop multiple partitions in one call. With this patch we can drop multiple partitions like this:
> alter table test.table_partition drop partition(dt<='2023-04-02', dt>='2023-03-31')
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-43406:
--------------------------------
    Target Version/s: 4.0.0

> enable spark sql to drop multiple partitions in one call
> --------------------------------------------------------
>
>                 Key: SPARK-43406
>                 URL: https://issues.apache.org/jira/browse/SPARK-43406
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1, 3.3.2, 3.4.0
>            Reporter: chenruotao
>            Priority: Major
>
> Spark SQL currently cannot drop multiple partitions in one call. With this patch we can drop multiple partitions like this:
> alter table test.table_partition drop partition(dt<='2023-04-02', dt>='2023-03-31')
[jira] [Resolved] (SPARK-45089) Remove obsolete repo of DB2 JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-45089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-45089.
---------------------------------
    Fix Version/s: 4.0.0
         Assignee: Cheng Pan
       Resolution: Fixed

Issue resolved by pull request 42820
https://github.com/apache/spark/pull/42820

> Remove obsolete repo of DB2 JDBC driver
> ---------------------------------------
>
>                 Key: SPARK-45089
>                 URL: https://issues.apache.org/jira/browse/SPARK-45089
>             Project: Spark
>          Issue Type: Test
>          Components: Build, Tests
>    Affects Versions: 4.0.0
>            Reporter: Cheng Pan
>            Assignee: Cheng Pan
>            Priority: Major
>             Fix For: 4.0.0
>
[jira] [Updated] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45071: Fix Version/s: 3.5.1 (was: 3.5.0) > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > Fix For: 3.4.2, 4.0.0, 3.5.1 > > > Since `BinaryArithmetic#dataType` recursively processes the data type of > each node, the driver becomes very slow when many columns are processed. > For example, the following code: > {code:scala} > import org.apache.spark.sql.Row > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.{expr, sum} > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // generate a new column by adding up the other 30 columns > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > {code} > > This code takes a few minutes for the driver to execute on Spark 3.4, but only a few seconds on Spark 3.2. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-45071. - Fix Version/s: 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42804 [https://github.com/apache/spark/pull/42804] > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > Fix For: 3.5.0, 4.0.0, 3.4.2 > > > Since `BinaryArithmetic#dataType` recursively processes the data type of > each node, the driver becomes very slow when many columns are processed. > For example, the following code: > {code:scala} > import org.apache.spark.sql.Row > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.{expr, sum} > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // generate a new column by adding up the other 30 columns > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > {code} > > This code takes a few minutes for the driver to execute on Spark 3.4, but only a few seconds on Spark 3.2. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-45071: --- Assignee: ming95 > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > > Since `BinaryArithmetic#dataType` recursively processes the data type of > each node, the driver becomes very slow when many columns are processed. > For example, the following code: > {code:scala} > import org.apache.spark.sql.Row > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.{expr, sum} > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // generate a new column by adding up the other 30 columns > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > {code} > > This code takes a few minutes for the driver to execute on Spark 3.4, but only a few seconds on Spark 3.2. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
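The slowdown described in SPARK-45071 comes from re-deriving types over the whole expression subtree on every `dataType` call. A minimal sketch, assuming a simplified stand-in for Catalyst expression trees (not Spark code), of how memoizing the derived type (Scala's `lazy val`) changes the number of type computations when the analyzer repeatedly asks for the root's type:

```python
# Sketch: uncached vs memoized recursive data-type derivation
# over a left-deep chain col1 + col2 + ... (hypothetical model).

class Expr:
    computations = 0  # global counter of type derivations

    def __init__(self, left=None, right=None):
        self.left, self.right = left, right
        self._cached_type = None

    def data_type(self, memoize):
        if memoize and self._cached_type is not None:
            return self._cached_type          # lazy-val-style reuse
        Expr.computations += 1
        if self.left is None:                 # leaf column
            t = "int"
        else:                                 # binary arithmetic node
            lt = self.left.data_type(memoize)
            rt = self.right.data_type(memoize)
            t = lt if lt == rt else "double"
        self._cached_type = t
        return t

def chain(n):
    """Build a left-deep chain of n additions over n + 1 leaf columns."""
    e = Expr()
    for _ in range(n):
        e = Expr(e, Expr())
    return e

# Simulate many analysis passes each asking for the root's type.
for memoize in (False, True):
    Expr.computations = 0
    root = chain(30)                 # 30 Adds + 31 leaves = 61 nodes
    for _ in range(100):
        root.data_type(memoize)
    print(memoize, Expr.computations)
# prints: False 6100, then True 61
```

Without memoization every call re-walks all 61 nodes, so cost scales with passes times tree size; with it, each node derives its type exactly once.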
[jira] [Updated] (SPARK-45020) org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'default' not found (state=08S01,code=0)
[ https://issues.apache.org/jira/browse/SPARK-45020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45020: Fix Version/s: (was: 3.1.0) > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database > 'default' not found (state=08S01,code=0) > - > > Key: SPARK-45020 > URL: https://issues.apache.org/jira/browse/SPARK-45020 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Sruthi Mooriyathvariam >Priority: Minor > > There is an alert that fires up when a Spark 3.1 cluster is created using > shared metastore with Spark 2.4. The alert says DefaultDatabase does not > exist. This is misleading and thus we need to suppress this alert. > In the class SessionCatalog.scala, the method requireDbExists() is not > handling the case when the db = defaultDB. This needs to be added to suppress > this misleading alert. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44846) PushFoldableIntoBranches in complex grouping expressions may cause bindReference error
[ https://issues.apache.org/jira/browse/SPARK-44846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44846: --- Assignee: zhuml > PushFoldableIntoBranches in complex grouping expressions may cause > bindReference error > -- > > Key: SPARK-44846 > URL: https://issues.apache.org/jira/browse/SPARK-44846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: zhuml >Assignee: zhuml >Priority: Major > > SQL: > {code:java} > select c*2 as d from > (select if(b > 1, 1, b) as c from > (select if(a < 0, 0 ,a) as b from t group by b) t1 > group by c) t2 {code} > ERROR: > {code:java} > Couldn't find _groupingexpression#15 in [if ((_groupingexpression#15 > 1)) 1 > else _groupingexpression#15#16] > java.lang.IllegalStateException: Couldn't find _groupingexpression#15 in [if > ((_groupingexpression#15 > 1)) 1 else _groupingexpression#15#16] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren(TreeNode.scala:1272) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren$(TreeNode.scala:1271) > at > org.apache.spark.sql.catalyst.expressions.If.mapChildren(conditionalExpressions.scala:41) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1215) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1214) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:533) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:405) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:94) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:360) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538) > at > org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce(AggregateCodegenSupport.scala:69) > at > org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce$(AggregateCodegenSupport.scala:65) > at > 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:49) > at > org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) > at > org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:92) > at > org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:92) > at
[jira] [Resolved] (SPARK-44846) PushFoldableIntoBranches in complex grouping expressions may cause bindReference error
[ https://issues.apache.org/jira/browse/SPARK-44846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44846. - Fix Version/s: 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42633 [https://github.com/apache/spark/pull/42633] > PushFoldableIntoBranches in complex grouping expressions may cause > bindReference error > -- > > Key: SPARK-44846 > URL: https://issues.apache.org/jira/browse/SPARK-44846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: zhuml >Assignee: zhuml >Priority: Major > Fix For: 3.5.0, 4.0.0, 3.4.2 > > > SQL: > {code:java} > select c*2 as d from > (select if(b > 1, 1, b) as c from > (select if(a < 0, 0 ,a) as b from t group by b) t1 > group by c) t2 {code} > ERROR: > {code:java} > Couldn't find _groupingexpression#15 in [if ((_groupingexpression#15 > 1)) 1 > else _groupingexpression#15#16] > java.lang.IllegalStateException: Couldn't find _groupingexpression#15 in [if > ((_groupingexpression#15 > 1)) 1 else _groupingexpression#15#16] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren(TreeNode.scala:1272) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren$(TreeNode.scala:1271) > at > org.apache.spark.sql.catalyst.expressions.If.mapChildren(conditionalExpressions.scala:41) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1215) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1214) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:533) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:405) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:94) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:360) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538) > at > org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce(AggregateCodegenSupport.scala:69) > at > 
org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce$(AggregateCodegenSupport.scala:65) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:49) > at > org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) >
[jira] [Updated] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44892: Fix Version/s: (was: 4.0.0) > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44892. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 54 [https://github.com/apache/spark-docker/pull/54] > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44892: --- Assignee: Yuming Wang > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
Yuming Wang created SPARK-44892: --- Summary: Add official image Dockerfile for Spark 3.3.3 Key: SPARK-44892 URL: https://issues.apache.org/jira/browse/SPARK-44892 Project: Spark Issue Type: Sub-task Components: Spark Docker Affects Versions: 3.3.3 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44813) The JIRA Python misses our assignee when it searches user again
[ https://issues.apache.org/jira/browse/SPARK-44813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44813: Fix Version/s: 3.3.4 (was: 3.3.3) > The JIRA Python misses our assignee when it searches user again > --- > > Key: SPARK-44813 > URL: https://issues.apache.org/jira/browse/SPARK-44813 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > > {code:python} > >>> assignee = asf_jira.user("yao") > >>> "SPARK-44801" > 'SPARK-44801' > >>> asf_jira.assign_issue(issue.key, assignee.name) > response text = {"errorMessages":[],"errors":{"assignee":"User 'airhot' > cannot be assigned issues."}} {code} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44857) Fix getBaseURI error in Spark Worker LogPage UI buttons
[ https://issues.apache.org/jira/browse/SPARK-44857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44857: Fix Version/s: 3.3.4 (was: 3.3.3) > Fix getBaseURI error in Spark Worker LogPage UI buttons > --- > > Key: SPARK-44857 > URL: https://issues.apache.org/jira/browse/SPARK-44857 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.2.0, 3.2.4, 3.3.2, 3.4.1, 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > Attachments: Screenshot 2023-08-17 at 2.38.45 PM.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info
[ https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44880: --- Assignee: Kent Yao > Remove unnecessary curly braces at the end of the thread locks info > --- > > Key: SPARK-44880 > URL: https://issues.apache.org/jira/browse/SPARK-44880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Remove unnecessary curly braces at the end of the thread locks info -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info
[ https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44880. - Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42571 [https://github.com/apache/spark/pull/42571] > Remove unnecessary curly braces at the end of the thread locks info > --- > > Key: SPARK-44880 > URL: https://issues.apache.org/jira/browse/SPARK-44880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Remove unnecessary curly braces at the end of the thread locks info -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info
[ https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44880: Fix Version/s: 3.5.1 (was: 3.5.0) > Remove unnecessary curly braces at the end of the thread locks info > --- > > Key: SPARK-44880 > URL: https://issues.apache.org/jira/browse/SPARK-44880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 4.0.0, 3.5.1 > > > Remove unnecessary curly braces at the end of the thread locks info -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44792. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42474 [https://github.com/apache/spark/pull/42474] > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > > https://issues.apache.org/jira/browse/HADOOP-17612 > https://issues.apache.org/jira/browse/HADOOP-18515 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44792: --- Assignee: Yuming Wang > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HADOOP-17612 > https://issues.apache.org/jira/browse/HADOOP-18515 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44792: Description: https://issues.apache.org/jira/browse/HADOOP-17612 https://issues.apache.org/jira/browse/HADOOP-18515 was:https://issues.apache.org/jira/browse/HADOOP-17612 > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HADOOP-17612 > https://issues.apache.org/jira/browse/HADOOP-18515 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44792: Description: https://issues.apache.org/jira/browse/HADOOP-17612 > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HADOOP-17612 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44792) Upgrade curator to 5.2.0
Yuming Wang created SPARK-44792: --- Summary: Upgrade curator to 5.2.0 Key: SPARK-44792 URL: https://issues.apache.org/jira/browse/SPARK-44792 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44700: Fix Version/s: 3.3.0 > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: jiahong.li >Priority: Minor > Fix For: 3.3.0 > > > _SQL_ like the following: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > If the rule OptimizeCsvJsonExprs is applied, the regexp_replace expression > will be invoked many times, which is very costly. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44700. - Resolution: Fixed Please upgrade Spark to the latest version to fix this issue. > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like the following: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > If the rule OptimizeCsvJsonExprs is applied, the regexp_replace expression > will be invoked many times, which is very costly. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44700: Affects Version/s: 3.1.1 (was: 3.4.0) (was: 3.4.1) > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like the following: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > If the rule OptimizeCsvJsonExprs is applied, the regexp_replace expression > will be invoked many times, which is very costly. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
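The cost blow-up behind SPARK-44700 can be sketched without Spark: when a rewrite turns one `from_json(expensive_input)` followed by a projection into a per-field parse, the expensive inner expression runs once per referenced field instead of once per row. A minimal sketch, assuming a simplified model (plain Python `json`/`re` standing in for Catalyst expressions, and a simpler substitution than the lookbehind regex in the issue):

```python
import json
import re

CALLS = {"regexp_replace": 0}  # count evaluations of the inner expression

def expensive_input(raw):
    """Stand-in for regexp_replace(device_personas, ...)."""
    CALLS["regexp_replace"] += 1
    return re.sub(r'"device_', '"user_device_', raw)

def parse_all_fields(raw, fields):
    """Original plan: evaluate the input once, parse once, project fields."""
    parsed = json.loads(expensive_input(raw))
    return {f: parsed[f] for f in fields}

def parse_per_field(raw, fields):
    """After the undesirable rewrite: input re-evaluated for every field."""
    return {f: json.loads(expensive_input(raw))[f] for f in fields}

row = '{"device_id": 1, "device_os": "x", "device_ver": 2}'
fields = ["user_device_id", "user_device_os", "user_device_ver"]

parse_all_fields(row, fields)
print(CALLS["regexp_replace"])   # 1 evaluation
parse_per_field(row, fields)
print(CALLS["regexp_replace"])   # 1 + 3 = 4 evaluations
```

With a schema of 100+ fields, as in the report, the rewritten form multiplies the regex cost by the number of fields per row, which is why the rule should skip such inputs.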
[jira] [Commented] (SPARK-24087) Avoid shuffle when join keys are a super-set of bucket keys
[ https://issues.apache.org/jira/browse/SPARK-24087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752239#comment-17752239 ] Yuming Wang commented on SPARK-24087: - Fixed by SPARK-35703. > Avoid shuffle when join keys are a super-set of bucket keys > --- > > Key: SPARK-24087 > URL: https://issues.apache.org/jira/browse/SPARK-24087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752023#comment-17752023 ] Yuming Wang commented on SPARK-44719: - There are two ways to fix it: 1. Upgrade the built-in hive to 2.3.10 with the following patch. 2. Revert SPARK-43225. https://github.com/apache/hive/pull/4562 https://github.com/apache/hive/pull/4563 https://github.com/apache/hive/pull/4564 > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > {noformat} > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44719: Description: How to reproduce: {noformat} spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; Time taken: 0.413 seconds spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP'; Time taken: 0.038 seconds spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)] java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) ... {noformat} was: How to reproduce: ``` spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; Time taken: 0.413 seconds spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP'; Time taken: 0.038 seconds spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)] java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) ... 
``` > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > {noformat} > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44719: Attachment: HiveUDFs-1.0-SNAPSHOT.jar > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > ``` > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44719) NoClassDefFoundError when using Hive UDF
Yuming Wang created SPARK-44719: --- Summary: NoClassDefFoundError when using Hive UDF Key: SPARK-44719 URL: https://issues.apache.org/jira/browse/SPARK-44719 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 3.5.0 Reporter: Yuming Wang Attachments: HiveUDFs-1.0-SNAPSHOT.jar How to reproduce: ``` spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; Time taken: 0.413 seconds spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP'; Time taken: 0.038 seconds spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)] java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) ... ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-42500. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42038 [https://github.com/apache/spark/pull/42038] > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Tongwei >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-42500: --- Assignee: Tongwei > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Tongwei >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44662: Target Version/s: (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.3 >Reporter: Asif >Priority: Major > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. 
> This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. 
*Single Row Filtration* > 5) In the case of nested broadcast joins, if the datasource is column-vector > oriented, then what Spark gets is a ColumnarBatch. But because scans > have filters from multiple joins, these can be retrieved and applied in the > code generated at the ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastHashJoins (whose keys have been pushed) will be used for join > evaluation. > The code is already there; a PR will be opened. For a non-partitioned-table > TPCDS run on a laptop with TPCDS data at scale factor 4, I am seeing a > 15% gain. > For partitioned-table TPCDS, there is an improvement in 4-5 queries to the tune > of 10% to 37%. > h2. *Q5. Who cares? If you are successful, what difference will it make?* > If use cases involve
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44662: Fix Version/s: (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.3 >Reporter: Asif >Priority: Major > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. 
> This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. 
*Single Row Filtration* > 5) In the case of nested broadcast joins, if the datasource is column-vector > oriented, then what Spark gets is a ColumnarBatch. But because scans > have filters from multiple joins, these can be retrieved and applied in the > code generated at the ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastHashJoins (whose keys have been pushed) will be used for join > evaluation. > The code is already there; a PR will be opened. For a non-partitioned-table > TPCDS run on a laptop with TPCDS data at scale factor 4, I am seeing a > 15% gain. > For partitioned-table TPCDS, there is an improvement in 4-5 queries to the tune > of 10% to 37%. > h2. *Q5. Who cares? If you are successful, what difference will it make?* > If use cases involve m
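The pruning idea in the SPIP above — broadcast-side join keys arriving as a sorted set and being checked against per-chunk min/max statistics — can be sketched in a few lines. This is a conceptual sketch under stated assumptions, not the proposed Spark implementation: integer keys, and each chunk (manifest file, Parquet row group, etc.) exposing (min, max) stats.

```python
# A chunk can be skipped when no broadcast key falls inside its
# [min, max] range; a sorted key list makes the check a binary search.
import bisect

def chunk_may_match(sorted_keys: list, lo: int, hi: int) -> bool:
    """True iff some key k satisfies lo <= k <= hi."""
    i = bisect.bisect_left(sorted_keys, lo)  # first key >= lo
    return i < len(sorted_keys) and sorted_keys[i] <= hi

broadcast_keys = sorted({15, 42, 97})        # keys from the broadcast side
chunks = [(0, 10), (11, 20), (50, 90), (95, 100)]  # (min, max) stats

kept = [c for c in chunks if chunk_may_match(broadcast_keys, *c)]
print(kept)  # [(11, 20), (95, 100)]
```

The same check can run at the driver level against manifest stats and again at the executor level against row-group stats, which is why the SPIP expects pruning benefits at both layers.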
[jira] [Assigned] (SPARK-44675) Increase ReservedCodeCacheSize for release build
[ https://issues.apache.org/jira/browse/SPARK-44675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44675: --- Assignee: Yuming Wang > Increase ReservedCodeCacheSize for release build > > > Key: SPARK-44675 > URL: https://issues.apache.org/jira/browse/SPARK-44675 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44675) Increase ReservedCodeCacheSize for release build
[ https://issues.apache.org/jira/browse/SPARK-44675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44675. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42344 [https://github.com/apache/spark/pull/42344] > Increase ReservedCodeCacheSize for release build > > > Key: SPARK-44675 > URL: https://issues.apache.org/jira/browse/SPARK-44675 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44675) Increase ReservedCodeCacheSize for release build
Yuming Wang created SPARK-44675: --- Summary: Increase ReservedCodeCacheSize for release build Key: SPARK-44675 URL: https://issues.apache.org/jira/browse/SPARK-44675 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44654) In subquery cannot perform partition pruning
[ https://issues.apache.org/jira/browse/SPARK-44654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750739#comment-17750739 ] Yuming Wang commented on SPARK-44654: - Another way is to convert the join to a filter if the maximum number of rows on one side is 1: https://github.com/apache/spark/pull/42114 > In subquery cannot perform partition pruning > > > Key: SPARK-44654 > URL: https://issues.apache.org/jira/browse/SPARK-44654 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: 7mming7 >Priority: Minor > Labels: performance > Attachments: image-2023-08-03-17-22-53-981.png > > > The following SQL cannot perform partition pruning > {code:java} > SELECT * FROM parquet_part WHERE id_type in (SELECT max(id_type) from > parquet_part){code} > As can be seen from the execution plan below, partition pruning of the left side > cannot be performed after the IN subquery is converted into a join > !image-2023-08-03-17-22-53-981.png! > This issue proposes to optimize IN subqueries: the subquery should be > converted into a join only when the number of values exceeds a threshold -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
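The join-to-filter idea from the comment on SPARK-44654 can be illustrated outside Spark. This is a plain-Python sketch, not Spark's optimizer: when the subquery side is guaranteed to produce at most one row (a global max here), `x IN (subquery)` can be evaluated as an equality filter, and a filter on the partition column prunes partitions without scanning their data. The partition layout and names are assumptions for illustration.

```python
# Partition value of id_type -> row payloads. Pruning inspects only
# the partition values, never the payloads.
partitions = {
    "t1": ["row-a", "row-b"],
    "t2": ["row-c"],
    "t3": ["row-d"],
}

# Subquery side: SELECT max(id_type) FROM parquet_part -> exactly one value.
max_id_type = max(partitions)

# As a join, every partition would be scanned to match keys; as a filter
# on the partition key, only matching partitions survive.
pruned = [p for p in partitions if p == max_id_type]
print(pruned)  # ['t3']
```

The linked PR applies the same reasoning inside the optimizer: a join whose one side has `maxRows == 1` carries enough information to be rewritten into a prunable predicate.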
[jira] [Created] (SPARK-44651) Make do-release-docker.sh compatible with Mac m2
Yuming Wang created SPARK-44651: --- Summary: Make do-release-docker.sh compatible with Mac m2 Key: SPARK-44651 URL: https://issues.apache.org/jira/browse/SPARK-44651 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yuming Wang How to test: {code:sh} dev/create-release/do-release-docker.sh -d /Users/yumwang/release-spark/output -s docs -n {code} Install python3-dev and build-essential: {code:sh} $APT_INSTALL python-is-python3 python3-pip python3-setuptools python3-dev build-essential {code} {noformat} Collecting grpcio==1.56.0 Downloading grpcio-1.56.0.tar.gz (24.3 MB) || 24.3 MB 6.7 MB/s ERROR: Command errored out with exit status 1: command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-qmfpon02/grpcio/setup.py'"'"'; __file__='"'"'/tmp/pip-install-qmfpon02/grpcio/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-qmfpon02/grpcio/pip-egg-info cwd: /tmp/pip-install-qmfpon02/grpcio/ Complete output (11 lines): Traceback (most recent call last): File "", line 1, in File "/tmp/pip-install-qmfpon02/grpcio/setup.py", line 263, in if check_linker_need_libatomic(): File "/tmp/pip-install-qmfpon02/grpcio/setup.py", line 210, in check_linker_need_libatomic cpp_test = subprocess.Popen(cxx + ['-x', 'c++', '-std=c++14', '-'], File "/usr/lib/python3.8/subprocess.py", line 858, in __init__ self._execute_child(args, executable, preexec_fn, close_fds, File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename) FileNotFoundError: [Errno 2] No such file or directory: 'c++' ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output. ... Could not find . 
This could mean the following: * You're on Ubuntu and haven't run `apt-get install python3-dev`. * You're on RHEL/Fedora and haven't run `yum install python3-devel` or `dnf install python3-devel` (make sure you also have redhat-rpm-config installed) * You're on Mac OS X and the usual Python framework was somehow corrupted (check your environment variables or try re-installing?) * You're on Windows and your Python installation was somehow corrupted (check your environment variables or try re-installing?) {noformat} {noformat} #5 848.0 Successfully built grpcio future #5 848.0 Failed to build pyarrow #5 848.7 ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly {noformat} {noformat} root@c57ec74c8d32:/# $APT_INSTALL r-base r-base-dev Reading package lists... Done Building dependency tree Reading state information... Done Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation: The following packages have unmet dependencies: r-base : Depends: r-base-core (>= 4.3.1-3.2004.0) but it is not going to be installed Depends: r-recommended (= 4.3.1-3.2004.0) but it is not going to be installed r-base-dev : Depends: r-base-core (>= 4.3.1-3.2004.0) but it is not going to be installed E: Unable to correct problems, you have held broken packages. {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38506) Push partial aggregation through join
[ https://issues.apache.org/jira/browse/SPARK-38506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-38506: Description: Please see https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Request-and-Transaction-Processing/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization for more details. (was: Please see https://docs.teradata.com/r/Teradata-VantageTM-SQL-Request-and-Transaction-Processing/March-2019/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization for more details.) > Push partial aggregation through join > - > > Key: SPARK-38506 > URL: https://issues.apache.org/jira/browse/SPARK-38506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Please see > https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Request-and-Transaction-Processing/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44562) Add OptimizeOneRowRelationSubquery in batch of Subquery
[ https://issues.apache.org/jira/browse/SPARK-44562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44562: --- Assignee: Yuming Wang > Add OptimizeOneRowRelationSubquery in batch of Subquery > --- > > Key: SPARK-44562 > URL: https://issues.apache.org/jira/browse/SPARK-44562 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44562) Add OptimizeOneRowRelationSubquery in batch of Subquery
[ https://issues.apache.org/jira/browse/SPARK-44562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44562. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42180 [https://github.com/apache/spark/pull/42180] > Add OptimizeOneRowRelationSubquery in batch of Subquery > --- > > Key: SPARK-44562 > URL: https://issues.apache.org/jira/browse/SPARK-44562 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0
[ https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44598. - Resolution: Not A Problem > spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize > is 0 > -- > > Key: SPARK-44598 > URL: https://issues.apache.org/jira/browse/SPARK-44598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: ming95 >Priority: Major > > We are using Spark to read a Hive table with HBase serde. We found that when the > HBase table data is relatively small (HBase StorefileSize is 0), the data > read by Spark 3.2 or 3.5 is empty, and there is no error message. > But when using Spark 2.4 or Hive to read, the data can be read normally. Other > information shows that Spark 3.1 can also read the data normally; can anyone > provide some ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0
[ https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reopened SPARK-44598: - > spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize > is 0 > -- > > Key: SPARK-44598 > URL: https://issues.apache.org/jira/browse/SPARK-44598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: ming95 >Priority: Major > > We are using Spark to read a Hive table with HBase serde. We found that when the > HBase table data is relatively small (HBase StorefileSize is 0), the data > read by Spark 3.2 or 3.5 is empty, and there is no error message. > But when using Spark 2.4 or Hive to read, the data can be read normally. Other > information shows that Spark 3.1 can also read the data normally; can anyone > provide some ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0
[ https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749418#comment-17749418 ] Yuming Wang commented on SPARK-44598: - How to reproduce this issue? > spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize > is 0 > -- > > Key: SPARK-44598 > URL: https://issues.apache.org/jira/browse/SPARK-44598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: ming95 >Priority: Major > > We are using Spark to read a Hive table with HBase serde. We found that when the > HBase table data is relatively small (HBase StorefileSize is 0), the data > read by Spark 3.2 or 3.5 is empty, and there is no error message. > But when using Spark 2.4 or Hive to read, the data can be read normally. Other > information shows that Spark 3.1 can also read the data normally; can anyone > provide some ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44454) HiveShim getTablesByType support fallback
[ https://issues.apache.org/jira/browse/SPARK-44454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44454: --- Assignee: dzcxzl > HiveShim getTablesByType support fallback > - > > Key: SPARK-44454 > URL: https://issues.apache.org/jira/browse/SPARK-44454 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > > When we use a high version of Hive Client to communicate with a low version > of Hive meta store, we may encounter Invalid method name: > 'get_tables_by_type'. > > {code:java} > 23/07/17 12:45:24,391 [main] DEBUG SparkSqlParser: Parsing command: show views > 23/07/17 12:45:24,489 [main] ERROR log: Got exception: > org.apache.thrift.TApplicationException Invalid method name: > 'get_tables_by_type' > org.apache.thrift.TApplicationException: Invalid method name: > 'get_tables_by_type' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_tables_by_type(ThriftHiveMetastore.java:1433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_tables_by_type(ThriftHiveMetastore.java:1418) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTables(HiveMetaStoreClient.java:1411) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2344) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByType(Hive.java:1427) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.sql.hive.client.Shim_v2_3.getTablesByType(HiveShim.scala:1408) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274) > at > org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:785) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:895) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:108) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listViews(HiveExternalCatalog.scala:893) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listViews(ExternalCatalogWithListener.scala:158) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listViews(SessionCatalog.scala:1040) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.$anonfun$run$5(views.scala:407) > at 
scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.run(views.scala:407) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
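The fallback described in this ticket follows a common pattern: attempt the newer Thrift call, and when an old metastore rejects it with "Invalid method name", fall back to listing all tables and filtering by type on the client side. A minimal Python sketch of that pattern, under stated assumptions (the `client` object and its `get_tables_by_type`/`get_table_objects` methods are hypothetical stand-ins for the Thrift calls in the stack trace; the actual fix lives in Spark's Scala `HiveShim`):

```python
def get_tables_by_type(client, db, pattern, table_type):
    """Fetch table names of a given type, with a fallback for old metastores.

    `client` is a hypothetical metastore client. Only two methods are
    assumed: `get_tables_by_type` (the newer call) and `get_table_objects`
    (the older call that returns all tables with their types).
    """
    try:
        return client.get_tables_by_type(db, pattern, table_type)
    except Exception as e:
        # Old metastores respond with "Invalid method name:
        # 'get_tables_by_type'"; anything else is a real error.
        if "Invalid method name" not in str(e):
            raise
        # Fallback: list all tables and filter by type client-side.
        return [t["name"] for t in client.get_table_objects(db, pattern)
                if t["type"] == table_type]
```

The trade-off is that the fallback fetches every table object for the database, so it is slower against large databases, but it keeps `SHOW VIEWS` working against old metastores.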
[jira] [Resolved] (SPARK-44454) HiveShim getTablesByType support fallback
[ https://issues.apache.org/jira/browse/SPARK-44454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44454. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42033 [https://github.com/apache/spark/pull/42033] > HiveShim getTablesByType support fallback > - > > Key: SPARK-44454 > URL: https://issues.apache.org/jira/browse/SPARK-44454 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 4.0.0 > > > When we use a high version of Hive Client to communicate with a low version > of Hive meta store, we may encounter Invalid method name: > 'get_tables_by_type'. > > {code:java} > 23/07/17 12:45:24,391 [main] DEBUG SparkSqlParser: Parsing command: show views > 23/07/17 12:45:24,489 [main] ERROR log: Got exception: > org.apache.thrift.TApplicationException Invalid method name: > 'get_tables_by_type' > org.apache.thrift.TApplicationException: Invalid method name: > 'get_tables_by_type' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_tables_by_type(ThriftHiveMetastore.java:1433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_tables_by_type(ThriftHiveMetastore.java:1418) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTables(HiveMetaStoreClient.java:1411) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2344) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByType(Hive.java:1427) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.sql.hive.client.Shim_v2_3.getTablesByType(HiveShim.scala:1408) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274) > at > org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:785) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:895) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:108) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listViews(HiveExternalCatalog.scala:893) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listViews(ExternalCatalogWithListener.scala:158) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listViews(SessionCatalog.scala:1040) > at > 
org.apache.spark.sql.execution.command.ShowViewsCommand.$anonfun$run$5(views.scala:407) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.run(views.scala:407) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44513) Upgrade snappy-java to 1.1.10.3
[ https://issues.apache.org/jira/browse/SPARK-44513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44513: Fix Version/s: 3.4.2 > Upgrade snappy-java to 1.1.10.3 > --- > > Key: SPARK-44513 > URL: https://issues.apache.org/jira/browse/SPARK-44513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.4.2, 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44562) Add OptimizeOneRowRelationSubquery in batch of Subquery
Yuming Wang created SPARK-44562: --- Summary: Add OptimizeOneRowRelationSubquery in batch of Subquery Key: SPARK-44562 URL: https://issues.apache.org/jira/browse/SPARK-44562 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44466) Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44466. - Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42049 [https://github.com/apache/spark/pull/42049] > Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX > from modifiedConfigs > > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.5.0, 4.0.0 > > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44466) Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44466: --- Assignee: Yuming Wang > Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX > from modifiedConfigs > > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
[ https://issues.apache.org/jira/browse/SPARK-44527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746488#comment-17746488 ] Yuming Wang commented on SPARK-44527: - https://github.com/apache/spark/pull/42129 > Simplify BinaryComparison if its children contain ScalarSubquery with empty > output > -- > > Key: SPARK-44527 > URL: https://issues.apache.org/jira/browse/SPARK-44527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
Yuming Wang created SPARK-44527: --- Summary: Simplify BinaryComparison if its children contain ScalarSubquery with empty output Key: SPARK-44527 URL: https://issues.apache.org/jira/browse/SPARK-44527 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44523) Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44523: Summary: Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral (was: Filter's maxRows should be 0 if condition is FalseLiteral) > Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral > -- > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
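The idea behind this change is simple to model: a `Filter` whose condition is the literal `false` can never emit a row, so both its `maxRows` and `maxRowsPerPartition` bounds can be tightened to 0 instead of inherited from the child. A toy Python model of that propagation (the class names and fields here are illustrative, not Spark's Catalyst API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relation:
    max_rows: Optional[int]  # None means the bound is unknown


@dataclass
class Filter:
    condition: object  # the Python literal False stands in for FalseLiteral
    child: Relation

    @property
    def max_rows(self) -> Optional[int]:
        # A filter with a literally-false condition emits no rows, so the
        # bound is exactly 0; otherwise the child's bound is the best we know.
        if self.condition is False:
            return 0
        return self.child.max_rows
```

Tightening the bound matters because later rules (such as the single-row join rewrite in SPARK-44514) key off these `maxRows` estimates.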
[jira] [Commented] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746290#comment-17746290 ] Yuming Wang commented on SPARK-44523: - https://github.com/apache/spark/pull/42126 > Filter's maxRows should be 0 if condition is FalseLiteral > - > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
Yuming Wang created SPARK-44523: --- Summary: Filter's maxRows should be 0 if condition is FalseLiteral Key: SPARK-44523 URL: https://issues.apache.org/jira/browse/SPARK-44523 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44514) Optimize join if maximum number of rows on one side is 1
[ https://issues.apache.org/jira/browse/SPARK-44514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44514: Summary: Optimize join if maximum number of rows on one side is 1 (was: Rewrite the join to filter if one side maximum number of rows is 1) > Optimize join if maximum number of rows on one side is 1 > > > Key: SPARK-44514 > URL: https://issues.apache.org/jira/browse/SPARK-44514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
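The optimization this ticket describes exploits a cardinality bound: when one side of a join is known to produce at most one row, the join degenerates into a filter (plus appending that row's values) over the other side, with no pairwise matching needed. A conceptual Python sketch over plain lists, assuming the single-row side is the right side (function and parameter names are illustrative, not Spark's optimizer API):

```python
def join_single_row(left_rows, right_rows, predicate):
    """Inner join where the right side has at most one row.

    Instead of matching every left row against every right row, fetch the
    single right row once and filter the left side with it. `predicate`
    takes (left_row, right_row) and returns a bool; rows are tuples.
    """
    assert len(right_rows) <= 1, "rewrite only valid when maxRows <= 1"
    if not right_rows:
        return []  # inner join with an empty side is empty
    r = right_rows[0]
    # A single pass over the left side: filter, then append r's columns.
    return [l + r for l in left_rows if predicate(l, r)]
```

In a distributed engine the payoff is that the single-row side can be evaluated once and broadcast as literals, so no shuffle of the large side is needed.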
[jira] [Commented] (SPARK-44514) Rewrite the join to filter if one side maximum number of rows is 1
[ https://issues.apache.org/jira/browse/SPARK-44514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746143#comment-17746143 ] Yuming Wang commented on SPARK-44514: - https://github.com/apache/spark/pull/42114 > Rewrite the join to filter if one side maximum number of rows is 1 > -- > > Key: SPARK-44514 > URL: https://issues.apache.org/jira/browse/SPARK-44514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44514) Rewrite the join to filter if one side maximum number of rows is 1
Yuming Wang created SPARK-44514: --- Summary: Rewrite the join to filter if one side maximum number of rows is 1 Key: SPARK-44514 URL: https://issues.apache.org/jira/browse/SPARK-44514 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44466) Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44466: Summary: Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs (was: Update initialSessionOptions to the value after supplementation) > Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX > from modifiedConfigs > > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44493) Extract pushable predicates from disjunctive predicates
[ https://issues.apache.org/jira/browse/SPARK-44493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44493: Attachment: before.png > Extract pushable predicates from disjunctive predicates > --- > > Key: SPARK-44493 > URL: https://issues.apache.org/jira/browse/SPARK-44493 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: after.png, before.png > > > Example: > {code:sql} > select count(*) > from > db.very_large_table > where > session_start_dt between date_sub('2023-07-15', 1) and > date_add('2023-07-16', 1) > and type = 'event' > and date(event_timestamp) between '2023-07-15' and '2023-07-16' > and ( > ( > page_id in (2627, 2835, 2402999) > and -- other predicates > and rdt = 0 > ) or ( > page_id in (2616, 3411350) > and rdt = 0 > ) or ( > page_id = 2403006 > ) or ( > page_id in (2208336, 2356359) > and -- other predicates > and rdt = 0 > ) > ) > {code} > We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, > 2208336, 2356359)}} to datasource. > Before: > After: -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44493) Extract pushable predicates from disjunctive predicates
Yuming Wang created SPARK-44493: --- Summary: Extract pushable predicates from disjunctive predicates Key: SPARK-44493 URL: https://issues.apache.org/jira/browse/SPARK-44493 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang Attachments: after.png, before.png Example: {code:sql} select count(*) from db.very_large_table where session_start_dt between date_sub('2023-07-15', 1) and date_add('2023-07-16', 1) and type = 'event' and date(event_timestamp) between '2023-07-15' and '2023-07-16' and ( ( page_id in (2627, 2835, 2402999) and -- other predicates and rdt = 0 ) or ( page_id in (2616, 3411350) and rdt = 0 ) or ( page_id = 2403006 ) or ( page_id in (2208336, 2356359) and -- other predicates and rdt = 0 ) ) {code} We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, 2208336, 2356359)}} to datasource. Before: After: -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
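The extraction described above relies on a boolean identity: if every branch of an OR constrains the same column to some set of values, then the union of those sets is implied by the whole disjunction and can be pushed to the datasource as an extra IN filter. A simplified Python sketch, where each OR branch is modeled as a dict from column name to its allowed value set (this representation is an assumption for illustration, not Spark's predicate tree):

```python
def extract_pushable_in(branches, column):
    """Extract a pushable IN-list for `column` from an OR of AND-branches.

    `branches`: list of dicts mapping column -> set of allowed values,
    one dict per OR branch. Returns the union of allowed values if every
    branch constrains `column`, else None (no implied predicate exists,
    because one branch would leave the column unconstrained).
    """
    values = set()
    for conjuncts in branches:
        if column not in conjuncts:
            return None
        values |= conjuncts[column]
    return values
```

On the query above, every OR branch constrains `page_id`, so the union IN-list is pushable; `rdt = 0` is not, because the `page_id = 2403006` branch omits it.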
[jira] [Updated] (SPARK-44493) Extract pushable predicates from disjunctive predicates
[ https://issues.apache.org/jira/browse/SPARK-44493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44493: Description: Example: {code:sql} select count(*) from db.very_large_table where session_start_dt between date_sub('2023-07-15', 1) and date_add('2023-07-16', 1) and type = 'event' and date(event_timestamp) between '2023-07-15' and '2023-07-16' and ( ( page_id in (2627, 2835, 2402999) and -- other predicates and rdt = 0 ) or ( page_id in (2616, 3411350) and rdt = 0 ) or ( page_id = 2403006 ) or ( page_id in (2208336, 2356359) and -- other predicates and rdt = 0 ) ) {code} We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, 2208336, 2356359)}} to datasource. Before: !before.png! After: !after.png! was: Example: {code:sql} select count(*) from db.very_large_table where session_start_dt between date_sub('2023-07-15', 1) and date_add('2023-07-16', 1) and type = 'event' and date(event_timestamp) between '2023-07-15' and '2023-07-16' and ( ( page_id in (2627, 2835, 2402999) and -- other predicates and rdt = 0 ) or ( page_id in (2616, 3411350) and rdt = 0 ) or ( page_id = 2403006 ) or ( page_id in (2208336, 2356359) and -- other predicates and rdt = 0 ) ) {code} We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, 2208336, 2356359)}} to datasource. 
Before: After: > Extract pushable predicates from disjunctive predicates > --- > > Key: SPARK-44493 > URL: https://issues.apache.org/jira/browse/SPARK-44493 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: after.png, before.png > > > Example: > {code:sql} > select count(*) > from > db.very_large_table > where > session_start_dt between date_sub('2023-07-15', 1) and > date_add('2023-07-16', 1) > and type = 'event' > and date(event_timestamp) between '2023-07-15' and '2023-07-16' > and ( > ( > page_id in (2627, 2835, 2402999) > and -- other predicates > and rdt = 0 > ) or ( > page_id in (2616, 3411350) > and rdt = 0 > ) or ( > page_id = 2403006 > ) or ( > page_id in (2208336, 2356359) > and -- other predicates > and rdt = 0 > ) > ) > {code} > We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, > 2208336, 2356359)}} to datasource. > Before: > !before.png! > After: > !after.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44493) Extract pushable predicates from disjunctive predicates
[ https://issues.apache.org/jira/browse/SPARK-44493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44493: Attachment: after.png > Extract pushable predicates from disjunctive predicates > --- > > Key: SPARK-44493 > URL: https://issues.apache.org/jira/browse/SPARK-44493 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: after.png, before.png > > > Example: > {code:sql} > select count(*) > from > db.very_large_table > where > session_start_dt between date_sub('2023-07-15', 1) and > date_add('2023-07-16', 1) > and type = 'event' > and date(event_timestamp) between '2023-07-15' and '2023-07-16' > and ( > ( > page_id in (2627, 2835, 2402999) > and -- other predicates > and rdt = 0 > ) or ( > page_id in (2616, 3411350) > and rdt = 0 > ) or ( > page_id = 2403006 > ) or ( > page_id in (2208336, 2356359) > and -- other predicates > and rdt = 0 > ) > ) > {code} > We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, > 2208336, 2356359)}} to datasource. > Before: > After: -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44466) Update initialSessionOptions to the value after supplementation
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744040#comment-17744040 ] Yuming Wang commented on SPARK-44466: - https://github.com/apache/spark/pull/42049 > Update initialSessionOptions to the value after supplementation > --- > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44466) Update initialSessionOptions to the value after supplementation
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44466: Description: Should not include this value: !screenshot-1.png! > Update initialSessionOptions to the value after supplementation > --- > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44466) Update initialSessionOptions to the value after supplementation
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44466: Attachment: screenshot-1.png > Update initialSessionOptions to the value after supplementation > --- > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44466) Update initialSessionOptions to the value after supplementation
Yuming Wang created SPARK-44466: --- Summary: Update initialSessionOptions to the value after supplementation Key: SPARK-44466 URL: https://issues.apache.org/jira/browse/SPARK-44466 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1 Reporter: Yuming Wang Attachments: screenshot-1.png -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743592#comment-17743592 ] Yuming Wang commented on SPARK-44448: - cc [~beliefer] > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{{}rank = 0{}}}, what it is missing is to reset > {{{}currentRankRow = null{}}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized. 
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the > first row in the new partition will try to increment rank, but increment it > by the value of count which is 0, so it happens to work by accident). > Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug as the bug requires multiple rows per rank. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
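The described bug and fix are easy to model outside Spark: a dense-rank group-limit iterator keeps two pieces of state, the current rank and the last ordering value seen, and `reset` must clear both at each partition boundary. A minimal Python model of the *fixed* behavior, using the repro data from the ticket (class and function names here are illustrative, not Spark's `DenseRankLimitIterator` itself):

```python
class DenseRankLimit:
    """Model of a dense_rank <= k group-limit iterator (fixed version)."""

    def __init__(self, limit):
        self.limit = limit
        self.rank = 0
        self.current_row = None  # last ordering value seen in this partition

    def reset(self):
        # The fix: clear BOTH fields at a new window partition. The buggy
        # version cleared only `rank`, so the first row of a new partition
        # was compared against the last row of the previous one.
        self.rank = 0
        self.current_row = None

    def accept(self, order_value):
        # Dense rank increments only when the ordering value changes.
        if self.current_row is None or order_value != self.current_row:
            self.rank += 1
            self.current_row = order_value
        return self.rank <= self.limit


def top_k_dense_rank(rows, limit):
    """rows: (partition, order) tuples sorted by partition then order.
    Returns the rows whose dense_rank within their partition is <= limit."""
    it = DenseRankLimit(limit)
    out, prev_p = [], object()
    for p, o in rows:
        if p != prev_p:
            it.reset()
            prev_p = p
        if it.accept(o):
            out.append((p, o))
    return out
```

With `reset` clearing both fields, the repro's second partition correctly yields both of its rank-1 rows; dropping the `current_row = None` line reproduces the reported wrong result.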
[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44448: Target Version/s: 3.5.0 > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{{}rank = 0{}}}, what it is missing is to reset > {{{}currentRankRow = null{}}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized. > ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the > first row in the new partition will try to increment rank, but increment it > by the value of count which is 0, so it happens to work by accident). 
> Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug as the bug requires multiple rows per rank. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org