[jira] [Updated] (SPARK-38584) Unify the data validation

2023-03-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-38584:
--
Priority: Major  (was: Minor)

> Unify the data validation
> -
>
> Key: SPARK-38584
> URL: https://issues.apache.org/jira/browse/SPARK-38584
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> 1, input vector validation is missing in most algorithms. When the input 
> dataset contains invalid values (NaN/Infinity):
>  * the training may run successfully and return a model containing invalid 
> coefficients, as with LinearSVC
>  * the training may fail with an irrelevant error message, as with KMeans
>  
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
> val df = sc.parallelize(Seq(
>   LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
>   LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]
> val km = new KMeans().setK(2)
> scala> km.fit(df)
> 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 
> 113)
> java.lang.IllegalArgumentException: requirement failed: Both norms should be 
> greater or equal to 0.0, found norm1=NaN, norm2=Infinity
>     at scala.Predef$.require(Predef.scala:281)
>     at 
> org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
> {code}
>  
> We should make ML algorithms fail fast if the input dataset is invalid (a validation sketch follows after this description).
>  
> 2, there exist several methods for validating input labels and weights, 
> spread across different files:
>  * {{org.apache.spark.ml.functions}}
>  * org.apache.spark.ml.util.DatasetUtils
>  * org.apache.spark.ml.util.MetadataUtils,
>  * org.apache.spark.ml.Predictor
>  * etc.
>  
> I think it is time to unify the related methods into one source file.
>  
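As a rough illustration of the fail-fast idea above, a minimal validation helper could look like the sketch below. This is not an existing Spark method; the helper name and the UDF-based check are assumptions made only for illustration.

{code:java}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// Hypothetical helper: fail fast when the features column contains NaN or
// Infinity, instead of silently producing NaN coefficients during training.
def validateFeatures(df: DataFrame, featuresCol: String = "features"): Unit = {
  val hasInvalid = udf((v: Vector) => v.toArray.exists(d => d.isNaN || d.isInfinite))
  val invalidCount = df.filter(hasInvalid(df(featuresCol))).count()
  require(invalidCount == 0,
    s"Found $invalidCount rows with NaN/Infinity in column '$featuresCol'")
}
{code}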



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42584) Improve output of Column.explain

2023-03-16 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701498#comment-17701498
 ] 

jiaan.geng commented on SPARK-42584:


I will take a look!

> Improve output of Column.explain
> 
>
> Key: SPARK-42584
> URL: https://issues.apache.org/jira/browse/SPARK-42584
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> We currently display the structure of the proto in both the regular and 
> extended versions of explain. We should display a more compact, SQL-like 
> string for the regular version.
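For context, the method in question can be invoked as below (a usage sketch; whether the output shows the proto structure depends on running against Spark Connect, and the column used here is arbitrary):

{code:java}
import org.apache.spark.sql.functions.col

// Prints the explanation of the column expression; `false` requests the
// regular (non-extended) form that this issue proposes to make more compact.
(col("a") + 1).explain(false)
{code}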



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42778) QueryStageExec should respect supportsRowBased

2023-03-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701487#comment-17701487
 ] 

Dongjoon Hyun commented on SPARK-42778:
---

This is backported to branch-3.4 via https://github.com/apache/spark/pull/40417

> QueryStageExec should respect supportsRowBased
> --
>
> Key: SPARK-42778
> URL: https://issues.apache.org/jira/browse/SPARK-42778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42778) QueryStageExec should respect supportsRowBased

2023-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42778:
--
Fix Version/s: 3.4.0
   (was: 3.5.0)

> QueryStageExec should respect supportsRowBased
> --
>
> Key: SPARK-42778
> URL: https://issues.apache.org/jira/browse/SPARK-42778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42778) QueryStageExec should respect supportsRowBased

2023-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42778:
--
Affects Version/s: 3.4.0
   (was: 3.5.0)

> QueryStageExec should respect supportsRowBased
> --
>
> Key: SPARK-42778
> URL: https://issues.apache.org/jira/browse/SPARK-42778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42778) QueryStageExec should respect supportsRowBased

2023-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42778:
-

Assignee: XiDuo You

> QueryStageExec should respect supportsRowBased
> --
>
> Key: SPARK-42778
> URL: https://issues.apache.org/jira/browse/SPARK-42778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42814) Upgrade some maven-plugins

2023-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42814:
-

Assignee: Yang Jie

> Upgrade some maven-plugins
> --
>
> Key: SPARK-42814
> URL: https://issues.apache.org/jira/browse/SPARK-42814
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> maven-enforcer-plugin 3.0.0-M2 -> 3.2.1
>  - [https://github.com/apache/maven-enforcer/releases/tag/enforcer-3.2.1]
>  - [https://github.com/apache/maven-enforcer/releases/tag/enforcer-3.1.0]
> build-helper-maven-plugin 3.2.0 -> 3.3.0
>  - 
> [https://github.com/mojohaus/build-helper-maven-plugin/releases/tag/build-helper-maven-plugin-3.3.0]
> maven-compiler-plugin 3.10.1 -> 3.11.0
>  - 
> [https://github.com/apache/maven-compiler-plugin/releases/tag/maven-compiler-plugin-3.11.0]
> maven-surefire-plugin 3.0.0-M9 -> 3.0.0
>  - [https://github.com/apache/maven-surefire/releases/tag/surefire-3.0.0]
> maven-javadoc-plugin 3.4.1 -> 3.5.0
>  - 
> [https://github.com/apache/maven-javadoc-plugin/releases/tag/maven-javadoc-plugin-3.5.0]
> maven-deploy-plugin 3.0.0 -> 3.1.0
>  - 
> [https://github.com/apache/maven-deploy-plugin/releases/tag/maven-deploy-plugin-3.1.0]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42814) Upgrade some maven-plugins

2023-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42814.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40445
[https://github.com/apache/spark/pull/40445]

> Upgrade some maven-plugins
> --
>
> Key: SPARK-42814
> URL: https://issues.apache.org/jira/browse/SPARK-42814
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>
> maven-enforcer-plugin 3.0.0-M2 -> 3.2.1
>  - [https://github.com/apache/maven-enforcer/releases/tag/enforcer-3.2.1]
>  - [https://github.com/apache/maven-enforcer/releases/tag/enforcer-3.1.0]
> build-helper-maven-plugin 3.2.0 -> 3.3.0
>  - 
> [https://github.com/mojohaus/build-helper-maven-plugin/releases/tag/build-helper-maven-plugin-3.3.0]
> maven-compiler-plugin 3.10.1 -> 3.11.0
>  - 
> [https://github.com/apache/maven-compiler-plugin/releases/tag/maven-compiler-plugin-3.11.0]
> maven-surefire-plugin 3.0.0-M9 -> 3.0.0
>  - [https://github.com/apache/maven-surefire/releases/tag/surefire-3.0.0]
> maven-javadoc-plugin 3.4.1 -> 3.5.0
>  - 
> [https://github.com/apache/maven-javadoc-plugin/releases/tag/maven-javadoc-plugin-3.5.0]
> maven-deploy-plugin 3.0.0 -> 3.1.0
>  - 
> [https://github.com/apache/maven-deploy-plugin/releases/tag/maven-deploy-plugin-3.1.0]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization

2023-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42823:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> spark-sql shell supports multipart namespaces for initialization
> 
>
> Key: SPARK-42823
> URL: https://issues.apache.org/jira/browse/SPARK-42823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization

2023-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42823:
-

Assignee: Kent Yao

> spark-sql shell supports multipart namespaces for initialization
> 
>
> Key: SPARK-42823
> URL: https://issues.apache.org/jira/browse/SPARK-42823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization

2023-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42823.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40457
[https://github.com/apache/spark/pull/40457]

> spark-sql shell supports multipart namespaces for initialization
> 
>
> Key: SPARK-42823
> URL: https://issues.apache.org/jira/browse/SPARK-42823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42832) Remove repartition if it is the child of LocalLimit

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701478#comment-17701478
 ] 

Apache Spark commented on SPARK-42832:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/40462

> Remove repartition if it is the child of LocalLimit
> ---
>
> Key: SPARK-42832
> URL: https://issues.apache.org/jira/browse/SPARK-42832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>
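As a rough illustration (an assumption based only on the issue title, not on the linked pull request), the rule would target plans where a repartition feeds directly into a limit:

{code:java}
// The repartition below reshuffles data that the following limit immediately
// truncates, so the optimizer could drop it (illustrative example only).
spark.range(1000).repartition(100).limit(10).explain()
{code}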




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42832) Remove repartition if it is the child of LocalLimit

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42832:


Assignee: (was: Apache Spark)

> Remove repartition if it is the child of LocalLimit
> ---
>
> Key: SPARK-42832
> URL: https://issues.apache.org/jira/browse/SPARK-42832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42832) Remove repartition if it is the child of LocalLimit

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42832:


Assignee: Apache Spark

> Remove repartition if it is the child of LocalLimit
> ---
>
> Key: SPARK-42832
> URL: https://issues.apache.org/jira/browse/SPARK-42832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42831) Show result expressions in AggregateExec

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701471#comment-17701471
 ] 

Apache Spark commented on SPARK-42831:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/40461

> Show result expressions in AggregateExec
> 
>
> Key: SPARK-42831
> URL: https://issues.apache.org/jira/browse/SPARK-42831
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Minor
>
> If the result expressions in AggregateExec are not empty, we should display 
> them. Or we will get confused because some important expressions do not show 
> up in the DAG.
> For example, the plan for query *SELECT sum(p) from values(cast(23.4 as 
> decimal(7,2))) t(p)*  was incorrect because the result expression 
> *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed
> Before 
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> output=[sum#5L])
>  +- LocalTableScan [p#0]
> {code}
> After
> {code:java}
> == Physical Plan == 
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> results=[sum#13L], output=[sum#13L])
>  +- LocalTableScan [p#0]
> {code}
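For reference, the plans quoted above can be reproduced with a plain explain() call (a usage sketch; the exact plan text depends on the Spark version):

{code:java}
// Prints the physical plan for the example query discussed in this issue.
spark.sql("SELECT sum(p) FROM values(cast(23.4 as decimal(7,2))) t(p)").explain()
{code}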



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42831) Show result expressions in AggregateExec

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42831:


Assignee: (was: Apache Spark)

> Show result expressions in AggregateExec
> 
>
> Key: SPARK-42831
> URL: https://issues.apache.org/jira/browse/SPARK-42831
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Minor
>
> If the result expressions in AggregateExec are not empty, we should display 
> them. Or we will get confused because some important expressions do not show 
> up in the DAG.
> For example, the plan for query *SELECT sum(p) from values(cast(23.4 as 
> decimal(7,2))) t(p)*  was incorrect because the result expression 
> *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed
> Before 
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> output=[sum#5L])
>  +- LocalTableScan [p#0]
> {code}
> After
> {code:java}
> == Physical Plan == 
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> results=[sum#13L], output=[sum#13L])
>  +- LocalTableScan [p#0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42831) Show result expressions in AggregateExec

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701470#comment-17701470
 ] 

Apache Spark commented on SPARK-42831:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/40461

> Show result expressions in AggregateExec
> 
>
> Key: SPARK-42831
> URL: https://issues.apache.org/jira/browse/SPARK-42831
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Minor
>
> If the result expressions in AggregateExec are not empty, we should display 
> them. Or we will get confused because some important expressions do not show 
> up in the DAG.
> For example, the plan for query *SELECT sum(p) from values(cast(23.4 as 
> decimal(7,2))) t(p)*  was incorrect because the result expression 
> *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed
> Before 
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> output=[sum#5L])
>  +- LocalTableScan [p#0]
> {code}
> After
> {code:java}
> == Physical Plan == 
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> results=[sum#13L], output=[sum#13L])
>  +- LocalTableScan [p#0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42831) Show result expressions in AggregateExec

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42831:


Assignee: Apache Spark

> Show result expressions in AggregateExec
> 
>
> Key: SPARK-42831
> URL: https://issues.apache.org/jira/browse/SPARK-42831
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Assignee: Apache Spark
>Priority: Minor
>
> If the result expressions in AggregateExec are not empty, we should display 
> them. Or we will get confused because some important expressions do not show 
> up in the DAG.
> For example, the plan for query *SELECT sum(p) from values(cast(23.4 as 
> decimal(7,2))) t(p)*  was incorrect because the result expression 
> *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed
> Before 
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> output=[sum#5L])
>  +- LocalTableScan [p#0]
> {code}
> After
> {code:java}
> == Physical Plan == 
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> results=[sum#13L], output=[sum#13L])
>  +- LocalTableScan [p#0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42831) Show result expressions in AggregateExec

2023-03-16 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-42831:

Description: 
If the result expressions in AggregateExec are not empty, we should display 
them. Or we will get confused because some important expressions do not show up 
in the DAG.

For example, the plan for query *SELECT sum(p) from values(cast(23.4 as 
decimal(7,2))) t(p)*  was incorrect because the result expression 
*MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed

Before 

{code:java}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
output=[sum(p)#2])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
  +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
output=[sum#5L])
 +- LocalTableScan [p#0]
{code}


After

{code:java}
== Physical Plan == 
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], 
output=[sum(p)#2])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38]
  +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
results=[sum#13L], output=[sum#13L])
 +- LocalTableScan [p#0]
{code}


  was:
If the result expressions in AggregateExec is non-empty, we should show them. 
Or we will be confused due to some important expressions did not showed in DAG.

For example, the plan of query SELECT sum(p) from values(cast(23.4 as 
decimal(7,2))) t(p)  was not correct because the result expression 
MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2 was now showing

Before 

{code:java}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
output=[sum(p)#2])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
  +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
output=[sum#5L])
 +- LocalTableScan [p#0]
{code}


After

{code:java}
== Physical Plan == 
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], 
output=[sum(p)#2])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38]
  +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
results=[sum#13L], output=[sum#13L])
 +- LocalTableScan [p#0]
{code}



> Show result expressions in AggregateExec
> 
>
> Key: SPARK-42831
> URL: https://issues.apache.org/jira/browse/SPARK-42831
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Minor
>
> If the result expressions in AggregateExec are not empty, we should display 
> them. Or we will get confused because some important expressions do not show 
> up in the DAG.
> For example, the plan for query *SELECT sum(p) from values(cast(23.4 as 
> decimal(7,2))) t(p)*  was incorrect because the result expression 
> *MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2* is not displayed
> Before 
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> output=[sum#5L])
>  +- LocalTableScan [p#0]
> {code}
> After
> {code:java}
> == Physical Plan == 
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> results=[sum#13L], output=[sum#13L])
>  +- LocalTableScan [p#0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42832) Remove repartition if it is the child of LocalLimit

2023-03-16 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-42832:
---

 Summary: Remove repartition if it is the child of LocalLimit
 Key: SPARK-42832
 URL: https://issues.apache.org/jira/browse/SPARK-42832
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42831) Show result expressions in AggregateExec

2023-03-16 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-42831:

Description: 
If the result expressions in AggregateExec are non-empty, we should show them. 
Otherwise we will be confused because some important expressions are not shown in the DAG.

For example, the plan of query SELECT sum(p) from values(cast(23.4 as 
decimal(7,2))) t(p) was not correct because the result expression 
MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2 was not showing

Before 

{code:java}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
output=[sum(p)#2])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
  +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
output=[sum#5L])
 +- LocalTableScan [p#0]
{code}


After

{code:java}
== Physical Plan == 
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], 
output=[sum(p)#2])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38]
  +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
results=[sum#13L], output=[sum#13L])
 +- LocalTableScan [p#0]
{code}


  was:
If the result expressions in AggregateExec is non-empty, we should show them. 
Or we will be confused due to some important expressions did not showed in DAG.

For example, the plan of query SELECT sum(p) from values(cast(23.4 as 
decimal(7,2))) t(p)  was not correct because the result expression 
MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2 was now showing



> Show result expressions in AggregateExec
> 
>
> Key: SPARK-42831
> URL: https://issues.apache.org/jira/browse/SPARK-42831
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Minor
>
> If the result expressions in AggregateExec are non-empty, we should show them. 
> Otherwise we will be confused because some important expressions are not shown 
> in the DAG.
> For example, the plan of query SELECT sum(p) from values(cast(23.4 as 
> decimal(7,2))) t(p) was not correct because the result expression 
> MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2 was not showing
> Before 
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> output=[sum#5L])
>  +- LocalTableScan [p#0]
> {code}
> After
> {code:java}
> == Physical Plan == 
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[sum(UnscaledValue(p#0))], 
> results=[MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2], 
> output=[sum(p)#2])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=38]
>   +- HashAggregate(keys=[], functions=[partial_sum(UnscaledValue(p#0))], 
> results=[sum#13L], output=[sum#13L])
>  +- LocalTableScan [p#0]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42831) Show result expressions in AggregateExec

2023-03-16 Thread Wan Kun (Jira)
Wan Kun created SPARK-42831:
---

 Summary: Show result expressions in AggregateExec
 Key: SPARK-42831
 URL: https://issues.apache.org/jira/browse/SPARK-42831
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wan Kun


If the result expressions in AggregateExec are non-empty, we should show them. 
Otherwise we will be confused because some important expressions are not shown in the DAG.

For example, the plan of query SELECT sum(p) from values(cast(23.4 as 
decimal(7,2))) t(p) was not correct because the result expression 
MakeDecimal(sum(UnscaledValue(p#0))#1L,17,2) AS sum(p)#2 was not showing




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42415) The built-in dialects support OFFSET and paging query.

2023-03-16 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng resolved SPARK-42415.

Resolution: Won't Fix

> The built-in dialects support OFFSET and paging query.
> --
>
> Key: SPARK-42415
> URL: https://issues.apache.org/jira/browse/SPARK-42415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.

2023-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42824:


Assignee: Haejoon Lee

> Provide a clear error message for unsupported JVM attributes.
> -
>
> Key: SPARK-42824
> URL: https://issues.apache.org/jira/browse/SPARK-42824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> There are attributes, such as "_jvm", that were accessible in PySpark but 
> cannot be accessed in Spark Connect. We need to display appropriate error 
> messages for these cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.

2023-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42824.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40458
[https://github.com/apache/spark/pull/40458]

> Provide a clear error message for unsupported JVM attributes.
> -
>
> Key: SPARK-42824
> URL: https://issues.apache.org/jira/browse/SPARK-42824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> There are attributes, such as "_jvm", that were accessible in PySpark but 
> cannot be accessed in Spark Connect. We need to display appropriate error 
> messages for these cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases

2023-03-16 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-42551:

Description: 
h1. *Design Sketch*
 * Get all common expressions from input expressions. Recursively visits all 
subexpressions regardless of whether the current expression is a conditional 
expression.
 * For each common expression:
 * Add a new boolean variable *subExprInit_n* to indicate whether we have  
already evaluated the common expression, and reset it to *false* at the start 
of operator.consume()
 * Add a new wrapper subExpr function for common subexpression, and replace all 
the common subexpression with the wrapper function.


{code:java}
private void subExpr_n(${argList.mkString(", ")}) {
 if (!subExprInit_n) {
   ${eval.code}
   subExprInit_n = true;
   subExprIsNull_n = ${eval.isNull};
   subExprValue_n = ${eval.value};
 }
}
{code}


h1. *New support subexpression elimination patterns*
 * 
h2. *Support subexpression elimination with conditional expressions*


{code:java}
SELECT case when v + 2 > 1 then 1
when v + 1 > 2 then 2
when v + 1 > 3 then 3 END vv
FROM values(1) as t2(v)
{code}


We can reuse the result of expression  *v + 1*


{code:java}
SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a
{code}


We can reuse the result of expression  b + c
 * 
h2. *Support subexpression elimination in FilterExec*

 
{code:java}
SELECT * FROM (
  SELECT v * v + 1 v1 from values(1) as t2(v)
) t
where v1 > 5 and v1 < 10
{code}
We can reuse the result of expression  *v* * *v* *+* *1*
 * 
h2. *Support subexpression elimination in JoinExec*

 
{code:java}
SELECT * 
FROM values(1, 1) as t1(a, b) 
join values(1, 2) as t2(x, y)
ON b * y between 2 and 3
{code}

 
We can reuse the result of expression  *b* * *y*
 * 
h2. *Support subexpression elimination in ExpandExec*


{code:java}
SELECT a, count(b),
count(distinct case when b > 1 then b + c else null end) as count_bc_1,
count(distinct case when b < 0 then b + c else null end) as count_bc_2
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a
{code}


We can reuse the result of expression  b + c

  was:
h1. *Design Sketch*
 * Get all common expressions from input expressions. Recursively visits all 
subexpressions regardless of whether the current expression is a conditional 
expression.
 * For each common expression:
 * Add a new boolean variable *subExprInit_n* to indicate whether we have  
already evaluated the common expression, and reset it to *false* at the start 
of operator.consume()
 * Add a new wrapper subExpr function for common subexpression, and replace all 
the common subexpression with the wrapper function.

|private void subExpr_n(${argList.mkString(", ")}) {
 if (!subExprInit_n) {
   ${eval.code}
   subExprInit_n = true;
   subExprIsNull_n = ${eval.isNull};
   subExprValue_n = ${eval.value};
 }
}|
h1. *New support subexpression elimination patterns*
 * 
h2. *Support subexpression elimination with conditional expressions*

|SELECT case when v + 2 > 1 then 1
when v + 1 > 2 then 2
when v + 1 > 3 then 3 END vv
FROM values(1) as t2(v)|

We can reuse the result of expression  *v + 1*

 
|SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) 
min_bc
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a​​​|

We can reuse the result of expression  b + c
 * 
h2. *Support subexpression elimination in FilterExec*

 
|SELECT * FROM (
  SELECT v * v + 1 v1 from values(1) as t2(v)
) t
where v1 > 5 and v1 < 10|

We can reuse the result of expression *v * v + 1*
 * 
h2. *Support subexpression elimination in JoinExec*

 
|SELECT * 
FROM values(1, 1) as t1(a, b) 
join values(1, 2) as t2(x, y)
ON b * y between 2 and 3|

We can reuse the result of expression *b * y*
 * 
h2. *Support subexpression elimination in ExpandExec*

 
|SELECT a, count(b),
    count(distinct case when b > 1 then b + c else null end) as count_bc_1,
    count(distinct case when b < 0 then b + c else null end) as count_bc_2
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a|

We can reuse the result of expression b + c


> Support more subexpression elimination cases
> 
>
> Key: SPARK-42551
> URL: https://issues.apache.org/jira/browse/SPARK-42551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Wan Kun
>Priority: Major
>
> h1. *Design Sketch*
>  * Get all common expressions from input expressions. Recursively visits all 
> subexpressions regardless of whether the 

[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases

2023-03-16 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-42551:

Description: 
h1. *Design Sketch*
 * Get all common expressions from input expressions. Recursively visits all 
subexpressions regardless of whether the current expression is a conditional 
expression.
 * For each common expression:
 * Add a new boolean variable *subExprInit_n* to indicate whether we have  
already evaluated the common expression, and reset it to *false* at the start 
of operator.consume()
 * Add a new wrapper subExpr function for common subexpression, and replace all 
the common subexpression with the wrapper function.

|private void subExpr_n(${argList.mkString(", ")}) {
 if (!subExprInit_n) {
   ${eval.code}
   subExprInit_n = true;
   subExprIsNull_n = ${eval.isNull};
   subExprValue_n = ${eval.value};
 }
}|
h1. *New support subexpression elimination patterns*
 * 
h2. *Support subexpression elimination with conditional expressions*

|SELECT case when v + 2 > 1 then 1
when v + 1 > 2 then 2
when v + 1 > 3 then 3 END vv
FROM values(1) as t2(v)|

We can reuse the result of expression  *v + 1*

 
|SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) 
min_bc
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a​​​|

We can reuse the result of expression  b + c
 * 
h2. *Support subexpression elimination in FilterExec*

 
|SELECT * FROM (
  SELECT v * v + 1 v1 from values(1) as t2(v)
) t
where v1 > 5 and v1 < 10|

We can reuse the result of expression *v * v + 1*
 * 
h2. *Support subexpression elimination in JoinExec*

 
|SELECT * 
FROM values(1, 1) as t1(a, b) 
join values(1, 2) as t2(x, y)
ON b * y between 2 and 3|

We can reuse the result of expression *b * y*
 * 
h2. *Support subexpression elimination in ExpandExec*

 
|SELECT a, count(b),
    count(distinct case when b > 1 then b + c else null end) as count_bc_1,
    count(distinct case when b < 0 then b + c else null end) as count_bc_2
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a|

We can reuse the result of expression b + c

  was:
h1. *Design Sketch*
 * Get all common expressions from input expressions. Recursively visits all 
subexpressions regardless of whether the current expression is a conditional 
expression.
 * For each common expression:
 * Add a new boolean variable *subExprInit_n* to indicate whether we have  
already evaluated the common expression, and reset it to *false* at the start 
of operator.consume()
 * Add a new wrapper subExpr function for common subexpression, and replace all 
the common subexpression with the wrapper function.

|private void subExpr_n(${argList.mkString(", ")}) {
 if (!subExprInit_n) {
   ${eval.code}
   subExprInit_n = true;
   subExprIsNull_n = ${eval.isNull};
   subExprValue_n = ${eval.value};
 }
}|
h1. *New support subexpression elimination patterns*
 * 
h2. *Support subexpression elimination with conditional expressions*

|SELECT case when v + 2 > 1 then 1
when v + 1 > 2 then 2
when v + 1 > 3 then 3 END vv
FROM values(1) as t2(v)|

We can reuse the result of expression  *v + 1*

 
|SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) 
min_bc
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a​​​|

We can reuse the result of expression  b + c
 * 
h2. *Support subexpression elimination in FilterExec*

 
|SELECT * FROM (
  SELECT v * v + 1 v1 from values(1) as t2(v)
) t
where v1 > 5 and v1 < 10|

We can reuse the result of expression *v * v + 1*
 * 
h2. *Support subexpression elimination in JoinExec*

 
|WITH t1 ( SELECT * FROM values(1, 1) as t(a, b)),
t2 ( SELECT * FROM values(1, 2) as t(x, y))
SELECT * FROM t1 join t2
ON b * y between 2 and 3|

We can reuse the result of expression *b * y*
 * 
h2. *Support subexpression elimination in ExpandExec*

 
|SELECT a, count(b),
    count(distinct case when b > 1 then b + c else null end) as count_bc_1,
    count(distinct case when b < 0 then b + c else null end) as count_bc_2
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a|

We can reuse the result of expression b + c


> Support more subexpression elimination cases
> 
>
> Key: SPARK-42551
> URL: https://issues.apache.org/jira/browse/SPARK-42551
> Project: Spark
>  Issue Type: Improvement
>  Components: 

[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases

2023-03-16 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-42551:

Description: 
h1. *Design Sketch*
 * Get all common expressions from input expressions. Recursively visits all 
subexpressions regardless of whether the current expression is a conditional 
expression.
 * For each common expression:
 * Add a new boolean variable *subExprInit_n* to indicate whether we have  
already evaluated the common expression, and reset it to *false* at the start 
of operator.consume()
 * Add a new wrapper subExpr function for common subexpression, and replace all 
the common subexpression with the wrapper function.

|private void subExpr_n(${argList.mkString(", ")}) {
 if (!subExprInit_n) {
   ${eval.code}
   subExprInit_n = true;
   subExprIsNull_n = ${eval.isNull};
   subExprValue_n = ${eval.value};
 }
}|
h1. *New support subexpression elimination patterns*
 * 
h2. *Support subexpression elimination with conditional expressions*

|SELECT case when v + 2 > 1 then 1
when v + 1 > 2 then 2
when v + 1 > 3 then 3 END vv
FROM values(1) as t2(v)|

We can reuse the result of expression  *v + 1*

 
|SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) 
min_bc
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a​​​|

We can reuse the result of expression  b + c
 * 
h2. *Support subexpression elimination in FilterExec*

 
|SELECT * FROM (
  SELECT v * v + 1 v1 from values(1) as t2(v)
) t
where v1 > 5 and v1 < 10|

We can reuse the result of expression *v * v + 1*
 * 
h2. *Support subexpression elimination in JoinExec*

 
|WITH t1 ( SELECT * FROM values(1, 1) as t(a, b)),
t2 ( SELECT * FROM values(1, 2) as t(x, y))
SELECT * FROM t1 join t2
ON b * y between 2 and 3|

We can reuse the result of expression *b * y*
 * 
h2. *Support subexpression elimination in ExpandExec*

 
|SELECT a, count(b),
    count(distinct case when b > 1 then b + c else null end) as count_bc_1,
    count(distinct case when b < 0 then b + c else null end) as count_bc_2
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a|

We can reuse the result of expression b + c

  was:
Just like SPARK-33092, We can support subexpression elimination in FilterExec 
in Whole-stage codegen.
For example:
{code:java}
SELECT * FROM (
  SELECT v, v * v + 1 v1 from values(1) as t2(v)
) t
where v > 0 and v1 > 5 and v1 < 10
{code}

Codegen plan
{code:java}
*(1) Project [v#1, ((v#1 * v#1) + 1) AS v1#0]
+- *(1) Filter (((v#1 > 0) AND (((v#1 * v#1) + 1) > 5)) AND (((v#1 * v#1) + 1) 
< 10))
   +- *(1) LocalTableScan [v#1]
{code}
The subexpression *(v#1 * v#1) + 1* will be executed twice.


> Support more subexpression elimination cases
> 
>
> Key: SPARK-42551
> URL: https://issues.apache.org/jira/browse/SPARK-42551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Wan Kun
>Priority: Major
>
> h1. *Design Sketch*
>  * Get all common expressions from input expressions. Recursively visits all 
> subexpressions regardless of whether the current expression is a conditional 
> expression.
>  * For each common expression:
>  * Add a new boolean variable *subExprInit_n* to indicate whether we have  
> already evaluated the common expression, and reset it to *false* at the start 
> of operator.consume()
>  * Add a new wrapper subExpr function for common subexpression, and replace 
> all the common subexpression with the wrapper function.
> |private void subExpr_n(${argList.mkString(", ")}) {
>  if (!subExprInit_n) {
>    ${eval.code}
>    subExprInit_n = true;
>    subExprIsNull_n = ${eval.isNull};
>    subExprValue_n = ${eval.value};
>  }
> }|
> h1. *New support subexpression elimination patterns*
>  * 
> h2. *Support subexpression elimination with conditional expressions*
> |SELECT case when v + 2 > 1 then 1
> when v + 1 > 2 then 2
> when v + 1 > 3 then 3 END vv
> FROM values(1) as t2(v)|
> We can reuse the result of expression  *v + 1*
>  
> |SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) 
> min_bc
> FROM values(1, 1, 1) as t(a, b, c)
> GROUP BY a​​​|
> We can reuse the result of expression  b + c
>  * 
> h2. *Support subexpression elimination in FilterExec*
>  
> |SELECT * FROM (
>   SELECT v * v + 1 v1 from values(1) as t2(v)
> ) t
> where v1 > 5 and v1 < 10|
> We can reuse the result of expression  *v* *** 

[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases

2023-03-16 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-42551:

Summary: Support more subexpression elimination cases  (was: Support 
subexpression elimination in FilterExec and JoinExec)

> Support more subexpression elimination cases
> 
>
> Key: SPARK-42551
> URL: https://issues.apache.org/jira/browse/SPARK-42551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Wan Kun
>Priority: Major
>
> Just like SPARK-33092, We can support subexpression elimination in FilterExec 
> in Whole-stage codegen.
> For example:
> {code:java}
> SELECT * FROM (
>   SELECT v, v * v + 1 v1 from values(1) as t2(v)
> ) t
> where v > 0 and v1 > 5 and v1 < 10
> {code}
> Codegen plan
> {code:java}
> *(1) Project [v#1, ((v#1 * v#1) + 1) AS v1#0]
> +- *(1) Filter (((v#1 > 0) AND (((v#1 * v#1) + 1) > 5)) AND (((v#1 * v#1) + 
> 1) < 10))
>+- *(1) LocalTableScan [v#1]
> {code}
> The subexpression *(v#1 * v#1) + 1* will be executed twice.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42194) Allow `columns` parameter when creating DataFrame with Series.

2023-03-16 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-42194:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Bug)

> Allow `columns` parameter when creating DataFrame with Series.
> --
>
> Key: SPARK-42194
> URL: https://issues.apache.org/jira/browse/SPARK-42194
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> pandas API on Spark doesn't allow creating DataFrame with Series by 
> specifying the `columns` parameter as below:
> {code:java}
> >>> ps.DataFrame(psser, columns=["labels"])
> Traceback (most recent call last):
>   File "", line 1, in 
>   File ".../spark/python/pyspark/pandas/frame.py", line 539, in __init__
>     assert columns is None
> AssertionError {code}
> We should make it available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38937) interpolate support param `limit_direction`

2023-03-16 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38937:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> interpolate support param `limit_direction`
> ---
>
> Key: SPARK-38937
> URL: https://issues.apache.org/jira/browse/SPARK-38937
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39189) interpolate supports limit_area

2023-03-16 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39189:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> interpolate supports limit_area
> ---
>
> Key: SPARK-39189
> URL: https://issues.apache.org/jira/browse/SPARK-39189
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38943) EWM support ignore_na

2023-03-16 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-38943:
-
Parent: SPARK-39199
Issue Type: Sub-task  (was: Improvement)

> EWM support ignore_na
> -
>
> Key: SPARK-38943
> URL: https://issues.apache.org/jira/browse/SPARK-38943
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42826) Add migration notes for update to supported pandas version.

2023-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42826:


Assignee: Haejoon Lee

> Add migration notes for update to supported pandas version.
> ---
>
> Key: SPARK-42826
> URL: https://issues.apache.org/jira/browse/SPARK-42826
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We deprecated & removed some APIs in 
> https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas.
> We should mention this in the migration guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42826) Add migration notes for update to supported pandas version.

2023-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42826.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40459
[https://github.com/apache/spark/pull/40459]

> Add migration notes for update to supported pandas version.
> ---
>
> Key: SPARK-42826
> URL: https://issues.apache.org/jira/browse/SPARK-42826
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We deprecated & removed some APIs in 
> https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas.
> We should mention this in the migration guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42817) Spark driver logs are filled with Initializing service data for shuffle service using name

2023-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42817.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40448
[https://github.com/apache/spark/pull/40448]

> Spark driver logs are filled with Initializing service data for shuffle 
> service using name
> --
>
> Key: SPARK-42817
> URL: https://issues.apache.org/jira/browse/SPARK-42817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> With SPARK-34828, we added the ability to make the shuffle service name 
> configurable and we added a log 
> [here|https://github.com/apache/spark/blob/8860f69455e5a722626194c4797b4b42cccd4510/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L118]
>  that will log the shuffle service name. However, this log is printed in the 
> driver logs whenever there is new executor launched and pollutes the log. 
> {code}
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> {code}
> We can just log this once in the driver.
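
The actual fix belongs in the Scala driver code (ExecutorRunnable); purely to illustrate the "log it once" idea, here is a minimal, hypothetical sketch in Python:

{code:python}
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ExecutorRunnableSketch")

_logged_shuffle_service_name = False

def log_shuffle_service_once(name: str) -> None:
    """Log the configured shuffle service name only the first time it is seen."""
    global _logged_shuffle_service_name
    if not _logged_shuffle_service_name:
        logger.info("Initializing service data for shuffle service using name '%s'", name)
        _logged_shuffle_service_name = True

# Called once per executor launch, but emits a single log line overall.
for _ in range(20):
    log_shuffle_service_once("spark_shuffle_311")
{code}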



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42817) Spark driver logs are filled with Initializing service data for shuffle service using name

2023-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42817:
-

Assignee: Chandni Singh

> Spark driver logs are filled with Initializing service data for shuffle 
> service using name
> --
>
> Key: SPARK-42817
> URL: https://issues.apache.org/jira/browse/SPARK-42817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> With SPARK-34828, we added the ability to make the shuffle service name 
> configurable and we added a log 
> [here|https://github.com/apache/spark/blob/8860f69455e5a722626194c4797b4b42cccd4510/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L118]
>  that will log the shuffle service name. However, this log is printed in the 
> driver logs whenever there is new executor launched and pollutes the log. 
> {code}
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> {code}
> We can just log this once in the driver.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42830) Link skipped stages on Spark UI

2023-03-16 Thread Yian Liou (Jira)
Yian Liou created SPARK-42830:
-

 Summary: Link skipped stages on Spark UI
 Key: SPARK-42830
 URL: https://issues.apache.org/jira/browse/SPARK-42830
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.3.2
Reporter: Yian Liou


Add a link to the skipped Spark stages so that it's easier to find the execution 
details on the UI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42828:


Assignee: Apache Spark

> PySpark type hint returns Any for methods on GroupedData
> 
>
> Key: SPARK-42828
> URL: https://issues.apache.org/jira/browse/SPARK-42828
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Joe Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Since upgrading to PySpark 3.3.x, type hints for
> {code:java}
> df.groupBy(...).count(){code}
> are now returning Any instead of DataFrame, causing type inference issues 
> downstream. This used to be correctly typed prior to 3.3.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42828:


Assignee: (was: Apache Spark)

> PySpark type hint returns Any for methods on GroupedData
> 
>
> Key: SPARK-42828
> URL: https://issues.apache.org/jira/browse/SPARK-42828
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Joe Wang
>Priority: Minor
>
> Since upgrading to PySpark 3.3.x, type hints for
> {code:java}
> df.groupBy(...).count(){code}
> are now returning Any instead of DataFrame, causing type inference issues 
> downstream. This used to be correctly typed prior to 3.3.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701345#comment-17701345
 ] 

Apache Spark commented on SPARK-42828:
--

User 'j03wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40460

> PySpark type hint returns Any for methods on GroupedData
> 
>
> Key: SPARK-42828
> URL: https://issues.apache.org/jira/browse/SPARK-42828
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Joe Wang
>Priority: Minor
>
> Since upgrading to PySpark 3.3.x, type hints for
> {code:java}
> df.groupBy(...).count(){code}
> are now returning Any instead of DataFrame, causing type inference issues 
> downstream. This used to be correctly typed prior to 3.3.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42829) Added Identifier to the cached RDD operator on the Stages page

2023-03-16 Thread Yian Liou (Jira)
Yian Liou created SPARK-42829:
-

 Summary: Added Identifier to the cached RDD operator on the Stages 
page 
 Key: SPARK-42829
 URL: https://issues.apache.org/jira/browse/SPARK-42829
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.3.2
Reporter: Yian Liou


On the Stages page in the Web UI, there is no way to tell which cached RDD 
is being executed in a particular stage. This Jira aims to add a repeat 
identifier to distinguish which cached RDD is being executed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42826) Add migration notes for update to supported pandas version.

2023-03-16 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-42826:

Summary: Add migration notes for update to supported pandas version.  (was: 
Add migration note for API changes)

> Add migration notes for update to supported pandas version.
> ---
>
> Key: SPARK-42826
> URL: https://issues.apache.org/jira/browse/SPARK-42826
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We deprecated & removed some APIs in 
> https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas.
> We should mention this in the migration guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42826) Add migration note for API changes

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42826:


Assignee: (was: Apache Spark)

> Add migration note for API changes
> --
>
> Key: SPARK-42826
> URL: https://issues.apache.org/jira/browse/SPARK-42826
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We deprecated & removed some APIs in 
> https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas.
> We should mention this in the migration guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42826) Add migration note for API changes

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42826:


Assignee: Apache Spark

> Add migration note for API changes
> --
>
> Key: SPARK-42826
> URL: https://issues.apache.org/jira/browse/SPARK-42826
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> We deprecated & removed some APIs in 
> https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas.
> We should mention this in the migration guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42826) Add migration note for API changes

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701295#comment-17701295
 ] 

Apache Spark commented on SPARK-42826:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/40459

> Add migration note for API changes
> --
>
> Key: SPARK-42826
> URL: https://issues.apache.org/jira/browse/SPARK-42826
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We deprecated & removed some APIs in 
> https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas.
> We should mention this in the migration guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42828) PySpark type hint returns Any for methods on GroupedData

2023-03-16 Thread Joe Wang (Jira)
Joe Wang created SPARK-42828:


 Summary: PySpark type hint returns Any for methods on GroupedData
 Key: SPARK-42828
 URL: https://issues.apache.org/jira/browse/SPARK-42828
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.2, 3.3.1, 3.3.0
Reporter: Joe Wang


Since upgrading to PySpark 3.3.x, type hints for
{code:java}
df.groupBy(...).count(){code}
are now returning Any instead of DataFrame, causing type inference issues 
downstream. This used to be correctly typed prior to 3.3.x.
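
For illustration, a minimal sketch of the pattern whose static type regressed to Any (assuming a local SparkSession; the runtime return type is still DataFrame, only the type hints report Any):

{code:python}
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# GroupedData.count() still returns a DataFrame at runtime; the report is that
# the type hints annotate it as Any, so checkers such as mypy lose the type here.
counts = df.groupBy("key").count()
assert isinstance(counts, DataFrame)
{code}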



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42827) Support `functions#array_prepend`

2023-03-16 Thread Yang Jie (Jira)
Yang Jie created SPARK-42827:


 Summary: Support `functions#array_prepend`
 Key: SPARK-42827
 URL: https://issues.apache.org/jira/browse/SPARK-42827
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Yang Jie


Wait for SPARK-41233



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42826) Add migration note for API changes

2023-03-16 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701277#comment-17701277
 ] 

Haejoon Lee commented on SPARK-42826:
-

I'm working on it

> Add migration note for API changes
> --
>
> Key: SPARK-42826
> URL: https://issues.apache.org/jira/browse/SPARK-42826
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We deprecated & removed some APIs in 
> https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas.
> We should mention this in the migration guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42826) Add migration note for API changes

2023-03-16 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-42826:
---

 Summary: Add migration note for API changes
 Key: SPARK-42826
 URL: https://issues.apache.org/jira/browse/SPARK-42826
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We deprecated & removed some APIs in 
https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas.

We should mention this in the migration guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42693) API Auditing

2023-03-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701240#comment-17701240
 ] 

Dongjoon Hyun commented on SPARK-42693:
---

Thank you!

> API Auditing
> 
>
> Key: SPARK-42693
> URL: https://issues.apache.org/jira/browse/SPARK-42693
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Blocker
>
> Audit user-facing API of Spark 3.4.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42824:


Assignee: Apache Spark

> Provide a clear error message for unsupported JVM attributes.
> -
>
> Key: SPARK-42824
> URL: https://issues.apache.org/jira/browse/SPARK-42824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> There are attributes, such as "_jvm", that were accessible in PySpark but 
> cannot be accessed in Spark Connect. We need to display appropriate error 
> messages for these cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42824:


Assignee: (was: Apache Spark)

> Provide a clear error message for unsupported JVM attributes.
> -
>
> Key: SPARK-42824
> URL: https://issues.apache.org/jira/browse/SPARK-42824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are attributes, such as "_jvm", that were accessible in PySpark but 
> cannot be accessed in Spark Connect. We need to display appropriate error 
> messages for these cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701220#comment-17701220
 ] 

Apache Spark commented on SPARK-42824:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/40458

> Provide a clear error message for unsupported JVM attributes.
> -
>
> Key: SPARK-42824
> URL: https://issues.apache.org/jira/browse/SPARK-42824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are attributes, such as "_jvm", that were accessible in PySpark but 
> cannot be accessed in Spark Connect. We need to display appropriate error 
> messages for these cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42824) Show proper error message for unsupported JVM attribute.

2023-03-16 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-42824:

Summary: Show proper error message for unsupported JVM attribute.  (was: 
Show error messages for attributes that cannot be accessed)

> Show proper error message for unsupported JVM attribute.
> 
>
> Key: SPARK-42824
> URL: https://issues.apache.org/jira/browse/SPARK-42824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are attributes, such as "_jvm", that were accessible in PySpark but 
> cannot be accessed in Spark Connect. We need to display appropriate error 
> messages for these cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.

2023-03-16 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-42824:

Summary: Provide a clear error message for unsupported JVM attributes.  
(was: Show proper error message for unsupported JVM attribute.)

> Provide a clear error message for unsupported JVM attributes.
> -
>
> Key: SPARK-42824
> URL: https://issues.apache.org/jira/browse/SPARK-42824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are attributes, such as "_jvm", that were accessible in PySpark but 
> cannot be accessed in Spark Connect. We need to display appropriate error 
> messages for these cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42825) setParams() only sets explicitly named params. Is this intentional or a bug?

2023-03-16 Thread Lucas Partridge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Partridge updated SPARK-42825:

Description: 
The Python signature/docstring of the setParams() method for the estimators and 
transformers under pyspark.ml imply that if you don't set any of the named 
params then they will be reset to their default values.

Example from 
[https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html#pyspark.ml.clustering.GaussianMixture.setParams]
 :
{code:java}
setParams(self, \*, featuresCol="features", predictionCol="prediction", k=2, 
probabilityCol="probability", tol=0.01, maxIter=100, seed=None, 
aggregationDepth=2, weightCol=None){code}
In the extreme this would imply that if you called setParams() with no args 
then _all_ the params would be reset to their default values.

But what actually happens is that _only_ the params passed in the call get 
changed; the values of any other params aren't affected. So if you call 
setParams() with no args then _no_ params get changed!

So is this behavior by design? I guess it is from the name of the method. But 
it is counter-intuitive from its docstring. So if this behavior is intentional 
then perhaps the default docstring should make this explicit by saying 
something like:

"Sets the named params. The values of other params are not affected."

  was:
The Python signature/docstring of the setParams() method for the estimators and 
transformers under pyspark.ml imply that if you don't set any of the named 
params then they will be reset to their default values.

Example from 
[https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html#pyspark.ml.clustering.GaussianMixture.setParams]
 :


{code:java}
setParams(self, \*, featuresCol="features", predictionCol="prediction", k=2, 
probabilityCol="probability", tol=0.01, maxIter=100, seed=None, 
aggregationDepth=2, weightCol=None){code}
In the extreme this would imply that if you called setParams() with no args 
then _all_ the params would be reset to their default values.

But what actually happens is that _only_ the params passed in the call get 
changed; the values of any other params aren't affected. So if you call 
setParams() with no args then _no_ params get changed!

So is this behavior by design? I guess it is from the name of the method. But 
it is counter-intuitive from its docstring. So if this behavior is intentional 
then perhaps the default docstring should make this explicit by saying 
something like:

"Sets the named params. The values of other params are not affected."


> setParams() only sets explicitly named params. Is this intentional or a bug?
> 
>
> Key: SPARK-42825
> URL: https://issues.apache.org/jira/browse/SPARK-42825
> Project: Spark
>  Issue Type: Question
>  Components: ML, PySpark
>Affects Versions: 3.3.2
>Reporter: Lucas Partridge
>Priority: Minor
>
> The Python signature/docstring of the setParams() method for the estimators 
> and transformers under pyspark.ml imply that if you don't set any of the 
> named params then they will be reset to their default values.
> Example from 
> [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html#pyspark.ml.clustering.GaussianMixture.setParams]
>  :
> {code:java}
> setParams(self, \*, featuresCol="features", predictionCol="prediction", k=2, 
> probabilityCol="probability", tol=0.01, maxIter=100, seed=None, 
> aggregationDepth=2, weightCol=None){code}
> In the extreme this would imply that if you called setParams() with no args 
> then _all_ the params would be reset to their default values.
> But what actually happens is that _only_ the params passed in the call get 
> changed; the values of any other params aren't affected. So if you call 
> setParams() with no args then _no_ params get changed!
> So is this behavior by design? I guess it is from the name of the method. But 
> it is counter-intuitive from its docstring. So if this behavior is 
> intentional then perhaps the default docstring should make this explicit by 
> saying something like:
> "Sets the named params. The values of other params are not affected."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42825) setParams() only sets explicitly named params. Is this intentional or a bug?

2023-03-16 Thread Lucas Partridge (Jira)
Lucas Partridge created SPARK-42825:
---

 Summary: setParams() only sets explicitly named params. Is this 
intentional or a bug?
 Key: SPARK-42825
 URL: https://issues.apache.org/jira/browse/SPARK-42825
 Project: Spark
  Issue Type: Question
  Components: ML, PySpark
Affects Versions: 3.3.2
Reporter: Lucas Partridge


The Python signature/docstring of the setParams() method for the estimators and 
transformers under pyspark.ml imply that if you don't set any of the named 
params then they will be reset to their default values.

Example from 
[https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html#pyspark.ml.clustering.GaussianMixture.setParams]
 :


{code:java}
setParams(self, \*, featuresCol="features", predictionCol="prediction", k=2, 
probabilityCol="probability", tol=0.01, maxIter=100, seed=None, 
aggregationDepth=2, weightCol=None){code}
In the extreme this would imply that if you called setParams() with no args 
then _all_ the params would be reset to their default values.

But what actually happens is that _only_ the params passed in the call get 
changed; the values of any other params aren't affected. So if you call 
setParams() with no args then _no_ params get changed!

So is this behavior by design? I guess it is from the name of the method. But 
it is counter-intuitive from its docstring. So if this behavior is intentional 
then perhaps the default docstring should make this explicit by saying 
something like:

"Sets the named params. The values of other params are not affected."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41233) High-order function: array_prepend

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41233:


Assignee: (was: Apache Spark)

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1, about the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> While in Spark, we want to leverage the same data type validation as 
> array_remove.
> 2, about the NULL handling
> Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> Existing functions array_contains, array_position and array_remove in 
> SparkSQL handle NULL in this way: if the input array and/or element is NULL, 
> they return NULL. However, this behavior should not simply be carried over.
> We should implement the NULL handling in array_prepend in this way:
> 2.1, if the array is NULL, return NULL;
> 2.2, if the array is not NULL and the element is NULL, prepend the NULL value 
> to the array
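
For illustration, a short PySpark sketch of the NULL semantics proposed above (assuming the function is ultimately exposed to SQL as array_prepend, as this ticket proposes):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Proposed rule 2.1: a NULL array yields NULL.
spark.sql("SELECT array_prepend(CAST(NULL AS ARRAY<INT>), 1)").show()

# Proposed rule 2.2: a NULL element is prepended to the non-NULL array,
# so the expected result here is [NULL, 1, 2] rather than NULL.
spark.sql("SELECT array_prepend(array(1, 2), CAST(NULL AS INT))").show()
{code}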



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41233) High-order function: array_prepend

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41233:


Assignee: Apache Spark

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1, about the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> While in Spark, we want to leverage the same data type validation as 
> array_remove.
> 2, about the NULL handling
> Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> Existing functions array_contains, array_position and array_remove in 
> SparkSQL handle NULL in this way: if the input array and/or element is NULL, 
> they return NULL. However, this behavior should not simply be carried over.
> We should implement the NULL handling in array_prepend in this way:
> 2.1, if the array is NULL, return NULL;
> 2.2, if the array is not NULL and the element is NULL, prepend the NULL value 
> to the array



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-41233) High-order function: array_prepend

2023-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-41233:
--

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1, about the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> While in Spark, we want to leverage the same data type validation as 
> array_remove.
> 2, about the NULL handling
> Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> Existing functions array_contains, array_position and array_remove in 
> SparkSQL handle NULL in this way: if the input array and/or element is NULL, 
> they return NULL. However, this behavior should not simply be carried over.
> We should implement the NULL handling in array_prepend in this way:
> 2.1, if the array is NULL, return NULL;
> 2.2, if the array is not NULL and the element is NULL, prepend the NULL value 
> to the array



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41233) High-order function: array_prepend

2023-03-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701105#comment-17701105
 ] 

Hyukjin Kwon commented on SPARK-41233:
--

Reverted at 
https://github.com/apache/spark/commit/baf90206d04738e63ea71f63d86668a7dc7c8f9a

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1, about the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> While in Spark, we want to leverage the same data type validation as 
> array_remove.
> 2, about the NULL handling
> Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> Existing functions array_contains, array_position and array_remove in 
> SparkSQL handle NULL in this way: if the input array and/or element is NULL, 
> they return NULL. However, this behavior should not simply be carried over.
> We should implement the NULL handling in array_prepend in this way:
> 2.1, if the array is NULL, return NULL;
> 2.2, if the array is not NULL and the element is NULL, prepend the NULL value 
> to the array



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41233) High-order function: array_prepend

2023-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41233:
-
Fix Version/s: (was: 3.5.0)

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1, about the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> While in Spark, we want to leverage the same data type validation as 
> array_remove.
> 2, about the NULL handling
> Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> Existing functions array_contains, array_position and array_remove in 
> SparkSQL handle NULL in this way: if the input array and/or element is NULL, 
> they return NULL. However, this behavior should not simply be carried over.
> We should implement the NULL handling in array_prepend in this way:
> 2.1, if the array is NULL, return NULL;
> 2.2, if the array is not NULL and the element is NULL, prepend the NULL value 
> to the array



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42824) Show error messages for attributes that cannot be accessed

2023-03-16 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701104#comment-17701104
 ] 

Haejoon Lee commented on SPARK-42824:
-

I'm working on it

> Show error messages for attributes that cannot be accessed
> --
>
> Key: SPARK-42824
> URL: https://issues.apache.org/jira/browse/SPARK-42824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are attributes, such as "_jvm", that were accessible in PySpark but 
> cannot be accessed in Spark Connect. We need to display appropriate error 
> messages for these cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42824) Show error messages for attributes that cannot be accessed

2023-03-16 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-42824:
---

 Summary: Show error messages for attributes that cannot be accessed
 Key: SPARK-42824
 URL: https://issues.apache.org/jira/browse/SPARK-42824
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Haejoon Lee


There are attributes, such as "_jvm", that were accessible in PySpark but 
cannot be accessed in Spark Connect. We need to display appropriate error 
messages for these cases.
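
For illustration only, a minimal, hypothetical sketch of how a Connect session object could intercept such attributes and raise a descriptive error (the class name and message wording below are made up, not the actual Spark Connect implementation):

{code:python}
class ConnectSessionSketch:
    """Hypothetical stand-in for a Spark Connect session object."""

    # Attributes that exist on the classic PySpark session but have no
    # equivalent when the client talks to the server over Spark Connect.
    _UNSUPPORTED_JVM_ATTRS = {"_jvm", "_jsc", "_jconf"}

    def __getattr__(self, name):
        if name in self._UNSUPPORTED_JVM_ATTRS:
            raise AttributeError(
                f"'{name}' is not supported in Spark Connect because the client "
                "does not run a JVM; use a documented public API instead."
            )
        raise AttributeError(name)


try:
    ConnectSessionSketch()._jvm
except AttributeError as e:
    print(e)  # prints the explanatory message instead of a bare AttributeError
{code}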



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming

2023-03-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-42819.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40455
[https://github.com/apache/spark/pull/40455]

> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
> ---
>
> Key: SPARK-42819
> URL: https://issues.apache.org/jira/browse/SPARK-42819
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
> Fix For: 3.5.0
>
>
> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
>  
> We need these settings in order to control memory tuning for RocksDB. We 
> already expose settings for blockCache size. However, these 2 settings are 
> missing. This change proposes to add them.
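
For illustration, how such settings would typically be wired through the session config from PySpark; the two key names below are placeholders to show the shape of the change, not the exact option names added by the patch:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Placeholder keys, for illustration only; the real names are defined by the
    # patch for this ticket, alongside the existing RocksDB block-cache setting.
    .config("spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB", "64")
    .config("spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber", "3")
    .getOrCreate()
)
{code}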



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming

2023-03-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-42819:


Assignee: Anish Shrigondekar

> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
> ---
>
> Key: SPARK-42819
> URL: https://issues.apache.org/jira/browse/SPARK-42819
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>
> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
>  
> We need these settings in order to control memory tuning for RocksDB. We 
> already expose settings for blockCache size. However, these 2 settings are 
> missing. This change proposes to add them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42823:


Assignee: Apache Spark

> spark-sql shell supports multipart namespaces for initialization
> 
>
> Key: SPARK-42823
> URL: https://issues.apache.org/jira/browse/SPARK-42823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42823:


Assignee: (was: Apache Spark)

> spark-sql shell supports multipart namespaces for initialization
> 
>
> Key: SPARK-42823
> URL: https://issues.apache.org/jira/browse/SPARK-42823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701078#comment-17701078
 ] 

Apache Spark commented on SPARK-42823:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/40457

> spark-sql shell supports multipart namespaces for initialization
> 
>
> Key: SPARK-42823
> URL: https://issues.apache.org/jira/browse/SPARK-42823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41233) High-order function: array_prepend

2023-03-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41233:
-

Assignee: (was: Ruifeng Zheng)

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1, about the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> While in Spark, we want to leverage the same data type validation as 
> array_remove.
> 2, about the NULL handling
> Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> Existing functions array_contains, array_position and array_remove in 
> SparkSQL handle NULL in this way: if the input array and/or element is NULL, 
> they return NULL. However, this behavior should not simply be carried over.
> We should implement the NULL handling in array_prepend in this way:
> 2.1, if the array is NULL, return NULL;
> 2.2, if the array is not NULL and the element is NULL, prepend the NULL value 
> to the array



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41233) High-order function: array_prepend

2023-03-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41233:
-

Assignee: Ruifeng Zheng

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1, about the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> While in Spark, we want to leverage the same data type validation as 
> array_remove.
> 2, about the NULL handling
> Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> Existing functions array_contains, array_position and array_remove in 
> SparkSQL handle NULL in this way: if the input array and/or element is NULL, 
> they return NULL. However, this behavior should not simply be carried over.
> We should implement the NULL handling in array_prepend in this way:
> 2.1, if the array is NULL, return NULL;
> 2.2, if the array is not NULL and the element is NULL, prepend the NULL value 
> to the array



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41233) High-order function: array_prepend

2023-03-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41233.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 38947
[https://github.com/apache/spark/pull/38947]

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1, about the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> While in Spark, we want to leverage the same data type validation as 
> array_remove.
> 2, about the NULL handling
> Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> Existing functions array_contains, array_position and array_remove in 
> SparkSQL handle NULL in this way: if the input array and/or element is NULL, 
> they return NULL. However, this behavior should not simply be carried over.
> We should implement the NULL handling in array_prepend in this way:
> 2.1, if the array is NULL, return NULL;
> 2.2, if the array is not NULL and the element is NULL, prepend the NULL value 
> to the array



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42720) Refactor the withSequenceColumn

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701051#comment-17701051
 ] 

Apache Spark commented on SPARK-42720:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/40456

> Refactor the withSequenceColumn
> ---
>
> Key: SPARK-42720
> URL: https://issues.apache.org/jira/browse/SPARK-42720
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42720) Refactor the withSequenceColumn

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42720:


Assignee: (was: Apache Spark)

> Refactor the withSequenceColumn
> ---
>
> Key: SPARK-42720
> URL: https://issues.apache.org/jira/browse/SPARK-42720
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42720) Refactor the withSequenceColumn

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42720:


Assignee: Apache Spark

> Refactor the withSequenceColumn
> ---
>
> Key: SPARK-42720
> URL: https://issues.apache.org/jira/browse/SPARK-42720
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42720) Refactor the withSequenceColumn

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701050#comment-17701050
 ] 

Apache Spark commented on SPARK-42720:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/40456

> Refactor the withSequenceColumn
> ---
>
> Key: SPARK-42720
> URL: https://issues.apache.org/jira/browse/SPARK-42720
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42823) spark-sql shell supports multipart namespaces for initialization

2023-03-16 Thread Kent Yao (Jira)
Kent Yao created SPARK-42823:


 Summary: spark-sql shell supports multipart namespaces for 
initialization
 Key: SPARK-42823
 URL: https://issues.apache.org/jira/browse/SPARK-42823
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Kent Yao
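
Presumably this means letting the spark-sql shell's --database startup option accept a multipart name such as catalog.namespace (an assumption on my part; the ticket has no description). A hypothetical invocation under that assumption:

{code}
# Hypothetical: initialize the session in a catalog-qualified namespace
# instead of a single-part database name.
bin/spark-sql --database my_catalog.my_namespace -e "SELECT current_catalog(), current_database()"
{code}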






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42804) when target table format is textfile using `insert into select` will got error

2023-03-16 Thread kevinshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701010#comment-17701010
 ] 

kevinshin commented on SPARK-42804:
---

@[~yumwang] Below are my step-by-step instructions to reproduce this issue:
 
Hive version: HDP 3.1.0.3.1.4.0-315
 
[bigtop@hdpdev243 spark3]$ cat conf/spark-defaults.conf
# Generated by Apache Ambari. Tue Apr 27 11:19:24 2021
 
spark.sql.hive.convertMetastoreOrc true
spark.sql.orc.filterPushdown true
spark.sql.orc.impl native
spark.sql.legacy.createHiveTableByDefault false
 
[bigtop@hdpdev243 spark3]$ bin/spark-sql
23/03/16 15:03:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.materializedview.rewriting.incremental does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.server2.webui.cors.allowed.headers does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.hook.proto.base-directory does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.load.data.owner does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.service.metrics.codahale.reporter.classes does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.strict.managed.tables does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.create.as.insert.only does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.metastore.db.type does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.tez.cartesian-product.enabled does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.metastore.warehouse.external.dir does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.heapsize does not exist
23/03/16 15:03:29 WARN HiveConf: HiveConf of name hive.server2.webui.enable.cors does not exist
23/03/16 15:03:29 WARN HiveClientImpl: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
23/03/16 15:03:30 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Spark master: local[*], Application Id: local-1678950211606
spark-sql> select version();
3.2.3 b53c341e0fefbb33d115ab630369a18765b7763d
Time taken: 3.956 seconds, Fetched 1 row(s)
spark-sql> create table test.tex_t1(name string, address string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
23/03/16 15:03:51 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Time taken: 0.753 seconds
spark-sql> create table test.tex_t2(name string, address string);
Time taken: 0.326 seconds
spark-sql> insert into test.tex_t2 select 'a', 'b';
Time taken: 2.011 seconds
spark-sql> insert into test.tex_t1 select 'a', 'b';
23/03/16 15:04:13 WARN HdfsUtils: Unable to inherit permissions for file hdfs://nsdev/warehouse/tablespace/managed/hive/test.db/tex_t1/part-0-57c15f7a-7462-4101-af5d-9f4a22cf69df-c000 from file hdfs://nsdev/warehouse/tablespace/managed/hive/test.db/tex_t1
23/03/16 15:04:13 WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 24) after 5s. fireListenerEvent
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:425)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:321)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:225)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_fire_listener_event(ThriftHiveMetastore.java:4977)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.fire_listener_event(ThriftHiveMetastore.java:4964)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.fireListenerEvent(HiveMetaStoreClient.java:2296)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
at com.sun.proxy.$Proxy21.fireListenerEvent(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

[jira] [Assigned] (SPARK-42800) Implement ml function {array_to_vector, vector_to_array}

2023-03-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42800:
-

Assignee: Ruifeng Zheng

> Implement ml function {array_to_vector, vector_to_array}
> 
>
> Key: SPARK-42800
> URL: https://issues.apache.org/jira/browse/SPARK-42800
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42804) when target table format is textfile using `insert into select` will got error

2023-03-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701006#comment-17701006
 ] 

Yuming Wang commented on SPARK-42804:
-

I can't reproduce it. Did you set any configs?

> when target table format is textfile using `insert into select` will got error
> --
>
> Key: SPARK-42804
> URL: https://issues.apache.org/jira/browse/SPARK-42804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: kevinshin
>Priority: Major
>
> create table test.tex_t1(name string, address string) ROW FORMAT 
> DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
> insert into test.tex_t1 select 'a', 'b';
> produces a flood of messages like:
> WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to 
> reconnect (24 of 24) after 5s. fireListenerEvent
> org.apache.thrift.transport.TTransportException
>  
> But the data is actually written to the table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42800) Implement ml function {array_to_vector, vector_to_array}

2023-03-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42800.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40432
[https://github.com/apache/spark/pull/40432]

> Implement ml function {array_to_vector, vector_to_array}
> 
>
> Key: SPARK-42800
> URL: https://issues.apache.org/jira/browse/SPARK-42800
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>
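
For reference, a quick sketch of how the two functions are used from PySpark, assuming the long-standing pyspark.ml.functions API; making the same calls work over Spark Connect is what this ticket covers:

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.functions import array_to_vector, vector_to_array

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([([1.0, 2.0, 3.0],)], ["arr"])

# array<double> -> ml.linalg Vector, and back to array<double>
df = df.withColumn("vec", array_to_vector("arr"))
df = df.withColumn("round_trip", vector_to_array("vec"))
df.show(truncate=False)
{code}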




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42804) when target table format is textfile using `insert into select` will got error

2023-03-16 Thread kevinshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700926#comment-17700926
 ] 

kevinshin edited comment on SPARK-42804 at 3/16/23 6:47 AM:


ORC and Parquet tables do not have this problem.

Connecting to Hive directly via the Hive beeline client also works without any problem.


was (Author: JIRAUSER281772):
orc and parquet table won't have this problem.

> when target table format is textfile using `insert into select` will got error
> --
>
> Key: SPARK-42804
> URL: https://issues.apache.org/jira/browse/SPARK-42804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: kevinshin
>Priority: Major
>
> create table test.tex_t1(name string, address string) ROW FORMAT 
> DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
> insert into test.tex_t1 select 'a', 'b';
> produces a flood of messages like:
> WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to 
> reconnect (24 of 24) after 5s. fireListenerEvent
> org.apache.thrift.transport.TTransportException
>  
> But the data is actually written to the table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42693) API Auditing

2023-03-16 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701003#comment-17701003
 ] 

Xinrong Meng commented on SPARK-42693:
--

Thanks [~dongjoon], I just started it. I will keep sharing the progress.

> API Auditing
> 
>
> Key: SPARK-42693
> URL: https://issues.apache.org/jira/browse/SPARK-42693
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Blocker
>
> Audit user-facing API of Spark 3.4.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700994#comment-17700994
 ] 

Apache Spark commented on SPARK-42819:
--

User 'anishshri-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/40455

> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
> ---
>
> Key: SPARK-42819
> URL: https://issues.apache.org/jira/browse/SPARK-42819
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Anish Shrigondekar
>Priority: Major
>
> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
>  
> We need these settings in order to control memory tuning for RocksDB. We 
> already expose settings for blockCache size. However, these 2 settings are 
> missing. This change proposes to add them.
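
A hedged sketch of what tuning these knobs could look like once exposed. The two conf names ending in maxWriteBufferNumber and writeBufferSizeMB are assumptions modelled on the existing blockCacheSizeMB setting, not confirmed names from the merged change:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Use the RocksDB-backed state store for stateful streaming queries.
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    # Existing knob: RocksDB block cache size in MB.
    .config("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "64")
    # Hypothetical names for the two new knobs proposed in this ticket.
    .config("spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber", "3")
    .config("spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB", "64")
    .getOrCreate()
)
{code}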



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42819:


Assignee: Apache Spark

> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
> ---
>
> Key: SPARK-42819
> URL: https://issues.apache.org/jira/browse/SPARK-42819
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Anish Shrigondekar
>Assignee: Apache Spark
>Priority: Major
>
> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
>  
> We need these settings in order to control memory tuning for RocksDB. We 
> already expose settings for blockCache size. However, these 2 settings are 
> missing. This change proposes to add them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming

2023-03-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42819:


Assignee: (was: Apache Spark)

> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
> ---
>
> Key: SPARK-42819
> URL: https://issues.apache.org/jira/browse/SPARK-42819
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Anish Shrigondekar
>Priority: Major
>
> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
>  
> We need these settings in order to control memory tuning for RocksDB. We 
> already expose settings for blockCache size. However, these 2 settings are 
> missing. This change proposes to add them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42819) Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming

2023-03-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700993#comment-17700993
 ] 

Apache Spark commented on SPARK-42819:
--

User 'anishshri-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/40455

> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
> ---
>
> Key: SPARK-42819
> URL: https://issues.apache.org/jira/browse/SPARK-42819
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Anish Shrigondekar
>Priority: Major
>
> Add support for setting max_write_buffer_number and write_buffer_size for 
> RocksDB used in streaming
>  
> We need these settings in order to control memory tuning for RocksDB. We 
> already expose settings for blockCache size. However, these 2 settings are 
> missing. This change proposes to add them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org