[jira] [Comment Edited] (SPARK-29002) Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions

2021-05-12 Thread Penglei Shi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343740#comment-17343740
 ] 

Penglei Shi edited comment on SPARK-29002 at 5/13/21, 6:58 AM:
---

When changing an SMJ to a BHJ, an additional BroadcastQueryStageExec is created 
for the build side. The new broadcast query stage contains a 
CustomShuffleReaderExec after queryStageOptimizerRules are applied, so 
CoalesceShufflePartitions in finalStageOptimizerRules will not take effect. Am I 
correct? On the other hand, I think it is not appropriate to coalesce partitions 
for the probe side in a BHJ.

!image-2021-05-13-12-12-23-530.png|width=524,height=376!
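For context, the change tracked by this issue gates the SMJ-to-BHJ conversion on the build side's ratio of non-empty partitions. A hedged sketch of the related AQE settings (config names recalled from Spark 3.0-era documentation, not taken from this thread; verify against your Spark version):

```python
# Hedged sketch: AQE settings related to this discussion. Config names are
# recalled from Spark 3.0-era documentation; verify against your version.
aqe_confs = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # SPARK-29002: SMJ is converted to BHJ only when the build side's ratio
    # of non-empty shuffle partitions is at least this threshold.
    "spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin": "0.2",
}
```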


was (Author: penglei shi):
When changing an SMJ to a BHJ, an additional BroadcastQueryStageExec is created 
for the build side. The new broadcast query stage contains a 
CustomShuffleReaderExec after queryStageOptimizerRules are applied, so 
CoalesceShufflePartitions in finalStageOptimizerRules will not take effect. Am I 
correct?

!image-2021-05-13-12-12-23-530.png|width=524,height=376!

> Avoid changing SMJ to BHJ if the build side has a high ratio of empty 
> partitions
> 
>
> Key: SPARK-29002
> URL: https://issues.apache.org/jira/browse/SPARK-29002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2021-05-13-12-12-23-530.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29002) Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions

2021-05-12 Thread Penglei Shi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343740#comment-17343740
 ] 

Penglei Shi commented on SPARK-29002:
-

When changing an SMJ to a BHJ, an additional BroadcastQueryStageExec is created 
for the build side. The new broadcast query stage contains a 
CustomShuffleReaderExec after queryStageOptimizerRules are applied, so 
CoalesceShufflePartitions in finalStageOptimizerRules will not take effect. Am I 
correct?

!image-2021-05-13-12-12-23-530.png|width=524,height=376!




[jira] [Assigned] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35392:


Assignee: Apache Spark

> Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture
> ---
>
> Key: SPARK-35392
> URL: https://issues.apache.org/jira/browse/SPARK-35392
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> https://github.com/apache/spark/runs/2568540411
> {code}
> **
> File "/__w/spark/spark/python/pyspark/ml/clustering.py", line 276, in 
> __main__.GaussianMixture
> Failed example:
> summary.logLikelihood
> Expected:
> 65.02945...
> Got:
> 93.36008975083433
> **
> {code}






[jira] [Assigned] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35392:


Assignee: (was: Apache Spark)




[jira] [Commented] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343724#comment-17343724
 ] 

Apache Spark commented on SPARK-35392:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/32533




[jira] [Resolved] (SPARK-35382) Fix lambda variable name issues in nested DataFrame functions in Python APIs

2021-05-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35382.
--
Fix Version/s: 3.2.0, 3.1.2
Assignee: Takuya Ueshin
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/32523

> Fix lambda variable name issues in nested DataFrame functions in Python APIs
> 
>
> Key: SPARK-35382
> URL: https://issues.apache.org/jira/browse/SPARK-35382
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: Hyukjin Kwon
>Assignee: Takuya Ueshin
>Priority: Critical
>  Labels: correctness
> Fix For: 3.1.2, 3.2.0
>
>
> The Python side has the same issue as SPARK-34794:
> {code}
> from pyspark.sql.functions import *
> df = sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
> df.select(
>     transform(
>         "numbers",
>         lambda n: transform("letters", lambda l: struct(n.alias("n"), l.alias("l")))
>     )
> ).show()
> {code}
> {code}
> +--------------------+
> |transform(numbers, lambdafunction(transform(letters, lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS l), namedlambdavariable())), namedlambdavariable()))|
> +--------------------+
> |[[{a, a}, {b, b},...|
> +--------------------+
> {code}
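The failure mode can be illustrated with a plain-Python analogue (hypothetical code, not Spark internals): when nested higher-order functions reuse one generated lambda-variable name, the inner binding shadows the outer one, so the outer value is lost.

```python
# Hypothetical plain-Python analogue of the lambda-variable collision:
# nested functions share one generated variable name ("x"), so the inner
# binding shadows the outer one.
env = {}

def outer(n):
    env["x"] = n                      # outer lambda binds the shared name
    def inner(l):
        env["x"] = l                  # inner lambda rebinds the same name
        return (env["x"], l)          # the outer value is no longer visible
    return [inner(l) for l in ["a", "b", "c"]]

rows = [outer(n) for n in [1, 2, 3]]
# Each pair degenerates to (letter, letter) instead of (number, letter),
# mirroring the [{a, a}, {b, b}, ... output in the bug report.
```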






[jira] [Commented] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343722#comment-17343722
 ] 

Hyukjin Kwon commented on SPARK-35392:
--

Thanks man!




[jira] [Commented] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343721#comment-17343721
 ] 

zhengruifeng commented on SPARK-35392:
--

[~hyukjin.kwon] OK, I will send a PR




[jira] [Commented] (SPARK-35384) Improve performance for InvokeLike.invoke

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343719#comment-17343719
 ] 

Apache Spark commented on SPARK-35384:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32532

> Improve performance for InvokeLike.invoke
> -
>
> Key: SPARK-35384
> URL: https://issues.apache.org/jira/browse/SPARK-35384
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.2.0
>
>
> `InvokeLike.invoke` uses `map` to evaluate arguments:
> {code:java}
> val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
> if (needNullCheck && args.exists(_ == null)) {
>   // return null if one of arguments is null
>   null
> } else { 
> {code}
> which seems pretty expensive if the method itself is trivial. We can change 
> it to a plain for-loop.
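A hedged Python analogue of the proposed change (the actual fix is in Scala inside `InvokeLike`): replace the map-style evaluation, which allocates an intermediate collection and then rescans it for nulls, with a preallocated array filled by a plain loop that can short-circuit on the first null.

```python
def eval_args_map(arguments, row):
    # map-style: builds an intermediate list, then scans it again for nulls
    args = [e(row) for e in arguments]
    if any(a is None for a in args):
        return None  # return null if any argument is null
    return args

def eval_args_loop(arguments, row):
    # loop-style: one preallocated array, one pass, early exit on null
    args = [None] * len(arguments)
    for i, e in enumerate(arguments):
        v = e(row)
        if v is None:
            return None
        args[i] = v
    return args

# Toy "expressions" standing in for Catalyst expressions (assumed shape).
exprs = [lambda r: r["a"], lambda r: r["b"]]
```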






[jira] [Commented] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343718#comment-17343718
 ] 

Hyukjin Kwon commented on SPARK-35392:
--

[~podongfeng] would you mind making a quick fix please?




[jira] [Comment Edited] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343718#comment-17343718
 ] 

Hyukjin Kwon edited comment on SPARK-35392 at 5/13/21, 6:01 AM:


[~podongfeng] would you mind making a quick PR please?


was (Author: hyukjin.kwon):
[~podongfeng] would you mind making a quick fix please?




[jira] [Updated] (SPARK-35395) Move ORC data source options from Python and Scala into a single page

2021-05-12 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35395:

Summary: Move ORC data source options from Python and Scala into a single 
page  (was: Move ORC data source options from Python and Scala into a single pa)

> Move ORC data source options from Python and Scala into a single page
> -
>
> Key: SPARK-35395
> URL: https://issues.apache.org/jira/browse/SPARK-35395
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491






[jira] [Updated] (SPARK-35395) Move ORC data source options from Python and Scala into a single pa

2021-05-12 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35395:

Description: Refer to https://issues.apache.org/jira/browse/SPARK-34491




[jira] [Created] (SPARK-35395) Move ORC data source options from Python and Scala into a single pa

2021-05-12 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-35395:
---

 Summary: Move ORC data source options from Python and Scala into a 
single pa
 Key: SPARK-35395
 URL: https://issues.apache.org/jira/browse/SPARK-35395
 Project: Spark
  Issue Type: Sub-task
  Components: docs
Affects Versions: 3.2.0
Reporter: Haejoon Lee









[jira] [Commented] (SPARK-35394) Move kubernetes-client.version to root pom file

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343703#comment-17343703
 ] 

Apache Spark commented on SPARK-35394:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32531

> Move kubernetes-client.version to root pom file
> ---
>
> Key: SPARK-35394
> URL: https://issues.apache.org/jira/browse/SPARK-35394
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Currently, Apache Spark has two K8s client version variables in two `pom.xml` 
> files. We should unify them:
> - kubernetes.client.version (kubernetes/core module)
> - kubernetes-client.version (kubernetes/integration-test module)
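A hedged sketch of what the unification could look like (property name, version number, and placement are assumptions for illustration, not the merged patch): declare a single property in the root `pom.xml` and reference it from both modules.

```xml
<!-- root pom.xml: single source of truth (hypothetical placement) -->
<properties>
  <kubernetes-client.version>5.3.1</kubernetes-client.version> <!-- placeholder version -->
</properties>

<!-- kubernetes/core and kubernetes/integration-tests pom.xml files
     would then both reference the same property: -->
<dependency>
  <groupId>io.fabric8</groupId>
  <artifactId>kubernetes-client</artifactId>
  <version>${kubernetes-client.version}</version>
</dependency>
```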






[jira] [Assigned] (SPARK-35394) Move kubernetes-client.version to root pom file

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35394:


Assignee: Apache Spark




[jira] [Assigned] (SPARK-35394) Move kubernetes-client.version to root pom file

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35394:


Assignee: (was: Apache Spark)




[jira] [Commented] (SPARK-35394) Move kubernetes-client.version to root pom file

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343702#comment-17343702
 ] 

Apache Spark commented on SPARK-35394:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32531




[jira] [Created] (SPARK-35394) Move kubernetes-client.version to root pom file

2021-05-12 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-35394:
-

 Summary: Move kubernetes-client.version to root pom file
 Key: SPARK-35394
 URL: https://issues.apache.org/jira/browse/SPARK-35394
 Project: Spark
  Issue Type: Improvement
  Components: Build, Kubernetes
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun


Currently, Apache Spark has two K8s client version variables in two `pom.xml` 
files. We should unify them:
- kubernetes.client.version (kubernetes/core module)
- kubernetes-client.version (kubernetes/integration-test module)






[jira] [Commented] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343695#comment-17343695
 ] 

Dongjoon Hyun commented on SPARK-35392:
---

+1 for disabling, too.




[jira] [Updated] (SPARK-29002) Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions

2021-05-12 Thread Penglei Shi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Penglei Shi updated SPARK-29002:

Attachment: image-2021-05-13-12-12-23-530.png




[jira] [Commented] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343691#comment-17343691
 ] 

Sean R. Owen commented on SPARK-35392:
--

Yes, disable it. And yes, it is almost certainly because of that change; we saw 
this while testing it, but the tests did pass at the time.




[jira] [Commented] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343689#comment-17343689
 ] 

Hyukjin Kwon commented on SPARK-35392:
--

I think disabling is fine for now if it's tricky to fix.




[jira] [Resolved] (SPARK-35384) Improve performance for InvokeLike.invoke

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35384.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32527
[https://github.com/apache/spark/pull/32527]




[jira] [Assigned] (SPARK-35384) Improve performance for InvokeLike.invoke

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35384:
-

Assignee: Chao Sun




[jira] [Commented] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343684#comment-17343684
 ] 

zhengruifeng commented on SPARK-35392:
--

This GMM test is highly unstable; it tends to fail if we change the number of 
partitions, or even just change the way the sum of weights is computed.

I think we can just disable this check of {{summary.logLikelihood}} for now, 
and use another test in the future.

As for this failure, is it related to 
[https://github.com/apache/spark/pull/32415]? [~srowen]
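One concrete way to disable only that check (a sketch of the general mechanism; the actual PR may do something different) is doctest's `SKIP` directive on the flaky line:

```python
import doctest

def unstable_value():
    """
    Hypothetical stand-in for the flaky logLikelihood doctest; the SKIP
    directive makes doctest ignore this example entirely.

    >>> unstable_value()  # doctest: +SKIP
    65.02945...
    """
    return 93.36008975083433  # does not match the expected output, but is skipped

# failed stays 0 because the unstable example is never executed
result = doctest.testmod(verbose=False)
```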




[jira] [Commented] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343680#comment-17343680
 ] 

Hyukjin Kwon commented on SPARK-35392:
--

I see the first test failure at 
https://github.com/apache/spark/commit/101b0cc313cb4a6fb0027d470f313314d77bea08, 
but it does not look related at a cursory glance.




[jira] [Created] (SPARK-35393) PIP packaging test is skipped in GitHub Actions build

2021-05-12 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35393:


 Summary: PIP packaging test is skipped in GitHub Actions build
 Key: SPARK-35393
 URL: https://issues.apache.org/jira/browse/SPARK-35393
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


See https://github.com/apache/spark/runs/2568923639?check_suite_focus=true

{code}

Running PySpark packaging tests

Constructing virtual env for testing
Missing virtualenv & conda, skipping pip installability tests
Cleaning up temporary directory - /tmp/tmp.iILYWISPXW
{code}






[jira] [Commented] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343679#comment-17343679
 ] 

Hyukjin Kwon commented on SPARK-35392:
--

cc  [~ruifengz] per 
https://github.com/apache/spark/commit/111e9038d88feef63806457796f3b633f41ef32b 
and [~viirya] FYI

> Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture
> ---
>
> Key: SPARK-35392
> URL: https://issues.apache.org/jira/browse/SPARK-35392
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/2568540411
> {code}
> **
> File "/__w/spark/spark/python/pyspark/ml/clustering.py", line 276, in 
> __main__.GaussianMixture
> Failed example:
> summary.logLikelihood
> Expected:
> 65.02945...
> Got:
> 93.36008975083433
> **
> {code}






[jira] [Created] (SPARK-35392) Flaky Test: spark/spark/python/pyspark/ml/clustering.py GaussianMixture

2021-05-12 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35392:


 Summary: Flaky Test: spark/spark/python/pyspark/ml/clustering.py 
GaussianMixture
 Key: SPARK-35392
 URL: https://issues.apache.org/jira/browse/SPARK-35392
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


https://github.com/apache/spark/runs/2568540411

{code}
**
File "/__w/spark/spark/python/pyspark/ml/clustering.py", line 276, in 
__main__.GaussianMixture
Failed example:
summary.logLikelihood
Expected:
65.02945...
Got:
93.36008975083433
**
{code}






[jira] [Commented] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343670#comment-17343670
 ] 

Apache Spark commented on SPARK-35106:
--

User 'YuzhouSun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32530

> HadoopMapReduceCommitProtocol performs bad rename when dynamic partition 
> overwrite is used
> --
>
> Key: SPARK-35106
> URL: https://issues.apache.org/jira/browse/SPARK-35106
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> Recently when evaluating the code in 
> {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under 
> the {{dynamicPartitionOverwrite == true}} scenario:
> {code:language=scala}
>   // BLOCK 1
>   if (dynamicPartitionOverwrite) {
>     val absPartitionPaths = filesToMove.values.map(new Path(_).getParent).toSet
>     logDebug(s"Clean up absolute partition directories for overwriting: $absPartitionPaths")
>     absPartitionPaths.foreach(fs.delete(_, true))
>   }
>   // BLOCK 2
>   for ((src, dst) <- filesToMove) {
>     fs.rename(new Path(src), new Path(dst))
>   }
>   // BLOCK 3
>   if (dynamicPartitionOverwrite) {
>     val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
>     logDebug(s"Clean up default partition directories for overwriting: $partitionPaths")
>     for (part <- partitionPaths) {
>       val finalPartPath = new Path(path, part)
>       if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) {
>         // According to the official hadoop FileSystem API spec, delete op should assume
>         // the destination is no longer present regardless of return value, thus we do not
>         // need to double check if finalPartPath exists before rename.
>         // Also in our case, based on the spec, delete returns false only when finalPartPath
>         // does not exist. When this happens, we need to take action if parent of finalPartPath
>         // also does not exist (e.g. the scenario described on SPARK-23815), because
>         // FileSystem API spec on rename op says the rename dest (finalPartPath) must have
>         // a parent that exists, otherwise we may get unexpected result on the rename.
>         fs.mkdirs(finalPartPath.getParent)
>       }
>       fs.rename(new Path(stagingDir, part), finalPartPath)
>     }
>   }
> {code}
> Assuming {{dynamicPartitionOverwrite == true}}, we have the following 
> sequence of events:
> # Block 1 deletes all parent directories of {{filesToMove.values}}
> # Block 2 attempts to rename all {{filesToMove.keys}} to 
> {{filesToMove.values}}
> # Block 3 does directory-level renames to place files into their final 
> locations
> All renames in Block 2 will always fail, since all parent directories of 
> {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
> scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
> failure scenario, as opposed to throwing an exception. There is a separate 
> issue here that Block 2 should probably be checking for those {{false}} 
> return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", 
> albeit with a bunch of failed renames in the middle. Really, we should only 
> run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and 
> consolidate Blocks 1 and 3 to run in the {{true}} case.
> We discovered this issue when testing against a {{FileSystem}} implementation 
> which was throwing an exception for this failed rename scenario instead of 
> returning false, escalating the silent/ignored rename failures into actual 
> failures.
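
The sequence above can be modeled without a real FileSystem. A minimal Python sketch (the function name `simulate_commit` and its in-memory directory set are illustrative assumptions, not Spark or Hadoop internals) shows why, with dynamic partition overwrite, every Block 2 rename reports failure: Block 1 has already deleted each destination's parent directory, and an HDFS-style rename returns false rather than raising when the destination parent is missing.

```python
import posixpath

def simulate_commit(files_to_move, dynamic_partition_overwrite):
    """Return the Block 2 rename result for each (src, dst) pair."""
    # Directories that exist before the commit: each destination's parent.
    existing_dirs = {posixpath.dirname(dst) for dst in files_to_move.values()}
    if dynamic_partition_overwrite:
        # Block 1: delete all parent directories of the destinations.
        existing_dirs -= {posixpath.dirname(dst) for dst in files_to_move.values()}
    # Block 2: an HDFS-style rename succeeds only if the destination's
    # parent directory still exists; otherwise it returns False.
    return [posixpath.dirname(dst) in existing_dirs
            for dst in files_to_move.values()]

# simulate_commit({"/staging/f1": "/abs/p1/f1"}, True)  -> [False]
# simulate_commit({"/staging/f1": "/abs/p1/f1"}, False) -> [True]
```

This is why consolidating Blocks 1 and 3 into the dynamic-overwrite path, and running Block 2 only in the non-dynamic path, avoids the wasted (silently failing) renames.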






[jira] [Commented] (SPARK-35346) More clause needed for combining groupby and cube

2021-05-12 Thread Kai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343667#comment-17343667
 ] 

Kai commented on SPARK-35346:
-

Thanks for the reply [~hyukjin.kwon], but no. I mean in DataFrame operations 
(in PySpark, not SQL). We need a mixed form that combines group-by with cube or 
rollup.

Now:

example_dataframe.cube('xxx','xxx','xxx').agg() or 
example_dataframe.group('xxx','xxx','xxx').agg()

Improve:

example_dataframe.group('xxx','xxx','xxx',cube('xxx','xxx','xxx')).agg()

 

which is similar to this feature 
https://issues.apache.org/jira/browse/SPARK-33229 thanks [~maropu]

 

> More clause needed for combining groupby and cube
> -
>
> Key: SPARK-35346
> URL: https://issues.apache.org/jira/browse/SPARK-35346
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0, 3.0.2, 3.1.1
>Reporter: Kai
>Priority: Major
>
> As we all know, an aggregation clause must follow a groupby, rollup, or cube 
> clause in PySpark. I think we should add more features here. In SQL we can 
> write "group by xxx, xxx, cube(xxx,xxx)", while in PySpark, if you need cube 
> for one field and group-by for the others, it is not possible. Using cube for 
> all fields incurs a much higher cost for useless data. So I think we need to 
> improve it. Thank you!






[jira] [Resolved] (SPARK-35388) Allow the PR source branch to include slashes.

2021-05-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35388.
--
Fix Version/s: 3.2.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/32524

> Allow the PR source branch to include slashes. 
> ---
>
> Key: SPARK-35388
> URL: https://issues.apache.org/jira/browse/SPARK-35388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> We should allow the PR source branch to include slashes.






[jira] [Commented] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343654#comment-17343654
 ] 

Apache Spark commented on SPARK-35106:
--

User 'YuzhouSun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32529

> HadoopMapReduceCommitProtocol performs bad rename when dynamic partition 
> overwrite is used
> --
>
> Key: SPARK-35106
> URL: https://issues.apache.org/jira/browse/SPARK-35106
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> Recently when evaluating the code in 
> {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under 
> the {{dynamicPartitionOverwrite == true}} scenario:
> {code:language=scala}
>   // BLOCK 1
>   if (dynamicPartitionOverwrite) {
>     val absPartitionPaths = filesToMove.values.map(new Path(_).getParent).toSet
>     logDebug(s"Clean up absolute partition directories for overwriting: $absPartitionPaths")
>     absPartitionPaths.foreach(fs.delete(_, true))
>   }
>   // BLOCK 2
>   for ((src, dst) <- filesToMove) {
>     fs.rename(new Path(src), new Path(dst))
>   }
>   // BLOCK 3
>   if (dynamicPartitionOverwrite) {
>     val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
>     logDebug(s"Clean up default partition directories for overwriting: $partitionPaths")
>     for (part <- partitionPaths) {
>       val finalPartPath = new Path(path, part)
>       if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) {
>         // According to the official hadoop FileSystem API spec, delete op should assume
>         // the destination is no longer present regardless of return value, thus we do not
>         // need to double check if finalPartPath exists before rename.
>         // Also in our case, based on the spec, delete returns false only when finalPartPath
>         // does not exist. When this happens, we need to take action if parent of finalPartPath
>         // also does not exist (e.g. the scenario described on SPARK-23815), because
>         // FileSystem API spec on rename op says the rename dest (finalPartPath) must have
>         // a parent that exists, otherwise we may get unexpected result on the rename.
>         fs.mkdirs(finalPartPath.getParent)
>       }
>       fs.rename(new Path(stagingDir, part), finalPartPath)
>     }
>   }
> {code}
> Assuming {{dynamicPartitionOverwrite == true}}, we have the following 
> sequence of events:
> # Block 1 deletes all parent directories of {{filesToMove.values}}
> # Block 2 attempts to rename all {{filesToMove.keys}} to 
> {{filesToMove.values}}
> # Block 3 does directory-level renames to place files into their final 
> locations
> All renames in Block 2 will always fail, since all parent directories of 
> {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
> scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
> failure scenario, as opposed to throwing an exception. There is a separate 
> issue here that Block 2 should probably be checking for those {{false}} 
> return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", 
> albeit with a bunch of failed renames in the middle. Really, we should only 
> run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and 
> consolidate Blocks 1 and 3 to run in the {{true}} case.
> We discovered this issue when testing against a {{FileSystem}} implementation 
> which was throwing an exception for this failed rename scenario instead of 
> returning false, escalating the silent/ignored rename failures into actual 
> failures.






[jira] [Resolved] (SPARK-35385) skip duplicate queries in the TPCDS-related tests

2021-05-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-35385.
--
Fix Version/s: 3.2.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32520

> skip duplicate queries in the TPCDS-related tests
> -
>
> Key: SPARK-35385
> URL: https://issues.apache.org/jira/browse/SPARK-35385
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.2.0
>
>
> This ticket proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" 
> queries in the TPCDS-related tests because the TPCDS v2.7 queries have almost 
> the same ones; the only differences in these queries are ORDER BY columns.






[jira] [Assigned] (SPARK-35350) Add code-gen for left semi sort merge join

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35350:


Assignee: (was: Apache Spark)

> Add code-gen for left semi sort merge join
> --
>
> Key: SPARK-35350
> URL: https://issues.apache.org/jira/browse/SPARK-35350
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> This Jira is to track the progress to add code-gen support for left semi sort 
> merge join. See motivation in SPARK-34705.






[jira] [Commented] (SPARK-35350) Add code-gen for left semi sort merge join

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343638#comment-17343638
 ] 

Apache Spark commented on SPARK-35350:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/32528

> Add code-gen for left semi sort merge join
> --
>
> Key: SPARK-35350
> URL: https://issues.apache.org/jira/browse/SPARK-35350
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> This Jira is to track the progress to add code-gen support for left semi sort 
> merge join. See motivation in SPARK-34705.






[jira] [Assigned] (SPARK-35350) Add code-gen for left semi sort merge join

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35350:


Assignee: Apache Spark

> Add code-gen for left semi sort merge join
> --
>
> Key: SPARK-35350
> URL: https://issues.apache.org/jira/browse/SPARK-35350
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
>
> This Jira is to track the progress to add code-gen support for left semi sort 
> merge join. See motivation in SPARK-34705.






[jira] [Commented] (SPARK-35371) Scala UDF returning string or complex type applied to array members returns wrong data

2021-05-12 Thread David Benedeki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343636#comment-17343636
 ] 

David Benedeki commented on SPARK-35371:


I can confirm that with Spark 3.1.2-SNAPSHOT the issue is gone. Thank you. (y)

> Scala UDF returning string or complex type applied to array members returns 
> wrong data
> --
>
> Key: SPARK-35371
> URL: https://issues.apache.org/jira/browse/SPARK-35371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: David Benedeki
>Priority: Major
>
> When using a UDF returning a string or complex type (struct) on array members, 
> every element of the resulting array holds the UDF result for the last array member.
> h3. *Example code:*
> {code:scala}
> import org.apache.spark.sql.{Column, SparkSession}
> import org.apache.spark.sql.functions.{callUDF, col, transform, udf}
> val sparkBuilder: SparkSession.Builder = SparkSession.builder()
>   .master("local[*]")
>   .appName(s"Udf Bug Demo")
>   .config("spark.ui.enabled", "false")
>   .config("spark.debug.maxToStringFields", 100)
> val spark: SparkSession = sparkBuilder
>   .config("spark.driver.bindAddress", "127.0.0.1")
>   .config("spark.driver.host", "127.0.0.1")
>   .getOrCreate()
> import spark.implicits._
> case class Foo(num: Int, s: String)
> val src  = Seq(
>   (1, 2, Array(1, 2, 3)),
>   (2, 2, Array(2, 2, 2)),
>   (3, 4, Array(3, 4, 3, 4))
> ).toDF("A", "B", "C")
> val udfStringName = "UdfString"
> val udfIntName = "UdfInt"
> val udfStructName = "UdfStruct"
> val udfString = udf((num: Int) => {
>   (num + 1).toString
> })
> spark.udf.register(udfStringName, udfString)
> val udfInt = udf((num: Int) => {
>   num + 1
> })
> spark.udf.register(udfIntName, udfInt)
> val udfStruct = udf((num: Int) => {
>   Foo(num + 1, (num + 1).toString)
> })
> spark.udf.register(udfStructName, udfStruct)
> val lambdaString = (forCol: Column) => callUDF(udfStringName, forCol)
> val lambdaInt = (forCol: Column) => callUDF(udfIntName, forCol)
> val lambdaStruct = (forCol: Column) => callUDF(udfStructName, forCol)
> val cA = callUDF(udfStringName, col("A"))
> val cB = callUDF(udfStringName, col("B"))
> val cCString: Column = transform(col("C"), lambdaString)
> val cCInt: Column = transform(col("C"), lambdaInt)
> val cCStruc: Column = transform(col("C"), lambdaStruct)
> val dest = src.withColumn("AStr", cA)
>   .withColumn("BStr", cB)
>   .withColumn("CString (Wrong)", cCString)
>   .withColumn("CInt (OK)", cCInt)
>   .withColumn("CStruct (Wrong)", cCStruc)
> dest.show(false)
> dest.printSchema()
> {code}
> h3. *Expected:*
> {noformat}
> +---+---+------------+----+----+------------+------------+--------------------------------+
> |A  |B  |C           |AStr|BStr|CString     |CInt        |CStruct                         |
> +---+---+------------+----+----+------------+------------+--------------------------------+
> |1  |2  |[1, 2, 3]   |2   |3   |[2, 3, 4]   |[2, 3, 4]   |[{2, 2}, {3, 3}, {4, 4}]        |
> |2  |2  |[2, 2, 2]   |3   |3   |[3, 3, 3]   |[3, 3, 3]   |[{3, 3}, {3, 3}, {3, 3}]        |
> |3  |4  |[3, 4, 3, 4]|4   |5   |[4, 5, 4, 5]|[4, 5, 4, 5]|[{4, 4}, {5, 5}, {4, 4}, {5, 5}]|
> +---+---+------------+----+----+------------+------------+--------------------------------+
> {noformat}
> h3. *Got:*
> {noformat}
> +---+---+------------+----+----+---------------+------------+--------------------------------+
> |A  |B  |C           |AStr|BStr|CString (Wrong)|CInt (Ok)   |CStruct (Wrong)                 |
> +---+---+------------+----+----+---------------+------------+--------------------------------+
> |1  |2  |[1, 2, 3]   |2   |3   |[4, 4, 4]      |[2, 3, 4]   |[{4, 4}, {4, 4}, {4, 4}]        |
> |2  |2  |[2, 2, 2]   |3   |3   |[3, 3, 3]      |[3, 3, 3]   |[{3, 3}, {3, 3}, {3, 3}]        |
> |3  |4  |[3, 4, 3, 4]|4   |5   |[5, 5, 5, 5]   |[4, 5, 4, 5]|[{5, 5}, {5, 5}, {5, 5}, {5, 5}]|
> +---+---+------------+----+----+---------------+------------+--------------------------------+
> {noformat}
> h3. *Observation*
>  * Works correctly on Spark 3.0.2
>  * When the UDF is registered as a Java UDF, it works as expected
>  * The UDF is called the appropriate number of times (regardless of whether 
> the UDF is marked as deterministic or non-deterministic)
>  * When debugged, the correct value is initially saved into the result array, 
> but processing each subsequent item overwrites the previous result values as 
> well; therefore the array ends up filled with the last item's values
>  * When the UDF returns NULL/None, it neither overwrites the prior array 
> values nor is overwritten by subsequent non-NULL values. See the following 
> UDF implementation:
> {code:scala}
> val udfString = udf((

[jira] [Assigned] (SPARK-35384) Improve performance for InvokeLike.invoke

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35384:


Assignee: Apache Spark

> Improve performance for InvokeLike.invoke
> -
>
> Key: SPARK-35384
> URL: https://issues.apache.org/jira/browse/SPARK-35384
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Minor
>
> `InvokeLike.invoke` uses `map` to evaluate arguments:
> {code:scala}
> val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
> if (needNullCheck && args.exists(_ == null)) {
>   // return null if one of arguments is null
>   null
> } else { 
> {code}
> which seems pretty expensive if the method itself is trivial. We can change 
> it to a plain for-loop.
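
A language-agnostic sketch of the proposed change (Python here for illustration; Spark's actual code is Scala, and the names below are assumptions): evaluate the arguments in a single loop into a pre-sized buffer and short-circuit on the first null, instead of materializing an intermediate collection with `map` and then scanning it for nulls.

```python
def invoke_args(arguments, row, need_null_check):
    """Evaluate argument expressions against a row in one pass."""
    args = [None] * len(arguments)          # pre-sized result buffer
    for i, expr in enumerate(arguments):
        v = expr(row)
        if need_null_check and v is None:
            return None                     # short-circuit: skip the rest
        args[i] = v
    return args
```

Compared to the `map`-then-`exists` version, this avoids allocating an intermediate collection and stops evaluating as soon as a null is found.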






[jira] [Assigned] (SPARK-35384) Improve performance for InvokeLike.invoke

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35384:


Assignee: (was: Apache Spark)

> Improve performance for InvokeLike.invoke
> -
>
> Key: SPARK-35384
> URL: https://issues.apache.org/jira/browse/SPARK-35384
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> `InvokeLike.invoke` uses `map` to evaluate arguments:
> {code:scala}
> val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
> if (needNullCheck && args.exists(_ == null)) {
>   // return null if one of arguments is null
>   null
> } else { 
> {code}
> which seems pretty expensive if the method itself is trivial. We can change 
> it to a plain for-loop.






[jira] [Commented] (SPARK-35384) Improve performance for InvokeLike.invoke

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343631#comment-17343631
 ] 

Apache Spark commented on SPARK-35384:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32527

> Improve performance for InvokeLike.invoke
> -
>
> Key: SPARK-35384
> URL: https://issues.apache.org/jira/browse/SPARK-35384
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> `InvokeLike.invoke` uses `map` to evaluate arguments:
> {code:java}
> val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
> if (needNullCheck && args.exists(_ == null)) {
>   // return null if one of arguments is null
>   null
> } else { 
> {code}
> which seems pretty expensive if the method itself is trivial. We can change 
> it to a plain for-loop.






[jira] [Issue Comment Deleted] (SPARK-35391) Memory leak in ExecutorAllocationListener breaks dynamic allocation under high load

2021-05-12 Thread Vasily Kolpakov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasily Kolpakov updated SPARK-35391:

Comment: was deleted

(was: https://github.com/apache/spark/pull/32526)

> Memory leak in ExecutorAllocationListener breaks dynamic allocation under 
> high load
> ---
>
> Key: SPARK-35391
> URL: https://issues.apache.org/jira/browse/SPARK-35391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Vasily Kolpakov
>Priority: Major
>
> ExecutorAllocationListener doesn't clean up data properly. 
> ExecutorAllocationListener performs progressively slower and eventually fails 
> to process events in time.
> There are two problems:
>  * a bug (typo?) in totalRunningTasksPerResourceProfile() method
>  getOrElseUpdate() is used instead of getOrElse().
>  If spark-dynamic-executor-allocation thread calls schedule() after a 
> SparkListenerTaskEnd event for the last task in a stage
>  but before SparkListenerStageCompleted event for the stage, then 
> stageAttemptToNumRunningTask will not be cleaned up properly.
>  * resourceProfileIdToStageAttempt clean-up is broken
>  If a SparkListenerTaskEnd event for the last task in a stage was processed 
> before SparkListenerStageCompleted for that stage,
>  then resourceProfileIdToStageAttempt will not be cleaned up properly.
>  
> Bugs were introduced in this commit: 
> https://github.com/apache/spark/commit/496f6ac86001d284cbfb7488a63dd3a168919c0f
>  .
> Steps to reproduce:
>  # Launch standalone master and worker with 
> 'spark.shuffle.service.enabled=true'
>  # Run spark-shell with --conf 'spark.shuffle.service.enabled=true' --conf 
> 'spark.dynamicAllocation.enabled=true' and paste this script
> {code:java}
> for (_ <- 0 until 10) {
> Seq(1, 2, 3, 4, 5).toDF.repartition(100).agg("value" -> "sum").show()
> }
> {code}
>  # make a heap dump and examine 
> ExecutorAllocationListener.totalRunningTasksPerResourceProfile and 
> ExecutorAllocationListener.resourceProfileIdToStageAttempt fields
> Expected: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) are empty
> Actual: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) contain 
> non-relevant data
>  






[jira] [Commented] (SPARK-35391) Memory leak in ExecutorAllocationListener breaks dynamic allocation under high load

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343581#comment-17343581
 ] 

Apache Spark commented on SPARK-35391:
--

User 'VasilyKolpakov' has created a pull request for this issue:
https://github.com/apache/spark/pull/32526

> Memory leak in ExecutorAllocationListener breaks dynamic allocation under 
> high load
> ---
>
> Key: SPARK-35391
> URL: https://issues.apache.org/jira/browse/SPARK-35391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Vasily Kolpakov
>Priority: Major
>
> ExecutorAllocationListener doesn't clean up data properly. 
> ExecutorAllocationListener performs progressively slower and eventually fails 
> to process events in time.
> There are two problems:
>  * a bug (typo?) in totalRunningTasksPerResourceProfile() method
>  getOrElseUpdate() is used instead of getOrElse().
>  If spark-dynamic-executor-allocation thread calls schedule() after a 
> SparkListenerTaskEnd event for the last task in a stage
>  but before SparkListenerStageCompleted event for the stage, then 
> stageAttemptToNumRunningTask will not be cleaned up properly.
>  * resourceProfileIdToStageAttempt clean-up is broken
>  If a SparkListenerTaskEnd event for the last task in a stage was processed 
> before SparkListenerStageCompleted for that stage,
>  then resourceProfileIdToStageAttempt will not be cleaned up properly.
>  
> Bugs were introduced in this commit: 
> https://github.com/apache/spark/commit/496f6ac86001d284cbfb7488a63dd3a168919c0f
>  .
> Steps to reproduce:
>  # Launch standalone master and worker with 
> 'spark.shuffle.service.enabled=true'
>  # Run spark-shell with --conf 'spark.shuffle.service.enabled=true' --conf 
> 'spark.dynamicAllocation.enabled=true' and paste this script
> {code:java}
> for (_ <- 0 until 10) {
> Seq(1, 2, 3, 4, 5).toDF.repartition(100).agg("value" -> "sum").show()
> }
> {code}
>  # make a heap dump and examine 
> ExecutorAllocationListener.totalRunningTasksPerResourceProfile and 
> ExecutorAllocationListener.resourceProfileIdToStageAttempt fields
> Expected: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) are empty
> Actual: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) contain 
> non-relevant data
>  






[jira] [Assigned] (SPARK-35391) Memory leak in ExecutorAllocationListener breaks dynamic allocation under high load

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35391:


Assignee: Apache Spark

> Memory leak in ExecutorAllocationListener breaks dynamic allocation under 
> high load
> ---
>
> Key: SPARK-35391
> URL: https://issues.apache.org/jira/browse/SPARK-35391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Vasily Kolpakov
>Assignee: Apache Spark
>Priority: Major
>
> ExecutorAllocationListener doesn't clean up data properly. 
> ExecutorAllocationListener performs progressively slower and eventually fails 
> to process events in time.
> There are two problems:
>  * a bug (typo?) in totalRunningTasksPerResourceProfile() method
>  getOrElseUpdate() is used instead of getOrElse().
>  If spark-dynamic-executor-allocation thread calls schedule() after a 
> SparkListenerTaskEnd event for the last task in a stage
>  but before SparkListenerStageCompleted event for the stage, then 
> stageAttemptToNumRunningTask will not be cleaned up properly.
>  * resourceProfileIdToStageAttempt clean-up is broken
>  If a SparkListenerTaskEnd event for the last task in a stage was processed 
> before SparkListenerStageCompleted for that stage,
>  then resourceProfileIdToStageAttempt will not be cleaned up properly.
>  
> Bugs were introduced in this commit: 
> https://github.com/apache/spark/commit/496f6ac86001d284cbfb7488a63dd3a168919c0f
>  .
> Steps to reproduce:
>  # Launch standalone master and worker with 
> 'spark.shuffle.service.enabled=true'
>  # Run spark-shell with --conf 'spark.shuffle.service.enabled=true' --conf 
> 'spark.dynamicAllocation.enabled=true' and paste this script
> {code:java}
> for (_ <- 0 until 10) {
> Seq(1, 2, 3, 4, 5).toDF.repartition(100).agg("value" -> "sum").show()
> }
> {code}
>  # make a heap dump and examine 
> ExecutorAllocationListener.totalRunningTasksPerResourceProfile and 
> ExecutorAllocationListener.resourceProfileIdToStageAttempt fields
> Expected: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) are empty
> Actual: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) contain 
> non-relevant data
>  






[jira] [Assigned] (SPARK-35391) Memory leak in ExecutorAllocationListener breaks dynamic allocation under high load

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35391:


Assignee: (was: Apache Spark)

> Memory leak in ExecutorAllocationListener breaks dynamic allocation under 
> high load
> ---
>
> Key: SPARK-35391
> URL: https://issues.apache.org/jira/browse/SPARK-35391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Vasily Kolpakov
>Priority: Major
>
> ExecutorAllocationListener doesn't clean up data properly. 
> ExecutorAllocationListener performs progressively slower and eventually fails 
> to process events in time.
> There are two problems:
>  * a bug (typo?) in totalRunningTasksPerResourceProfile() method
>  getOrElseUpdate() is used instead of getOrElse().
>  If spark-dynamic-executor-allocation thread calls schedule() after a 
> SparkListenerTaskEnd event for the last task in a stage
>  but before SparkListenerStageCompleted event for the stage, then 
> stageAttemptToNumRunningTask will not be cleaned up properly.
>  * resourceProfileIdToStageAttempt clean-up is broken
>  If a SparkListenerTaskEnd event for the last task in a stage was processed 
> before SparkListenerStageCompleted for that stage,
>  then resourceProfileIdToStageAttempt will not be cleaned up properly.
>  
> Bugs were introduced in this commit: 
> https://github.com/apache/spark/commit/496f6ac86001d284cbfb7488a63dd3a168919c0f
>  .
> Steps to reproduce:
>  # Launch standalone master and worker with 
> 'spark.shuffle.service.enabled=true'
>  # Run spark-shell with --conf 'spark.shuffle.service.enabled=true' --conf 
> 'spark.dynamicAllocation.enabled=true' and paste this script
> {code:java}
> for (_ <- 0 until 10) {
> Seq(1, 2, 3, 4, 5).toDF.repartition(100).agg("value" -> "sum").show()
> }
> {code}
>  # make a heap dump and examine 
> ExecutorAllocationListener.totalRunningTasksPerResourceProfile and 
> ExecutorAllocationListener.resourceProfileIdToStageAttempt fields
> Expected: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) are empty
> Actual: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) contain 
> non-relevant data
>  






[jira] [Commented] (SPARK-35391) Memory leak in ExecutorAllocationListener breaks dynamic allocation under high load

2021-05-12 Thread Vasily Kolpakov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343579#comment-17343579
 ] 

Vasily Kolpakov commented on SPARK-35391:
-

https://github.com/apache/spark/pull/32526

> Memory leak in ExecutorAllocationListener breaks dynamic allocation under 
> high load
> ---
>
> Key: SPARK-35391
> URL: https://issues.apache.org/jira/browse/SPARK-35391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Vasily Kolpakov
>Priority: Major
>
> ExecutorAllocationListener doesn't clean up data properly. 
> ExecutorAllocationListener performs progressively slower and eventually fails 
> to process events in time.
> There are two problems:
>  * a bug (typo?) in totalRunningTasksPerResourceProfile() method
>  getOrElseUpdate() is used instead of getOrElse().
>  If spark-dynamic-executor-allocation thread calls schedule() after a 
> SparkListenerTaskEnd event for the last task in a stage
>  but before SparkListenerStageCompleted event for the stage, then 
> stageAttemptToNumRunningTask will not be cleaned up properly.
>  * resourceProfileIdToStageAttempt clean-up is broken
>  If a SparkListenerTaskEnd event for the last task in a stage was processed 
> before SparkListenerStageCompleted for that stage,
>  then resourceProfileIdToStageAttempt will not be cleaned up properly.
>  
> Bugs were introduced in this commit: 
> https://github.com/apache/spark/commit/496f6ac86001d284cbfb7488a63dd3a168919c0f
>  .
> Steps to reproduce:
>  # Launch standalone master and worker with 
> 'spark.shuffle.service.enabled=true'
>  # Run spark-shell with --conf 'spark.shuffle.service.enabled=true' --conf 
> 'spark.dynamicAllocation.enabled=true' and paste this script
> {code:java}
> for (_ <- 0 until 10) {
> Seq(1, 2, 3, 4, 5).toDF.repartition(100).agg("value" -> "sum").show()
> }
> {code}
>  # make a heap dump and examine 
> ExecutorAllocationListener.totalRunningTasksPerResourceProfile and 
> ExecutorAllocationListener.resourceProfileIdToStageAttempt fields
> Expected: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) are empty
> Actual: totalRunningTasksPerResourceProfile and 
> resourceProfileIdToStageAttempt(defaultResourceProfileId) contain 
> non-relevant data
>  






[jira] [Created] (SPARK-35391) Memory leak in ExecutorAllocationListener breaks dynamic allocation under high load

2021-05-12 Thread Vasily Kolpakov (Jira)
Vasily Kolpakov created SPARK-35391:
---

 Summary: Memory leak in ExecutorAllocationListener breaks dynamic 
allocation under high load
 Key: SPARK-35391
 URL: https://issues.apache.org/jira/browse/SPARK-35391
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.1
Reporter: Vasily Kolpakov


ExecutorAllocationListener doesn't clean up data properly. 
ExecutorAllocationListener performs progressively slower and eventually fails 
to process events in time.

There are two problems:
 * a bug (typo?) in the totalRunningTasksPerResourceProfile() method: 
 getOrElseUpdate() is used instead of getOrElse().
 If spark-dynamic-executor-allocation thread calls schedule() after a 
SparkListenerTaskEnd event for the last task in a stage
 but before SparkListenerStageCompleted event for the stage, then 
stageAttemptToNumRunningTask will not be cleaned up properly.
 * resourceProfileIdToStageAttempt clean-up is broken
 If a SparkListenerTaskEnd event for the last task in a stage was processed 
before SparkListenerStageCompleted for that stage,
 then resourceProfileIdToStageAttempt will not be cleaned up properly.
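The getOrElseUpdate/getOrElse distinction is easy to see in isolation (illustrative sketch only, not Spark's actual listener code): a mere read through getOrElseUpdate mutates the map, so a late schedule() call can resurrect an entry the stage-completed handler already removed.

```scala
import scala.collection.mutable

object LeakSketch {
  val stageAttemptToNumRunningTask = mutable.HashMap.empty[Int, Int]

  // Buggy accessor: querying a missing key INSERTS it with the 0 default.
  def totalBuggy(stageAttempt: Int): Int =
    stageAttemptToNumRunningTask.getOrElseUpdate(stageAttempt, 0)

  // Fixed accessor: reads without mutating the map.
  def totalFixed(stageAttempt: Int): Int =
    stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0)
}
```

With the buggy accessor, every stage attempt queried after clean-up leaks back into the map, which is why the listener slows down progressively under high load.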

 

Bugs were introduced in this commit: 
https://github.com/apache/spark/commit/496f6ac86001d284cbfb7488a63dd3a168919c0f 
.

Steps to reproduce:
 # Launch standalone master and worker with 'spark.shuffle.service.enabled=true'
 # Run spark-shell with --conf 'spark.shuffle.service.enabled=true' --conf 
'spark.dynamicAllocation.enabled=true' and paste this script
{code:java}
for (_ <- 0 until 10) {
Seq(1, 2, 3, 4, 5).toDF.repartition(100).agg("value" -> "sum").show()
}
{code}

 # make a heap dump and examine 
ExecutorAllocationListener.totalRunningTasksPerResourceProfile and 
ExecutorAllocationListener.resourceProfileIdToStageAttempt fields

Expected: totalRunningTasksPerResourceProfile and 
resourceProfileIdToStageAttempt(defaultResourceProfileId) are empty
Actual: totalRunningTasksPerResourceProfile and 
resourceProfileIdToStageAttempt(defaultResourceProfileId) contain non-relevant 
data

 






[jira] [Updated] (SPARK-35369) Document ExecutorAllocationManager metrics

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35369:
--
Affects Version/s: (was: 3.1.2)

> Document ExecutorAllocationManager metrics
> --
>
> Key: SPARK-35369
> URL: https://issues.apache.org/jira/browse/SPARK-35369
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.2.0
>
>
> The ExecutorAllocationManager is instrumented with metrics using the Spark 
> metrics system.
> The relevant work is in SPARK-7007 and SPARK-33763
> This proposes to document the available metrics.






[jira] [Assigned] (SPARK-35369) Document ExecutorAllocationManager metrics

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35369:
-

Assignee: Luca Canali

> Document ExecutorAllocationManager metrics
> --
>
> Key: SPARK-35369
> URL: https://issues.apache.org/jira/browse/SPARK-35369
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>
> The ExecutorAllocationManager is instrumented with metrics using the Spark 
> metrics system.
> The relevant work is in SPARK-7007 and SPARK-33763
> This proposes to document the available metrics.






[jira] [Resolved] (SPARK-35369) Document ExecutorAllocationManager metrics

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35369.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32500
[https://github.com/apache/spark/pull/32500]

> Document ExecutorAllocationManager metrics
> --
>
> Key: SPARK-35369
> URL: https://issues.apache.org/jira/browse/SPARK-35369
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.2.0
>
>
> The ExecutorAllocationManager is instrumented with metrics using the Spark 
> metrics system.
> The relevant work is in SPARK-7007 and SPARK-33763
> This proposes to document the available metrics.






[jira] [Commented] (SPARK-35388) Allow the PR source branch to include slashes.

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343536#comment-17343536
 ] 

Apache Spark commented on SPARK-35388:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/32525

> Allow the PR source branch to include slashes. 
> ---
>
> Key: SPARK-35388
> URL: https://issues.apache.org/jira/browse/SPARK-35388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> We should allow the PR source branch to include slashes.






[jira] [Commented] (SPARK-35388) Allow the PR source branch to include slashes.

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343535#comment-17343535
 ] 

Apache Spark commented on SPARK-35388:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/32525

> Allow the PR source branch to include slashes. 
> ---
>
> Key: SPARK-35388
> URL: https://issues.apache.org/jira/browse/SPARK-35388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> We should allow the PR source branch to include slashes.






[jira] [Assigned] (SPARK-35388) Allow the PR source branch to include slashes.

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35388:


Assignee: Apache Spark

> Allow the PR source branch to include slashes. 
> ---
>
> Key: SPARK-35388
> URL: https://issues.apache.org/jira/browse/SPARK-35388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> We should allow the PR source branch to include slashes.






[jira] [Assigned] (SPARK-35388) Allow the PR source branch to include slashes.

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35388:


Assignee: (was: Apache Spark)

> Allow the PR source branch to include slashes. 
> ---
>
> Key: SPARK-35388
> URL: https://issues.apache.org/jira/browse/SPARK-35388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> We should allow the PR source branch to include slashes.






[jira] [Commented] (SPARK-35388) Allow the PR source branch to include slashes.

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343532#comment-17343532
 ] 

Apache Spark commented on SPARK-35388:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/32524

> Allow the PR source branch to include slashes. 
> ---
>
> Key: SPARK-35388
> URL: https://issues.apache.org/jira/browse/SPARK-35388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> We should allow the PR source branch to include slashes.






[jira] [Assigned] (SPARK-35013) Spark allows to set spark.driver.cores=0

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35013:
-

Assignee: shahid

> Spark allows to set spark.driver.cores=0
> 
>
> Key: SPARK-35013
> URL: https://issues.apache.org/jira/browse/SPARK-35013
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.1.1
>Reporter: Oleg Lypkan
>Assignee: shahid
>Priority: Minor
> Fix For: 3.2.0
>
>
> I found an inconsistency in [validation logic of Spark submit arguments 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L248-L258]that
>  allows *spark.driver.cores* value to be set to 0 but requires 
> *spark.driver.memory,* *spark.executor.cores, spark.executor.memory* to be 
> positive numbers:
> {quote}Exception in thread "main" org.apache.spark.SparkException: Driver 
> memory must be a positive number
>  Exception in thread "main" org.apache.spark.SparkException: Executor cores 
> must be a positive number
>  Exception in thread "main" org.apache.spark.SparkException: Executor memory 
> must be a positive number
> {quote}
> I would like to understand whether there is a reason for this inconsistency 
> in the validation logic or whether it is a bug.
> Thank you
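A hedged sketch of the uniform check the reporter asks for (illustrative only; Spark's real validation lives in SparkSubmitArguments, and the helper names below are invented for this example):

```scala
object ValidateSketch {
  // require throws IllegalArgumentException with the given message on failure
  def requirePositive(name: String, value: Long): Unit =
    require(value > 0, s"$name must be a positive number")

  def validate(driverCores: Long, driverMemory: Long,
               executorCores: Long, executorMemory: Long): Unit = {
    requirePositive("Driver cores", driverCores) // the check the report finds missing
    requirePositive("Driver memory", driverMemory)
    requirePositive("Executor cores", executorCores)
    requirePositive("Executor memory", executorMemory)
  }
}
```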






[jira] [Resolved] (SPARK-35013) Spark allows to set spark.driver.cores=0

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35013.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32504
[https://github.com/apache/spark/pull/32504]

> Spark allows to set spark.driver.cores=0
> 
>
> Key: SPARK-35013
> URL: https://issues.apache.org/jira/browse/SPARK-35013
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.1.1
>Reporter: Oleg Lypkan
>Priority: Minor
> Fix For: 3.2.0
>
>
> I found an inconsistency in [validation logic of Spark submit arguments 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L248-L258]that
>  allows *spark.driver.cores* value to be set to 0 but requires 
> *spark.driver.memory,* *spark.executor.cores, spark.executor.memory* to be 
> positive numbers:
> {quote}Exception in thread "main" org.apache.spark.SparkException: Driver 
> memory must be a positive number
>  Exception in thread "main" org.apache.spark.SparkException: Executor cores 
> must be a positive number
>  Exception in thread "main" org.apache.spark.SparkException: Executor memory 
> must be a positive number
> {quote}
> I would like to understand whether there is a reason for this inconsistency 
> in the validation logic or whether it is a bug.
> Thank you






[jira] [Created] (SPARK-35390) Handle type coercion when resolving V2 functions

2021-05-12 Thread Chao Sun (Jira)
Chao Sun created SPARK-35390:


 Summary: Handle type coercion when resolving V2 functions 
 Key: SPARK-35390
 URL: https://issues.apache.org/jira/browse/SPARK-35390
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


When resolving V2 functions, we should handle type coercion by checking the 
expected argument types from the UDF function.
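The coercion step can be sketched in miniature (illustrative only: the type names and `coerce` helper are invented; the analyzer would insert Cast expressions over the real Catalyst types):

```scala
object CoercionSketch {
  // Toy stand-ins for Catalyst data types
  sealed trait DType
  case object IntT extends DType
  case object DoubleT extends DType

  // Widen an argument to the type the function declares, if needed.
  def coerce(arg: Any, expected: DType): Any = (arg, expected) match {
    case (i: Int, DoubleT) => i.toDouble // implicit widening cast
    case _                 => arg        // already the expected type
  }

  // Coerce each argument against the function's declared input types.
  def coerceAll(args: Seq[Any], expected: Seq[DType]): Seq[Any] =
    args.zip(expected).map { case (a, t) => coerce(a, t) }
}
```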






[jira] [Created] (SPARK-35389) Analyzer should set propagateNull to false for magic function invocation

2021-05-12 Thread Chao Sun (Jira)
Chao Sun created SPARK-35389:


 Summary: Analyzer should set propagateNull to false for magic 
function invocation
 Key: SPARK-35389
 URL: https://issues.apache.org/jira/browse/SPARK-35389
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


For both {{Invoke}} and {{StaticInvoke}} used by the magic method of 
{{ScalarFunction}}, we should set {{propagateNull}} to false, so that null 
values are passed to the UDF for evaluation instead of bypassing it and 
directly returning null.
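A tiny sketch of the two behaviours (names are illustrative; Spark's actual flag lives on the Invoke/StaticInvoke expressions): with the flag on, the invocation short-circuits to null whenever any argument is null; with it off, the nulls reach the function, which is what a magic-method ScalarFunction needs.

```scala
object PropagateNullSketch {
  // propagateNull = true: bypass the function entirely on any null argument.
  // propagateNull = false: hand the nulls to the function and let it decide.
  def invoke(f: Seq[Any] => Any, args: Seq[Any], propagateNull: Boolean): Any =
    if (propagateNull && args.contains(null)) null
    else f(args)
}
```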






[jira] [Created] (SPARK-35388) Allow the PR source branch to include slashes.

2021-05-12 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-35388:
-

 Summary: Allow the PR source branch to include slashes. 
 Key: SPARK-35388
 URL: https://issues.apache.org/jira/browse/SPARK-35388
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Takuya Ueshin


We should allow the PR source branch to include slashes.






[jira] [Commented] (SPARK-35382) Fix lambda variable name issues in nested DataFrame functions in Python APIs

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343484#comment-17343484
 ] 

Apache Spark commented on SPARK-35382:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/32523

> Fix lambda variable name issues in nested DataFrame functions in Python APIs
> 
>
> Key: SPARK-35382
> URL: https://issues.apache.org/jira/browse/SPARK-35382
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
>
> Python side also has the same issue as SPARK-34794
> {code}
> from pyspark.sql.functions import *
> df = sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
> df.select(
> transform(
> "numbers",
> lambda n: transform("letters", lambda l: struct(n.alias("n"), 
> l.alias("l")))
> )
> ).show()
> {code}
> {code}
> ++
> |transform(numbers, lambdafunction(transform(letters, lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS l), namedlambdavariable())), namedlambdavariable()))|
> ++
> |[[{a, a}, {b, b},...|
> ++
> {code}






[jira] [Assigned] (SPARK-35382) Fix lambda variable name issues in nested DataFrame functions in Python APIs

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35382:


Assignee: (was: Apache Spark)

> Fix lambda variable name issues in nested DataFrame functions in Python APIs
> 
>
> Key: SPARK-35382
> URL: https://issues.apache.org/jira/browse/SPARK-35382
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
>
> The Python side also has the same issue as SPARK-34794:
> {code}
> from pyspark.sql.functions import *
> df = sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
> df.select(
>     transform(
>         "numbers",
>         lambda n: transform("letters", lambda l: struct(n.alias("n"), l.alias("l")))
>     )
> ).show()
> {code}
> {code}
> +---...---+
> |transform(numbers, lambdafunction(transform(letters, lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS l), namedlambdavariable())), namedlambdavariable()))|
> +---...---+
> |[[{a, a}, {b, b},...|
> +---...---+
> {code}






[jira] [Assigned] (SPARK-35382) Fix lambda variable name issues in nested DataFrame functions in Python APIs

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35382:


Assignee: Apache Spark

> Fix lambda variable name issues in nested DataFrame functions in Python APIs
> 
>
> Key: SPARK-35382
> URL: https://issues.apache.org/jira/browse/SPARK-35382
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Critical
>  Labels: correctness
>
> The Python side also has the same issue as SPARK-34794:
> {code}
> from pyspark.sql.functions import *
> df = sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
> df.select(
>     transform(
>         "numbers",
>         lambda n: transform("letters", lambda l: struct(n.alias("n"), l.alias("l")))
>     )
> ).show()
> {code}
> {code}
> +---...---+
> |transform(numbers, lambdafunction(transform(letters, lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS l), namedlambdavariable())), namedlambdavariable()))|
> +---...---+
> |[[{a, a}, {b, b},...|
> +---...---+
> {code}






[jira] [Assigned] (SPARK-35383) Improve s3a magic committer support by inferring missing configs

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35383:
-

Assignee: Dongjoon Hyun

> Improve s3a magic committer support by inferring missing configs
> 
>
> Key: SPARK-35383
> URL: https://issues.apache.org/jira/browse/SPARK-35383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> {code}
> Exception in thread "main" org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://my-spark-bucket`: Filesystem does not have support for 'magic' committer enabled in configuration option fs.s3a.committer.magic.enabled
>   at org.apache.hadoop.fs.s3a.commit.CommitUtils.verifyIsMagicCommitFS(CommitUtils.java:74)
>   at org.apache.hadoop.fs.s3a.commit.CommitUtils.getS3AFileSystem(CommitUtils.java:109)
> {code}
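For reference, before this improvement the magic committer typically had to be wired up by hand with several related options. A sketch of the usual set follows (option names per the Hadoop S3A committer and spark-hadoop-cloud documentation; treat the exact values as assumptions for your Hadoop/Spark versions):

```properties
# Enable the S3A magic committer and route Spark's commit protocol to it
spark.hadoop.fs.s3a.committer.magic.enabled  true
spark.hadoop.fs.s3a.committer.name           magic
spark.sql.sources.commitProtocolClass        org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class     org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

The improvement tracked here is to infer the related settings when only the first is given, so a missing companion config no longer produces the exception above.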






[jira] [Resolved] (SPARK-35383) Improve s3a magic committer support by inferring missing configs

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35383.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32518
[https://github.com/apache/spark/pull/32518]

> Improve s3a magic committer support by inferring missing configs
> 
>
> Key: SPARK-35383
> URL: https://issues.apache.org/jira/browse/SPARK-35383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> {code}
> Exception in thread "main" org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://my-spark-bucket`: Filesystem does not have support for 'magic' committer enabled in configuration option fs.s3a.committer.magic.enabled
>   at org.apache.hadoop.fs.s3a.commit.CommitUtils.verifyIsMagicCommitFS(CommitUtils.java:74)
>   at org.apache.hadoop.fs.s3a.commit.CommitUtils.getS3AFileSystem(CommitUtils.java:109)
> {code}






[jira] [Resolved] (SPARK-35387) Increase the stack size of JVM for Java 11 build test

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35387.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32521
[https://github.com/apache/spark/pull/32521]

> Increase the stack size of JVM for Java 11 build test
> -
>
> Key: SPARK-35387
> URL: https://issues.apache.org/jira/browse/SPARK-35387
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.2.0
>
>
> After merging https://github.com/apache/spark/pull/32439, there is a flaky 
> error from the GitHub Actions job "Java 11 build with Maven":
> {code:java}
> Error:  ## Exception when compiling 473 sources to 
> /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
> java.lang.StackOverflowError
> scala.reflect.internal.Trees.itransform(Trees.scala:1376)
> scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
> {code}
> We can resolve it by increasing the JVM stack size.






[jira] [Assigned] (SPARK-35387) Increase the stack size of JVM for Java 11 build test

2021-05-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35387:
-

Assignee: Gengliang Wang

> Increase the stack size of JVM for Java 11 build test
> -
>
> Key: SPARK-35387
> URL: https://issues.apache.org/jira/browse/SPARK-35387
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> After merging https://github.com/apache/spark/pull/32439, there is a flaky 
> error from the GitHub Actions job "Java 11 build with Maven":
> {code:java}
> Error:  ## Exception when compiling 473 sources to 
> /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
> java.lang.StackOverflowError
> scala.reflect.internal.Trees.itransform(Trees.scala:1376)
> scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
> {code}
> We can resolve it by increasing the JVM stack size.






[jira] [Commented] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343408#comment-17343408
 ] 

Apache Spark commented on SPARK-35361:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32522

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.2.0
>
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could 
> incur significant runtime cost from the `zipWithIndex` call. This proposes to 
> move the call outside the loop over each input row.
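The optimization can be sketched in plain Python (a hypothetical analogy, not Spark's Scala implementation): the index pairing is loop-invariant, so it can be computed once instead of once per row.

```python
# Hypothetical sketch of the fix: hoist the zipWithIndex-style pairing out
# of the per-row loop, since the pairing does not depend on the row.
def apply_per_row(rows, converters):
    results = []
    for row in rows:
        indexed = list(enumerate(converters))  # recomputed for every row (wasteful)
        results.append(tuple(f(row[i]) for i, f in indexed))
    return results

def apply_hoisted(rows, converters):
    indexed = list(enumerate(converters))      # computed once, outside the loop
    return [tuple(f(row[i]) for i, f in indexed) for row in rows]

rows = [(1, 2), (3, 4)]
converters = [lambda x: x + 1, lambda x: x * 10]
assert apply_per_row(rows, converters) == apply_hoisted(rows, converters) == [(2, 20), (4, 40)]
```

Both versions produce the same result; the hoisted form simply avoids rebuilding the invariant pairing on every input row.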






[jira] [Assigned] (SPARK-35387) Increase the stack size of JVM for Java 11 build test

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35387:


Assignee: Apache Spark

> Increase the stack size of JVM for Java 11 build test
> -
>
> Key: SPARK-35387
> URL: https://issues.apache.org/jira/browse/SPARK-35387
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> After merging https://github.com/apache/spark/pull/32439, there is a flaky 
> error from the GitHub Actions job "Java 11 build with Maven":
> {code:java}
> Error:  ## Exception when compiling 473 sources to 
> /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
> java.lang.StackOverflowError
> scala.reflect.internal.Trees.itransform(Trees.scala:1376)
> scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
> {code}
> We can resolve it by increasing the JVM stack size.






[jira] [Commented] (SPARK-35387) Increase the stack size of JVM for Java 11 build test

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343376#comment-17343376
 ] 

Apache Spark commented on SPARK-35387:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32521

> Increase the stack size of JVM for Java 11 build test
> -
>
> Key: SPARK-35387
> URL: https://issues.apache.org/jira/browse/SPARK-35387
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> After merging https://github.com/apache/spark/pull/32439, there is a flaky 
> error from the GitHub Actions job "Java 11 build with Maven":
> {code:java}
> Error:  ## Exception when compiling 473 sources to 
> /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
> java.lang.StackOverflowError
> scala.reflect.internal.Trees.itransform(Trees.scala:1376)
> scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
> {code}
> We can resolve it by increasing the JVM stack size.






[jira] [Assigned] (SPARK-35387) Increase the stack size of JVM for Java 11 build test

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35387:


Assignee: (was: Apache Spark)

> Increase the stack size of JVM for Java 11 build test
> -
>
> Key: SPARK-35387
> URL: https://issues.apache.org/jira/browse/SPARK-35387
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> After merging https://github.com/apache/spark/pull/32439, there is a flaky 
> error from the GitHub Actions job "Java 11 build with Maven":
> {code:java}
> Error:  ## Exception when compiling 473 sources to 
> /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
> java.lang.StackOverflowError
> scala.reflect.internal.Trees.itransform(Trees.scala:1376)
> scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
> {code}
> We can resolve it by increasing the JVM stack size.






[jira] [Created] (SPARK-35387) Increase the stack size of JVM for Java 11 build test

2021-05-12 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-35387:
--

 Summary: Increase the stack size of JVM for Java 11 build test
 Key: SPARK-35387
 URL: https://issues.apache.org/jira/browse/SPARK-35387
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Gengliang Wang


After merging https://github.com/apache/spark/pull/32439, there is a flaky error 
from the GitHub Actions job "Java 11 build with Maven":

{code:java}
Error:  ## Exception when compiling 473 sources to 
/home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
java.lang.StackOverflowError
scala.reflect.internal.Trees.itransform(Trees.scala:1376)
scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
{code}

We can resolve it by increasing the JVM stack size.
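A sketch of the kind of change involved (the exact stack size below is an assumption; the real value is whatever the linked PR settles on):

```shell
# Raise the JVM thread stack size for the Maven build via MAVEN_OPTS
MAVEN_OPTS="-Xss64m ${MAVEN_OPTS}" ./build/mvn -pl sql/catalyst -DskipTests compile
```

The `-Xss` flag controls the per-thread stack size; deeply recursive Scala compiler transforms like `Trees.itransform` above are what exhaust the default.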






[jira] [Resolved] (SPARK-35349) Add code-gen for left/right outer sort merge join

2021-05-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35349.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32476
[https://github.com/apache/spark/pull/32476]

> Add code-gen for left/right outer sort merge join
> -
>
> Key: SPARK-35349
> URL: https://issues.apache.org/jira/browse/SPARK-35349
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> This Jira tracks the progress of adding code-gen support for left outer / 
> right outer sort merge join. See the motivation in SPARK-34705.






[jira] [Assigned] (SPARK-35349) Add code-gen for left/right outer sort merge join

2021-05-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35349:
---

Assignee: Cheng Su

> Add code-gen for left/right outer sort merge join
> -
>
> Key: SPARK-35349
> URL: https://issues.apache.org/jira/browse/SPARK-35349
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
>
> This Jira tracks the progress of adding code-gen support for left outer / 
> right outer sort merge join. See the motivation in SPARK-34705.






[jira] [Assigned] (SPARK-35295) Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35295:


Assignee: Ludovic Henry

> Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0 
> -
>
> Key: SPARK-35295
> URL: https://issues.apache.org/jira/browse/SPARK-35295
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, ML, MLlib
>Affects Versions: 3.2.0
>Reporter: Ludovic Henry
>Assignee: Ludovic Henry
>Priority: Major
>
> As discussed in [https://github.com/apache/spark/pull/32253,] Spark cannot 
> distribute `com.github.fommil.netlib:all` for licensing reasons. This limits 
> Spark's ability to take advantage, out of the box, of native libraries that 
> provide hardware acceleration (SIMD and GPUs).
> With `dev.ludovic.netlib:2.0.0`, it is possible to take advantage of native 
> libraries without the licensing limitations, since it neither links against 
> nor distributes any GPL or LGPL libraries.
> This would allow any user of Spark to install a native library like OpenBLAS 
> or Intel MKL and transparently take advantage of it.






[jira] [Commented] (SPARK-29002) Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions

2021-05-12 Thread Wei Xue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343270#comment-17343270
 ] 

Wei Xue commented on SPARK-29002:
-

Why couldn't the 1000 small partitions be coalesced?

> Avoid changing SMJ to BHJ if the build side has a high ratio of empty 
> partitions
> 
>
> Key: SPARK-29002
> URL: https://issues.apache.org/jira/browse/SPARK-29002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-35295) Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0

2021-05-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35295.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32415
[https://github.com/apache/spark/pull/32415]

> Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0 
> -
>
> Key: SPARK-35295
> URL: https://issues.apache.org/jira/browse/SPARK-35295
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, ML, MLlib
>Affects Versions: 3.2.0
>Reporter: Ludovic Henry
>Assignee: Ludovic Henry
>Priority: Major
> Fix For: 3.2.0
>
>
> As discussed in [https://github.com/apache/spark/pull/32253,] Spark cannot 
> distribute `com.github.fommil.netlib:all` for licensing reasons. This limits 
> Spark's ability to take advantage, out of the box, of native libraries that 
> provide hardware acceleration (SIMD and GPUs).
> With `dev.ludovic.netlib:2.0.0`, it is possible to take advantage of native 
> libraries without the licensing limitations, since it neither links against 
> nor distributes any GPL or LGPL libraries.
> This would allow any user of Spark to install a native library like OpenBLAS 
> or Intel MKL and transparently take advantage of it.






[jira] [Resolved] (SPARK-35253) Upgrade Janino from 3.0.16 to 3.1.4

2021-05-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35253.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32455
[https://github.com/apache/spark/pull/32455]

> Upgrade Janino from 3.0.16 to 3.1.4
> ---
>
> Key: SPARK-35253
> URL: https://issues.apache.org/jira/browse/SPARK-35253
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.2.0
>
>
> From the [change log|http://janino-compiler.github.io/janino/changelog.html], 
> the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line 
> instead.






[jira] [Assigned] (SPARK-35253) Upgrade Janino from 3.0.16 to 3.1.4

2021-05-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35253:


Assignee: Takeshi Yamamuro

> Upgrade Janino from 3.0.16 to 3.1.4
> ---
>
> Key: SPARK-35253
> URL: https://issues.apache.org/jira/browse/SPARK-35253
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> From the [change log|http://janino-compiler.github.io/janino/changelog.html], 
> the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line 
> instead.






[jira] [Resolved] (SPARK-35357) Allow to turn off the normalization applied by static PageRank utilities

2021-05-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35357.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32485
[https://github.com/apache/spark/pull/32485]

> Allow to turn off the normalization applied by static PageRank utilities
> 
>
> Key: SPARK-35357
> URL: https://issues.apache.org/jira/browse/SPARK-35357
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 3.1.1
>Reporter: bonnal-enzo
>Assignee: bonnal-enzo
>Priority: Minor
> Fix For: 3.2.0
>
>
> Since SPARK-18847, the static PageRank computations available in `PageRank.scala` 
> normalize the sum of the ranks after the fixed number of iterations has 
> completed, and *there is no way for a developer to access the raw, 
> non-normalized rank values*.
> Since SPARK-29877 one can run a fixed number of PageRank iterations starting 
> from a previous `preRankGraph`'s ranks.
> This nice feature opens the door to interesting *incremental algorithms*, for 
> example: "Run some initial PageRank iterations using `PageRank.runWithOptions`, 
> then update the graph's edges and update the ranks with a call to 
> `PageRank.runWithOptionsWithPreviousPageRank`, and so on."
> Algorithms of this kind would benefit (in precision) from being able to 
> manipulate the raw ranks directly (rather than the normalized ones) when the 
> graph has a substantial proportion of sinks (vertices without outgoing edges).
> It would be nice to add a method signature with a boolean that turns off the 
> automatic normalization run at the end of `PageRank.runWithOptions` and 
> `PageRank.runWithOptionsWithPreviousPageRank`, leaving developers free to apply 
> the normalization only when they really need it.
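The requested knob can be sketched in plain Python (a hypothetical standalone sketch, not the GraphX API): a power-iteration PageRank where the final normalization is optional, so incremental callers can keep the raw rank values between runs.

```python
# Hypothetical sketch: tiny power-iteration PageRank with the proposed
# boolean to skip the final normalization and preserve raw ranks.
def page_rank(links, iters=10, d=0.85, normalize=True):
    verts = sorted({v for src, targets in links.items() for v in (src, *targets)})
    ranks = {v: 1.0 for v in verts}
    for _ in range(iters):
        contrib = {v: 0.0 for v in verts}
        for src, targets in links.items():
            for t in targets:
                contrib[t] += ranks[src] / len(targets)  # spread rank over out-edges
        ranks = {v: (1 - d) + d * contrib[v] for v in verts}
    if normalize:
        # here: scale so the ranks sum to 1 (GraphX's exact scaling may differ)
        total = sum(ranks.values())
        ranks = {v: r / total for v, r in ranks.items()}
    return ranks

links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
assert abs(sum(page_rank(links).values()) - 1.0) < 1e-9
# with normalize=False the raw sum (3.0 for this sink-free graph) survives
assert abs(sum(page_rank(links, normalize=False).values()) - 3.0) < 1e-9
```

With `normalize=False`, the raw ranks can be fed back into a subsequent incremental run without the precision loss the reporter describes.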






[jira] [Updated] (SPARK-35386) parquet read with schema should fail on non-existing columns

2021-05-12 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-35386:
--
Description: 
When a read schema is specified, as a user I would prefer if Spark failed on 
missing columns.

{code:python}
from pyspark.sql.types import DoubleType, StructField, StructType

spark: SparkSession = ...

spark.read.parquet("/tmp/data.snappy.parquet")
# inferred schema, includes 3 columns: col1, col2, new_col
# DataFrame[col1: bigint, col2: bigint, new_col: bigint]

# let's specify a custom read_schema, with **non nullable** col3 (which is not present):
read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])

df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")

df.schema
# we get a DataFrame with **nullable** col3:
# StructType(List(StructField(col3,DoubleType,true)))

df.count()
# 0
{code}

Is this a feature or a bug? In this case there's just a single parquet file; I 
have also tried {{option("mergeSchema", "true")}}, which doesn't help.

A similar read pattern would fail on pandas (and likely dask).

  was:
When read schema is specified as I user I would prefer/like if spark failed on 
missing columns.

{code:python}
spark: SparkSession = ...

spark.read.parquet("/tmp/data.snappy.parquet")
# inferred schema, includes 3 columns: col1, col2, new_col
# DataFrame[col1: bigint, col2: bigint, new_col: bigint]

# let's specify a custom read_schema, with **non nullable** col3 (which is not 
present):
read_schema = StructType(fields=[StructField("col3",DoubleType(),False)])

df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")

df.schema
# we get a DataFrame with **nullable** col3:
# StructType(List(StructField(col3,DoubleType,true)))

df.count()
# 0
{code}

Is this a feature or a bug? In this case there's just a single parquet file, I 
have also tried {{option("mergeSchema", "true")}}, which doesn't help.

Similar read pattern would fail on pandas (and likely dask).


> parquet read with schema should fail on non-existing columns
> 
>
> Key: SPARK-35386
> URL: https://issues.apache.org/jira/browse/SPARK-35386
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> When a read schema is specified, as a user I would prefer if Spark failed 
> on missing columns.
> {code:python}
> from pyspark.sql.types import DoubleType, StructField, StructType
> spark: SparkSession = ...
> spark.read.parquet("/tmp/data.snappy.parquet")
> # inferred schema, includes 3 columns: col1, col2, new_col
> # DataFrame[col1: bigint, col2: bigint, new_col: bigint]
> # let's specify a custom read_schema, with **non nullable** col3 (which is not present):
> read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])
> df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
> df.schema
> # we get a DataFrame with **nullable** col3:
> # StructType(List(StructField(col3,DoubleType,true)))
> df.count()
> # 0
> {code}
> Is this a feature or a bug? In this case there's just a single parquet file; 
> I have also tried {{option("mergeSchema", "true")}}, which doesn't help.
> A similar read pattern would fail on pandas (and likely dask).
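As an aside, in PySpark `DoubleType`, `StructField`, and `StructType` live in `pyspark.sql.types`, not `pyspark.sql.dataframe`. The fail-fast behavior being requested can be sketched in plain Python (a hypothetical helper, not the PySpark API):

```python
# Hypothetical sketch of fail-fast column selection: raise on a missing
# column instead of silently returning an empty, all-nullable result.
def select_columns(rows, available, requested):
    missing = [c for c in requested if c not in available]
    if missing:
        raise KeyError(f"columns not found in file: {missing}")
    idx = [available.index(c) for c in requested]
    return [[row[i] for i in idx] for row in rows]

rows = [[1, 2, 3], [4, 5, 6]]
assert select_columns(rows, ["col1", "col2", "new_col"], ["col2"]) == [[2], [5]]
try:
    select_columns(rows, ["col1", "col2", "new_col"], ["col3"])
except KeyError:
    pass  # the behavior the reporter would prefer Spark to have
```

Spark instead resolves the requested `col3` to a nullable column with no matching data, yielding the empty result shown above.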






[jira] [Assigned] (SPARK-35357) Allow to turn off the normalization applied by static PageRank utilities

2021-05-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35357:


Assignee: bonnal-enzo

> Allow to turn off the normalization applied by static PageRank utilities
> 
>
> Key: SPARK-35357
> URL: https://issues.apache.org/jira/browse/SPARK-35357
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 3.1.1
>Reporter: bonnal-enzo
>Assignee: bonnal-enzo
>Priority: Minor
>
> Since SPARK-18847, the static PageRank computations available in `PageRank.scala` 
> normalize the sum of the ranks after the fixed number of iterations has 
> completed, and *there is no way for a developer to access the raw, 
> non-normalized rank values*.
> Since SPARK-29877 one can run a fixed number of PageRank iterations starting 
> from a previous `preRankGraph`'s ranks.
> This nice feature opens the door to interesting *incremental algorithms*, for 
> example: "Run some initial PageRank iterations using `PageRank.runWithOptions`, 
> then update the graph's edges and update the ranks with a call to 
> `PageRank.runWithOptionsWithPreviousPageRank`, and so on."
> Algorithms of this kind would benefit (in precision) from being able to 
> manipulate the raw ranks directly (rather than the normalized ones) when the 
> graph has a substantial proportion of sinks (vertices without outgoing edges).
> It would be nice to add a method signature with a boolean that turns off the 
> automatic normalization run at the end of `PageRank.runWithOptions` and 
> `PageRank.runWithOptionsWithPreviousPageRank`, leaving developers free to apply 
> the normalization only when they really need it.






[jira] [Updated] (SPARK-35386) parquet read with schema should fail on non-existing columns

2021-05-12 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-35386:
--
Description: 
When a read schema is specified, as a user I would prefer that Spark failed on 
missing columns.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StructField, StructType

spark: SparkSession = ...

spark.read.parquet("/tmp/data.snappy.parquet")
# inferred schema, includes 3 columns: col1, col2, new_col
# DataFrame[col1: bigint, col2: bigint, new_col: bigint]

# let's specify a custom read_schema, with **non nullable** col3 (which is not 
present):
read_schema = StructType(fields=[StructField("col3",DoubleType(),False)])

df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")

df.schema
# we get a DataFrame with **nullable** col3:
# StructType(List(StructField(col3,DoubleType,true)))

df.count()
# 0
{code}

Is this a feature or a bug? In this case there's just a single parquet file, I 
have also tried {{option("mergeSchema", "true")}}, which doesn't help.

Similar read pattern would fail on pandas (and likely dask).
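Until/unless Spark fails fast here, a possible user-side workaround is to compare the requested read schema's field names against the file's inferred schema and raise on missing columns, instead of silently getting an all-null empty result. `check_read_schema` is a hypothetical helper, not a Spark API; plain strings stand in for StructField names.

```python
def check_read_schema(requested_fields, inferred_fields):
    """Raise if any requested column is absent from the file's schema."""
    missing = sorted(set(requested_fields) - set(inferred_fields))
    if missing:
        raise ValueError(f"Columns not present in file: {missing}")
    return requested_fields

# With real Spark this would be driven by (assumed usage, not runnable here):
#   inferred = spark.read.parquet(path).schema.fieldNames()
#   check_read_schema([f.name for f in read_schema.fields], inferred)

inferred = ["col1", "col2", "new_col"]   # mirrors the example above
check_read_schema(["col1"], inferred)    # passes
# check_read_schema(["col3"], inferred)  # raises ValueError
```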

  was:
When a read schema is specified, as a user I would prefer that Spark failed on 
missing columns.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StructField, StructType
from typing import List

spark: SparkSession = ...

spark.read.parquet("/tmp/data.snappy.parquet")
# inferred schema, includes 3 columns: col1, col2, new_col
# DataFrame[col1: bigint, col2: bigint, new_col: bigint]

# let's specify a custom read_schema, with **non nullable** col3 (which is not 
present):
read_schema = StructType(fields=[StructField("col3",DoubleType(),False)])

df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")

df.schema
# we get a DataFrame with **nullable** col3:
# StructType(List(StructField(col3,DoubleType,true)))

df.count()
# 0
{code}

Is this a feature or a bug? In this case there's just a single parquet file, I 
have also tried {{option("mergeSchema", "true")}}, which doesn't help.

Similar read pattern would fail on pandas (and likely dask).


> parquet read with schema should fail on non-existing columns
> 
>
> Key: SPARK-35386
> URL: https://issues.apache.org/jira/browse/SPARK-35386
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> When a read schema is specified, as a user I would prefer that Spark failed 
> on missing columns.
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import DoubleType, StructField, StructType
> spark: SparkSession = ...
> spark.read.parquet("/tmp/data.snappy.parquet")
> # inferred schema, includes 3 columns: col1, col2, new_col
> # DataFrame[col1: bigint, col2: bigint, new_col: bigint]
> # let's specify a custom read_schema, with **non nullable** col3 (which is 
> not present):
> read_schema = StructType(fields=[StructField("col3",DoubleType(),False)])
> df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
> df.schema
> # we get a DataFrame with **nullable** col3:
> # StructType(List(StructField(col3,DoubleType,true)))
> df.count()
> # 0
> {code}
> Is this a feature or a bug? In this case there's just a single parquet file, 
> I have also tried {{option("mergeSchema", "true")}}, which doesn't help.
> Similar read pattern would fail on pandas (and likely dask).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35386) parquet read with schema should fail on non-existing columns

2021-05-12 Thread Rafal Wojdyla (Jira)
Rafal Wojdyla created SPARK-35386:
-

 Summary: parquet read with schema should fail on non-existing 
columns
 Key: SPARK-35386
 URL: https://issues.apache.org/jira/browse/SPARK-35386
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, PySpark
Affects Versions: 3.0.1
Reporter: Rafal Wojdyla


When a read schema is specified, as a user I would prefer that Spark failed on 
missing columns.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StructField, StructType

spark: SparkSession = ...

spark.read.parquet("/tmp/data.snappy.parquet")
# inferred schema, includes 3 columns: col1, col2, new_col
# DataFrame[col1: bigint, col2: bigint, new_col: bigint]

# let's specify a custom read_schema, with **non nullable** col3 (which is not 
present):
read_schema = StructType(fields=[StructField("col3",DoubleType(),False)])

df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")

df.schema
# we get a DataFrame with **nullable** col3:
# StructType(List(StructField(col3,DoubleType,true)))

df.count()
# 0
{code}

Is this a feature or a bug? In this case there's just a single parquet file, I 
have also tried {{option("mergeSchema", "true")}}, which doesn't help.

Similar read pattern would fail on pandas (and likely dask).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35385) skip duplicate queries in the TPCDS-related tests

2021-05-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-35385:
-
Summary: skip duplicate queries in the TPCDS-related tests  (was: kip 
duplicate queries in the TPCDS-related tests)

> skip duplicate queries in the TPCDS-related tests
> -
>
> Key: SPARK-35385
> URL: https://issues.apache.org/jira/browse/SPARK-35385
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" 
> queries in the TPCDS-related tests because the TPCDS v2.7 query set contains 
> almost identical queries; the only differences are the ORDER BY columns.
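One way to see why these pairs count as duplicates: normalize the SQL and drop the final ORDER BY clause before comparing. The sketch below is illustrative only (not the actual test code), and the regex-based normalization is a stated simplification; real duplicate detection would use a SQL parser.

```python
import re

def normalize(sql):
    """Lowercase, collapse whitespace, and strip a trailing ORDER BY clause."""
    sql = re.sub(r"\s+", " ", sql.strip().lower())
    return re.sub(r"order by [^)]*$", "", sql).strip()

# Two hypothetical variants that differ only in their ORDER BY columns:
q_v1 = "SELECT a, b FROM t GROUP BY a, b ORDER BY a"
q_v27 = "SELECT a, b FROM t GROUP BY a, b ORDER BY b, a"
duplicate = normalize(q_v1) == normalize(q_v27)   # True
```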



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35385) kip duplicate queries in the TPCDS-related tests

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35385:


Assignee: Apache Spark

> kip duplicate queries in the TPCDS-related tests
> 
>
> Key: SPARK-35385
> URL: https://issues.apache.org/jira/browse/SPARK-35385
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> This ticket proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" 
> queries in the TPCDS-related tests because the TPCDS v2.7 queries have almost 
> the same ones; the only differences in these queries are ORDER BY columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35385) kip duplicate queries in the TPCDS-related tests

2021-05-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343238#comment-17343238
 ] 

Apache Spark commented on SPARK-35385:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32520

> kip duplicate queries in the TPCDS-related tests
> 
>
> Key: SPARK-35385
> URL: https://issues.apache.org/jira/browse/SPARK-35385
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" 
> queries in the TPCDS-related tests because the TPCDS v2.7 queries have almost 
> the same ones; the only differences in these queries are ORDER BY columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35385) kip duplicate queries in the TPCDS-related tests

2021-05-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35385:


Assignee: (was: Apache Spark)

> kip duplicate queries in the TPCDS-related tests
> 
>
> Key: SPARK-35385
> URL: https://issues.apache.org/jira/browse/SPARK-35385
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" 
> queries in the TPCDS-related tests because the TPCDS v2.7 queries have almost 
> the same ones; the only differences in these queries are ORDER BY columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35385) kip duplicate queries in the TPCDS-related tests

2021-05-12 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-35385:


 Summary: kip duplicate queries in the TPCDS-related tests
 Key: SPARK-35385
 URL: https://issues.apache.org/jira/browse/SPARK-35385
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Takeshi Yamamuro


This ticket proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" 
queries in the TPCDS-related tests because the TPCDS v2.7 query set contains 
almost identical queries; the only differences are the ORDER BY columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35298) Migrate to transformWithPruning for rules in optimizer/Optimizer.scala

2021-05-12 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-35298.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32439
[https://github.com/apache/spark/pull/32439]

> Migrate to transformWithPruning for rules in optimizer/Optimizer.scala
> --
>
> Key: SPARK-35298
> URL: https://issues.apache.org/jira/browse/SPARK-35298
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
> Fix For: 3.2.0
>
>
> PushXxx rules are handled in SPARK-35077.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35298) Migrate to transformWithPruning for rules in optimizer/Optimizer.scala

2021-05-12 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-35298:
--

Assignee: Yingyi Bu

> Migrate to transformWithPruning for rules in optimizer/Optimizer.scala
> --
>
> Key: SPARK-35298
> URL: https://issues.apache.org/jira/browse/SPARK-35298
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
>
> PushXxx rules are handled in SPARK-35077.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35346) More clause needed for combining groupby and cube

2021-05-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-35346:
-
Component/s: SQL

> More clause needed for combining groupby and cube
> -
>
> Key: SPARK-35346
> URL: https://issues.apache.org/jira/browse/SPARK-35346
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0, 3.0.2, 3.1.1
>Reporter: Kai
>Priority: Major
>
> In PySpark, an aggregation clause must follow a groupby, rollup, or cube 
> clause. I think we need more flexibility here: in SQL one can write "GROUP 
> BY xxx, xxx, CUBE(xxx, xxx)", but in PySpark there is no way to cube over 
> one field while plainly grouping by the others. Cubing over all fields 
> incurs much more cost computing useless data. So I think we need to improve 
> this. Thank you!
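The cost difference can be made concrete by expanding the grouping sets each form generates. The sketch below is a worked illustration of the semantics, not Spark code: "GROUP BY a, CUBE(b, c)" expands to sets that always contain `a` plus every subset of `{b, c}` (4 sets), while "CUBE(a, b, c)" expands to all 8 subsets.

```python
from itertools import combinations

def mixed_grouping_sets(group_cols, cube_cols):
    """Grouping sets produced by GROUP BY <group_cols>, CUBE(<cube_cols>)."""
    sets_ = []
    for k in range(len(cube_cols), -1, -1):
        for subset in combinations(cube_cols, k):
            sets_.append(tuple(group_cols) + subset)
    return sets_

sets_ = mixed_grouping_sets(["a"], ["b", "c"])
# [('a', 'b', 'c'), ('a', 'b'), ('a', 'c'), ('a',)]
```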



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35346) More clause needed for combining groupby and cube

2021-05-12 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343208#comment-17343208
 ] 

Takeshi Yamamuro commented on SPARK-35346:
--

Do you mean this feature? https://issues.apache.org/jira/browse/SPARK-33229 
(https://github.com/apache/spark/blame/master/sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql#L74-L81)
 
If yes, it is already supported in the recent master.

> More clause needed for combining groupby and cube
> -
>
> Key: SPARK-35346
> URL: https://issues.apache.org/jira/browse/SPARK-35346
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0, 3.0.2, 3.1.1
>Reporter: Kai
>Priority: Major
>
> In PySpark, an aggregation clause must follow a groupby, rollup, or cube 
> clause. I think we need more flexibility here: in SQL one can write "GROUP 
> BY xxx, xxx, CUBE(xxx, xxx)", but in PySpark there is no way to cube over 
> one field while plainly grouping by the others. Cubing over all fields 
> incurs much more cost computing useless data. So I think we need to improve 
> this. Thank you!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35243) Support columnar execution on ANSI interval types

2021-05-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35243:


Assignee: Peng Lei

> Support columnar execution on ANSI interval types
> -
>
> Key: SPARK-35243
> URL: https://issues.apache.org/jira/browse/SPARK-35243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Peng Lei
>Priority: Major
> Fix For: 3.2.0
>
>
> See SPARK-30066 as a reference implementation for CalendarIntervalType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35243) Support columnar execution on ANSI interval types

2021-05-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35243.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/32452

> Support columnar execution on ANSI interval types
> -
>
> Key: SPARK-35243
> URL: https://issues.apache.org/jira/browse/SPARK-35243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> See SPARK-30066 as a reference implementation for CalendarIntervalType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29002) Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions

2021-05-12 Thread Penglei Shi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343137#comment-17343137
 ] 

Penglei Shi commented on SPARK-29002:
-

[~maryannxue] Hi Wei Xue, as the issue describes, the rule depends on the 
ratio of empty partitions. But in my scenario, I set the initial partition 
number to 1000; after a shuffle exchange there are 1000 small partitions, but 
most of them are not empty. When changing SMJ to BHJ there will be 1000 small 
tasks, which cannot be coalesced and which produce massive numbers of small 
files; too many small tasks also take more time to schedule. Is there a better 
way to cover this scenario?
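For reference, the check this issue added can be sketched as follows. This is a pure-Python illustration, not Spark's actual code; the 0.2 default mirrors what I understand the related Spark threshold config to be (an assumption), and the partition byte sizes are toy data.

```python
def allow_smj_to_bhj(partition_sizes, min_non_empty_ratio=0.2):
    """Allow the SMJ -> BHJ rewrite only if enough build-side partitions
    carry data; a mostly-empty build side suggests the broadcast join
    estimate is unreliable and the rewrite may not pay off."""
    if not partition_sizes:
        return False
    non_empty = sum(1 for size in partition_sizes if size > 0)
    return non_empty / len(partition_sizes) >= min_non_empty_ratio

allow_smj_to_bhj([0, 0, 0, 0, 1024])   # 20% non-empty -> allowed
allow_smj_to_bhj([0] * 99 + [1024])    # 1% non-empty -> rejected
```

Note the scenario in the comment above slips past this check: 1000 small but non-empty partitions give a non-empty ratio of 1.0, so the rewrite is still allowed.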

> Avoid changing SMJ to BHJ if the build side has a high ratio of empty 
> partitions
> 
>
> Key: SPARK-29002
> URL: https://issues.apache.org/jira/browse/SPARK-29002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35317) Job submission in high performance computing environments

2021-05-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343112#comment-17343112
 ] 

Hyukjin Kwon commented on SPARK-35317:
--

These look like a better fit for third-party libraries. The Spark scheduler is 
the most conservative area, one the community handles with extra care whenever 
they touch it.
If these schedulers perform better than Spark's scheduler without breaking 
anything, it would be great to discuss them with actual numbers and test 
coverage.
In practice, I would also like to see that they are proven in production with 
a large group of users.


> Job submission in high performance computing environments
> -
>
> Key: SPARK-35317
> URL: https://issues.apache.org/jira/browse/SPARK-35317
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.1.2
>Reporter: Benson Muite
>Priority: Minor
>
> Spark is often used in high performance computing environments, but the 
> default launcher does not directly support commonly used schedulers such as 
> [slurm|https://slurm.schedmd.com/overview.html] and 
> [pbs|https://openpbs.org/]. It would be good to support these directly. The 
> repositories [https://github.com/ekasitk/spark-on-hpc] and 
> [https://github.com/rokroskar/sparkhpc] contain most of the material 
> necessary, but it may be good to incorporate this into Spark directly so 
> that Java and Scala are also well supported.
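For a sense of what the linked repositories do, here is a minimal sketch that generates a Slurm batch script starting a standalone Spark cluster across the allocation and then running spark-submit. Everything here is a placeholder assumption: the paths, partition name, resource numbers, and cluster-startup commands would need adapting to a real site.

```python
def render_sbatch(app_jar, main_class, nodes=4, partition="compute"):
    """Render a Slurm sbatch script (hypothetical layout) that launches a
    standalone Spark master on the first allocated node, workers on the
    allocation, and then submits the application."""
    return f"""#!/bin/bash
#SBATCH --job-name=spark-job
#SBATCH --partition={partition}
#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node=1

export SPARK_HOME=/opt/spark            # placeholder install path
MASTER_HOST=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
srun --nodes=1 --ntasks=1 -w "$MASTER_HOST" "$SPARK_HOME/sbin/start-master.sh"
srun "$SPARK_HOME/sbin/start-worker.sh" "spark://$MASTER_HOST:7077"
"$SPARK_HOME/bin/spark-submit" --master "spark://$MASTER_HOST:7077" \\
  --class {main_class} {app_jar}
"""

script = render_sbatch("app.jar", "com.example.Main")
```

Generating the script from Python/Java/Scala rather than hand-writing it is what would let Spark itself offer first-class slurm/pbs support.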



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


