[jira] [Updated] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry

2021-11-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37023:
--
Affects Version/s: (was: 3.2.0)
   3.3.0

> Avoid fetching merge status when shuffleMergeEnabled is false for a 
> shuffleDependency during retry
> --
>
> Key: SPARK-37023
> URL: https://issues.apache.org/jira/browse/SPARK-37023
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.3.0
>Reporter: Ye Zhou
>Priority: Major
>
> The assertion below in MapOutputTracker.getMapSizesByExecutorId is not 
> guaranteed to hold:
> {code:java}
> assert(mapSizesByExecutorId.enableBatchFetch == true){code}
> The reason is that during some stage retries, 
> shuffleDependency.shuffleMergeEnabled is set to false, yet merge statuses 
> still exist because the driver has already collected them for this shuffle 
> dependency. In that case, the current implementation sets enableBatchFetch 
> to false simply because merge statuses are present.
> Details can be found here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492]
> We should improve the implementation here.
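
As a rough illustration of the suggested improvement (a minimal sketch only, not the actual Spark patch; the helper name shouldFetchMergeResult is made up for this example): when shuffleMergeEnabled is false for the dependency, merge statuses can be ignored entirely, so enableBatchFetch is not switched off by stale merge statuses left on the driver.
{code:java}
// Hedged sketch only, not Spark's actual implementation of SPARK-37023.
// Idea: when shuffleMergeEnabled is false for the dependency (e.g. after a
// stage retry), do not consult merge statuses at all, so enableBatchFetch
// is not disabled just because stale merge statuses exist on the driver.
def shouldFetchMergeResult(
    shuffleMergeEnabled: Boolean,      // from the ShuffleDependency
    mergeStatusesPresent: Boolean      // whether the driver holds MergeStatus entries
  ): Boolean = {
  shuffleMergeEnabled && mergeStatusesPresent
}

// enableBatchFetch can then stay true whenever merge results are not fetched.
val enableBatchFetch = !shouldFetchMergeResult(
  shuffleMergeEnabled = false,   // the retry case described above
  mergeStatusesPresent = true)
assert(enableBatchFetch)
{code}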



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry

2021-11-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37023:
--
Affects Version/s: (was: 3.3.0)
   3.2.0

> Avoid fetching merge status when shuffleMergeEnabled is false for a 
> shuffleDependency during retry
> --
>
> Key: SPARK-37023
> URL: https://issues.apache.org/jira/browse/SPARK-37023
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Ye Zhou
>Priority: Major
>
> The assertion below in MapOutputTracker.getMapSizesByExecutorId is not 
> guaranteed to hold:
> {code:java}
> assert(mapSizesByExecutorId.enableBatchFetch == true){code}
> The reason is that during some stage retries, 
> shuffleDependency.shuffleMergeEnabled is set to false, yet merge statuses 
> still exist because the driver has already collected them for this shuffle 
> dependency. In that case, the current implementation sets enableBatchFetch 
> to false simply because merge statuses are present.
> Details can be found here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492]
> We should improve the implementation here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37206) Upgrade Avro to 1.11.0

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438472#comment-17438472
 ] 

Apache Spark commented on SPARK-37206:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34482

> Upgrade Avro to 1.11.0
> --
>
> Key: SPARK-37206
> URL: https://issues.apache.org/jira/browse/SPARK-37206
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Recently, Avro 1.11.0 was released, which includes a bunch of bug fixes.
> https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37206) Upgrade Avro to 1.11.0

2021-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37206:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Upgrade Avro to 1.11.0
> --
>
> Key: SPARK-37206
> URL: https://issues.apache.org/jira/browse/SPARK-37206
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Recently, Avro 1.11.0 was released, which includes a bunch of bug fixes.
> https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37206) Upgrade Avro to 1.11.0

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438473#comment-17438473
 ] 

Apache Spark commented on SPARK-37206:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34482

> Upgrade Avro to 1.11.0
> --
>
> Key: SPARK-37206
> URL: https://issues.apache.org/jira/browse/SPARK-37206
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Recently, Avro 1.11.0 was released, which includes a bunch of bug fixes.
> https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37206) Upgrade Avro to 1.11.0

2021-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37206:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Upgrade Avro to 1.11.0
> --
>
> Key: SPARK-37206
> URL: https://issues.apache.org/jira/browse/SPARK-37206
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> Recently, Avro 1.11.0 was released, which includes a bunch of bug fixes.
> https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37206) Upgrade Avro to 1.11.0

2021-11-03 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-37206:
--

 Summary: Upgrade Avro to 1.11.0
 Key: SPARK-37206
 URL: https://issues.apache.org/jira/browse/SPARK-37206
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


Recently, Avro 1.11.0 was released, which includes a bunch of bug fixes.
https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37108) Expose make_date expression in R

2021-11-03 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-37108.

Fix Version/s: 3.3.0
 Assignee: Leona Yoda
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/34480

> Expose make_date expression in R
> 
>
> Key: SPARK-37108
> URL: https://issues.apache.org/jira/browse/SPARK-37108
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Assignee: Leona Yoda
>Priority: Minor
> Fix For: 3.3.0
>
>
> Expose make_date API on SparkR.
>  
> (cf. https://issues.apache.org/jira/browse/SPARK-36554)
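
For context, here is a minimal Scala illustration of the underlying make_date expression that the new SparkR wrapper exposes (assumes an active SparkSession bound to `spark`, as in spark-shell; the SQL function itself has been available since Spark 3.0):
{code:java}
// Illustration only: the make_date expression that the SparkR API wraps.
// Assumes an active SparkSession bound to `spark`.
import spark.implicits._

val df = Seq((2021, 11, 3)).toDF("y", "m", "d")
df.selectExpr("make_date(y, m, d) AS date").show()
// +----------+
// |      date|
// +----------+
// |2021-11-03|
// +----------+
{code}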



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37054) Porting "pandas API on Spark: Internals" to PySpark docs.

2021-11-03 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-37054.
-
Resolution: Won't Do

Won't do for now.

> Porting "pandas API on Spark: Internals" to PySpark docs.
> -
>
> Key: SPARK-37054
> URL: https://issues.apache.org/jira/browse/SPARK-37054
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We have a 
> [document|https://docs.google.com/document/d/1PR88p6yMHIeSxkDkSqCxLofkcnP0YtwQ2tETfyAWLQQ/edit?usp=sharing]
>  for pandas API on Spark internal features, separate from the official PySpark 
> documentation.
>  
> Since pandas API on Spark is officially released in Spark 3.2, it would be good 
> to port this internal document to the official PySpark documentation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36989) Migrate type hint data tests

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438432#comment-17438432
 ] 

Apache Spark commented on SPARK-36989:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34481

> Migrate type hint data tests
> 
>
> Key: SPARK-36989
> URL: https://issues.apache.org/jira/browse/SPARK-36989
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Before the migration, {{pyspark-stubs}} contained a set of [data 
> tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit],
>  modeled after, and using the internal test utilities of, mypy.
> These were omitted during the migration for a few reasons:
>  * Simplicity.
>  * Relative slowness.
>  * Dependence on non-public API.
>  
> Data tests are useful for a number of reasons:
>  
>  * Improving test coverage for type hints.
>  * Checking whether type checkers infer the expected types.
>  * Checking whether type checkers reject incorrect code.
>  * Detecting unusual errors in code that otherwise type checks.
>  
> In particular, the last two functions are not fulfilled by simple validation of 
> the existing codebase.
>  
> Data tests are not required for all annotations and can be restricted to code 
> that has a high possibility of failure:
>  * Complex overloaded signatures.
>  * Complex generics.
>  * Generic {{self}} annotations.
>  * Code containing {{type: ignore}}.
> The biggest risk is that output matchers have to be updated whenever signatures 
> and/or mypy output change.
> An example of a problem detected with data tests can be found in the SPARK-36894 PR 
> ([https://github.com/apache/spark/pull/34146]).
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37108) Expose make_date expression in R

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438389#comment-17438389
 ] 

Apache Spark commented on SPARK-37108:
--

User 'yoda-mon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34480

> Expose make_date expression in R
> 
>
> Key: SPARK-37108
> URL: https://issues.apache.org/jira/browse/SPARK-37108
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Minor
>
> Expose make_date API on SparkR.
>  
> (cf. https://issues.apache.org/jira/browse/SPARK-36554)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37108) Expose make_date expression in R

2021-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37108:


Assignee: Apache Spark

> Expose make_date expression in R
> 
>
> Key: SPARK-37108
> URL: https://issues.apache.org/jira/browse/SPARK-37108
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Assignee: Apache Spark
>Priority: Minor
>
> Expose make_date API on SparkR.
>  
> (cf. https://issues.apache.org/jira/browse/SPARK-36554)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37108) Expose make_date expression in R

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438386#comment-17438386
 ] 

Apache Spark commented on SPARK-37108:
--

User 'yoda-mon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34480

> Expose make_date expression in R
> 
>
> Key: SPARK-37108
> URL: https://issues.apache.org/jira/browse/SPARK-37108
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Minor
>
> Expose make_date API on SparkR.
>  
> (cf. https://issues.apache.org/jira/browse/SPARK-36554)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37108) Expose make_date expression in R

2021-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37108:


Assignee: (was: Apache Spark)

> Expose make_date expression in R
> 
>
> Key: SPARK-37108
> URL: https://issues.apache.org/jira/browse/SPARK-37108
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Minor
>
> Expose make_date API on SparkR.
>  
> (cf. https://issues.apache.org/jira/browse/SPARK-36554)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN

2021-11-03 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37205:
-
Description: {{mapreduce.job.send-token-conf}} is a useful feature in 
Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with 
which the RM is not required to statically hold the config for all the secure 
HDFS clusters. Currently it only works for MRv2, but it would be nice if Spark 
could also use this feature. I think we only need to pass the config to 
{{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}.  (was: 
{{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see 
[YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM 
is not required to statically hold the config for all the secure HDFS clusters. 
Currently it only works for MRv2, but it would be nice if Spark could also use 
this feature. I think we only need to pass the config to {{LaunchContainerContext}} 
before invoking {{NMClient.startContainer}}.)

> Support mapreduce.job.send-token-conf when starting containers in YARN
> --
>
> Key: SPARK-37205
> URL: https://issues.apache.org/jira/browse/SPARK-37205
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see 
> [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM 
> is not required to statically hold the config for all the secure HDFS clusters. 
> Currently it only works for MRv2, but it would be nice if Spark could also use 
> this feature. I think we only need to pass the config to 
> {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}.
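
A rough sketch of what this could look like, assuming the ContainerLaunchContext.setTokensConf API introduced by YARN-5910; this is not an actual Spark patch, and a real change would likely filter the configuration down to the keys matched by mapreduce.job.send-token-conf:
{code:java}
// Hedged sketch only: how the token-renewal conf might be attached to the AM
// container in Client.createContainerLaunchContext, assuming Hadoop exposes
// ContainerLaunchContext.setTokensConf (added by YARN-5910).
import java.nio.ByteBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.DataOutputBuffer
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext

def attachTokensConf(amContainer: ContainerLaunchContext, hadoopConf: Configuration): Unit = {
  // Only do the extra work when the user asked for it.
  if (hadoopConf.get("mapreduce.job.send-token-conf") != null) {
    // Serialize the configuration; a real implementation would likely copy
    // only the keys matched by the regex in mapreduce.job.send-token-conf.
    val dob = new DataOutputBuffer()
    hadoopConf.write(dob)
    amContainer.setTokensConf(ByteBuffer.wrap(dob.getData, 0, dob.getLength))
  }
}
{code}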



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN

2021-11-03 Thread Chao Sun (Jira)
Chao Sun created SPARK-37205:


 Summary: Support mapreduce.job.send-token-conf when starting 
containers in YARN
 Key: SPARK-37205
 URL: https://issues.apache.org/jira/browse/SPARK-37205
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Affects Versions: 3.3.0
Reporter: Chao Sun


{{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see 
[YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is 
not required to statically hold the config for all the secure HDFS clusters. 
Currently it only works for MRv2, but it would be nice if Spark could also use this 
feature. I think we only need to pass the config to {{LaunchContainerContext}} 
before invoking {{NMClient.startContainer}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36566) Add Spark appname as a label to the executor pods

2021-11-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36566.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34460
[https://github.com/apache/spark/pull/34460]

> Add Spark appname as a label to the executor pods
> -
>
> Key: SPARK-36566
> URL: https://issues.apache.org/jira/browse/SPARK-36566
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Trivial
> Fix For: 3.3.0
>
>
> Adding the appName as a label to the executor pods could simplify debugging.
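
Until a built-in label lands, here is a minimal illustration of the existing workaround via Spark's spark.kubernetes.executor.label.* option (the label key app-name below is just an example, not the label this ticket adds):
{code:java}
// Illustration: attaching the app name to executor pods today with the
// existing spark.kubernetes.executor.label.* option. The label key
// "app-name" is an example value only.
import org.apache.spark.SparkConf

val appName = "my-debug-app"
val conf = new SparkConf()
  .setAppName(appName)
  .set("spark.kubernetes.executor.label.app-name", appName)

// Executors launched with this conf carry the label app-name=my-debug-app, e.g.:
//   kubectl get pods -l app-name=my-debug-app
{code}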



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36566) Add Spark appname as a label to the executor pods

2021-11-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36566:
-

Assignee: Yikun Jiang  (was: Apache Spark)

> Add Spark appname as a label to the executor pods
> -
>
> Key: SPARK-36566
> URL: https://issues.apache.org/jira/browse/SPARK-36566
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Holden Karau
>Assignee: Yikun Jiang
>Priority: Trivial
> Fix For: 3.3.0
>
>
> Adding the appName as a label to the executor pods could simplify debugging.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36566) Add Spark appname as a label to the executor pods

2021-11-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36566:
-

Assignee: Apache Spark

> Add Spark appname as a label to the executor pods
> -
>
> Key: SPARK-36566
> URL: https://issues.apache.org/jira/browse/SPARK-36566
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Trivial
>
> Adding the appName as a label to the executor pods could simplify debugging.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37060) Report driver status does not handle response from backup masters

2021-11-03 Thread Mohamadreza Rostami (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohamadreza Rostami updated SPARK-37060:

Priority: Critical  (was: Major)

> Report driver status does not handle response from backup masters
> -
>
> Key: SPARK-37060
> URL: https://issues.apache.org/jira/browse/SPARK-37060
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
>Reporter: Mohamadreza Rostami
>Priority: Critical
>
> Since the improvement in SPARK-31486, the client uses the 
> 'asyncSendToMasterAndForwardReply' method instead of 
> 'activeMasterEndpoint.askSync' to get the status of the driver. Because the 
> driver's status is only available on the active master and the 
> 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we 
> have to handle the responses from the backup masters in the client, which the 
> SPARK-31486 change did not take into account. So drivers running in cluster 
> mode on a multi-master cluster are affected by this bug. I have created a patch 
> for this bug and will send a pull request soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37202) Temp view didn't collect temp function that registered with catalog API

2021-11-03 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-37202:
---
Issue Type: Bug  (was: Task)

> Temp view didn't collect temp function that registered with catalog API
> ---
>
> Key: SPARK-37202
> URL: https://issues.apache.org/jira/browse/SPARK-37202
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37149) Improve error messages for arithmetic overflow under ANSI mode

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438194#comment-17438194
 ] 

Apache Spark commented on SPARK-37149:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/34479

> Improve error messages for arithmetic overflow under ANSI mode
> --
>
> Key: SPARK-37149
> URL: https://issues.apache.org/jira/browse/SPARK-37149
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Improve error messages for arithmetic overflow exceptions. We can instruct 
> users to 1) turn off ANSI mode or 2) use `try_` functions if applicable.
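
As a rough illustration of those two suggestions (assumes a SparkSession bound to `spark` on Spark 3.2 or later, where try_add is built in):
{code:java}
// Illustration of the two remediations the error message could point users to.
// Assumes a SparkSession bound to `spark` (e.g. spark-shell) on Spark 3.2+.
spark.conf.set("spark.sql.ansi.enabled", "true")

// Under ANSI mode this overflows and throws an ArithmeticException:
// spark.sql("SELECT 2147483647 + 1").show()

// Option 1: turn ANSI mode off, so the addition silently wraps around.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT 2147483647 + 1").show()

// Option 2: keep ANSI mode on but use a try_ function, which returns NULL on overflow.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT try_add(2147483647, 1)").show()
{code}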



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37149) Improve error messages for arithmetic overflow under ANSI mode

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438192#comment-17438192
 ] 

Apache Spark commented on SPARK-37149:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/34479

> Improve error messages for arithmetic overflow under ANSI mode
> --
>
> Key: SPARK-37149
> URL: https://issues.apache.org/jira/browse/SPARK-37149
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Improve error messages for arithmetic overflow exceptions. We can instruct 
> users to 1) turn off ANSI mode or 2) use `try_` functions if applicable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2021-11-03 Thread Naresh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438174#comment-17438174
 ] 

Naresh commented on SPARK-26365:


Yes. It's not fixed in 3.x yet. I am using Spark 3.2 and still see the issue.

 

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Spark Submit
>Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0
>Reporter: Oscar Bonilla
>Priority: Major
> Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, 
> spark-3.0.0-raise-exception-k8s-failure.patch
>
>
> When launching apps using spark-submit in a Kubernetes cluster, if the Spark 
> application fails (returns exit code = 1, for example), spark-submit will 
> still exit gracefully and return exit code = 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

2021-11-03 Thread Sergey Kotlov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Kotlov updated SPARK-37201:
--
Description: 
In this example, reading unnecessary nested fields still happens.

Data preparation:
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
 v2 and v3 columns aren't needed here, but still exist in the physical plan.
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
 
== Physical Plan ==
... ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>

{code}
If you just remove _filter_ or move _explode_ to second _select_, everything is 
fine:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}
 

*Yet another example: left_anti join after double select:*
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, field1: String, field2: String)
Seq(
  Event(Struct("v1","v2","v3"), "fld1", "fld2")
).toDF().write.mode("overwrite").saveAsTable("table")
val joinDf = Seq("id1").toDF("id")

spark.table("table")
  .select("struct", "field1")
  .select($"struct.v1", $"field1")
  .join(joinDf, $"field1" === joinDf("id"), "left_anti")
  .explain(true)

// ===> ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,field1:string>
{code}
Instead of the first select, other kinds of manipulation of the 
original df, for example {color:#00875a}.withColumn("field3", lit("f3")){color} 
or {color:#00875a}.drop("field2"){color}, will also lead to reading 
unnecessary nested fields from _struct_.

But if you just remove the first select or change the type of join, the nested 
fields will be read correctly:
{code:java}
// .select("struct", "field1")
===> ReadSchema: struct<struct:struct<v1:string>,field1:string>

.join(joinDf, $"field1" === joinDf("id"), "left")
===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
{code}
PS: The first select might look strange in the context of this example, but in 
a real system it might be part of a common API that other parts of the system 
use with their own expressions on top of it.

  was:
In this example, reading unnecessary nested fields still happens.

Data preparation:
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
 v2 and v3 columns aren't needed here, but still exist in the physical plan.
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
 
== Physical Plan ==
... ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>

{code}
If you just remove _filter_ or move _explode_ to second _select_, everything is 
fine:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}
 

*Yet another example: left_anti join after double select:*
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, field1: String, field2: String)
Seq(
  Event(Struct("v1","v2","v3"), "fld1", "fld2")
).toDF().write.mode("overwrite").saveAsTable("table")
val joinDf = Seq("id1").toDF("id")

spark.table("table")
  .select("struct", "field1")
  .select($"struct.v1", $"field1")
  .join(joinDf, $"field1" === joinDf("id"), "left_anti")
  .explain(true)

// ===> ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,field1:string>
{code}
Instead of the first select, it can be other types of manipulations with the 
original df, for example {{^~.withColumn("field3", lit("f3"))~^}} or 
.drop("field2"), which will also lead to reading unnecessary nested fields from 
_struct_.

But if you just remove the first select or change type of join, reading nested 
fields will be correct:**
{code:java}
// .select("struct", "field1")
===> ReadSchema: struct<struct:struct<v1:string>,field1:string>

.join(joinDf, $"field1" === joinDf("id"), "left")
===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
{code}
PS: The first select might look strange in the context of this example, but in 
a real system, it might be part of a common api, that other parts of the system 
use with their own expressions on top of this api.


> Spark SQL 

[jira] [Commented] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438147#comment-17438147
 ] 

Apache Spark commented on SPARK-37077:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/34478

> Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
> -
>
> Key: SPARK-37077
> URL: https://issues.apache.org/jira/browse/SPARK-37077
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> During the migration from stubs to inline annotations, variants taking {{RDD}} 
> were incorrectly removed. As a result,
>  
> {code:python}
> from pyspark.sql import SQLContext, SparkSession
> from pyspark import SparkContext
> sc = SparkContext.getOrCreate()
> sqlContext= SQLContext(sc)
> sqlContext.createDataFrame(sc.parallelize([(1, 2)]))
> {code}
> although valid, no longer type checks
> {code}
> main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" 
> matches argument type "RDD[Tuple[int, int]]"
> main.py:7: note: Possible overload variants:
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] 
> = ...) -> DataFrame
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], 
> Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame
> main.py:7: note: def createDataFrame(self, data: DataFrameLike, 
> samplingRatio: Optional[float] = ...) -> DataFrame
> main.py:7: note: <3 more non-matching overloads not shown>
> Found 1 error in 1 file (checked 1 source file)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

2021-11-03 Thread Sergey Kotlov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Kotlov updated SPARK-37201:
--
Description: 
In this example, reading unnecessary nested fields still happens.

Data preparation:
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
 v2 and v3 columns aren't needed here, but still exist in the physical plan.
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
 
== Physical Plan ==
... ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>

{code}
If you just remove _filter_ or move _explode_ to second _select_, everything is 
fine:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}
 

*Yet another example: left_anti join after double select:*
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, field1: String, field2: String)
Seq(
  Event(Struct("v1","v2","v3"), "fld1", "fld2")
).toDF().write.mode("overwrite").saveAsTable("table")
val joinDf = Seq("id1").toDF("id")

spark.table("table")
  .select("struct", "field1")
  .select($"struct.v1", $"field1")
  .join(joinDf, $"field1" === joinDf("id"), "left_anti")
  .explain(true)

// ===> ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,field1:string>
{code}
Instead of the first select, it can be other types of manipulations with the 
original df, for example {{^~.withColumn("field3", lit("f3"))~^}} or 
.drop("field2"), which will also lead to reading unnecessary nested fields from 
_struct_.

But if you just remove the first select or change type of join, reading nested 
fields will be correct:**
{code:java}
// .select("struct", "field1")
===> ReadSchema: struct<struct:struct<v1:string>,field1:string>

.join(joinDf, $"field1" === joinDf("id"), "left")
===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
{code}
PS: The first select might look strange in the context of this example, but in 
a real system, it might be part of a common api, that other parts of the system 
use with their own expressions on top of this api.

  was:
In this example, reading unnecessary nested fields still happens.

Data preparation:
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
 v2 and v3 columns aren't needed here, but still exist in the physical plan.
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
 
== Physical Plan ==
... ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>

{code}
If you just remove _filter_ or move _explode_ to second _select_, everything is 
fine:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}
 

*Yet another example: left_anti join after double select:*
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, field1: String, field2: String)
Seq(
  Event(Struct("v1","v2","v3"), "fld1", "fld2")
).toDF().write.mode("overwrite").saveAsTable("table")
val joinDf = Seq("id1").toDF("id")

spark.table("table")
  .select("struct", "field1")
  .select($"struct.v1", $"field1")
  .join(joinDf, $"field1" === joinDf("id"), "left_anti")
  .explain(true)

// ===> ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,field1:string>
{code}
If you just remove the first select or change type of join, reading nested 
fields will be correct:**
{code:java}
// .select("struct", "field1")
===> ReadSchema: struct<struct:struct<v1:string>,field1:string>

.join(joinDf, $"field1" === joinDf("id"), "left")
===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
{code}
PS: The first select might look strange in the context of this example, but in 
a real system, it might be part of a common api, that other parts of the system 
use with their own expressions on top of this api.


> Spark SQL reads unnecessary nested fields (filter after explode)
> 
>
> Key: SPARK-37201
> URL: https://issues.apache.org/jira/browse/SPARK-37201
> Project: Spark
>  Issue 

[jira] [Commented] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438149#comment-17438149
 ] 

Apache Spark commented on SPARK-37077:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/34478

> Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
> -
>
> Key: SPARK-37077
> URL: https://issues.apache.org/jira/browse/SPARK-37077
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> During the migration from stubs to inline annotations, variants taking {{RDD}} 
> were incorrectly removed. As a result,
>  
> {code:python}
> from pyspark.sql import SQLContext, SparkSession
> from pyspark import SparkContext
> sc = SparkContext.getOrCreate()
> sqlContext= SQLContext(sc)
> sqlContext.createDataFrame(sc.parallelize([(1, 2)]))
> {code}
> although valid, no longer type checks
> {code}
> main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" 
> matches argument type "RDD[Tuple[int, int]]"
> main.py:7: note: Possible overload variants:
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] 
> = ...) -> DataFrame
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], 
> Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame
> main.py:7: note: def createDataFrame(self, data: DataFrameLike, 
> samplingRatio: Optional[float] = ...) -> DataFrame
> main.py:7: note: <3 more non-matching overloads not shown>
> Found 1 error in 1 file (checked 1 source file)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438146#comment-17438146
 ] 

Apache Spark commented on SPARK-37077:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/34478

> Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
> -
>
> Key: SPARK-37077
> URL: https://issues.apache.org/jira/browse/SPARK-37077
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> During the migration from stubs to inline annotations, variants taking {{RDD}} 
> were incorrectly removed. As a result,
>  
> {code:python}
> from pyspark.sql import SQLContext, SparkSession
> from pyspark import SparkContext
> sc = SparkContext.getOrCreate()
> sqlContext= SQLContext(sc)
> sqlContext.createDataFrame(sc.parallelize([(1, 2)]))
> {code}
> although valid, no longer type checks
> {code}
> main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" 
> matches argument type "RDD[Tuple[int, int]]"
> main.py:7: note: Possible overload variants:
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] 
> = ...) -> DataFrame
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], 
> Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame
> main.py:7: note: def createDataFrame(self, data: DataFrameLike, 
> samplingRatio: Optional[float] = ...) -> DataFrame
> main.py:7: note: <3 more non-matching overloads not shown>
> Found 1 error in 1 file (checked 1 source file)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438142#comment-17438142
 ] 

Apache Spark commented on SPARK-36894:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/34478

> RDD.toDF should be synchronized with dispatched variants of 
> SparkSession.createDataFrame
> 
>
> Key: SPARK-36894
> URL: https://issues.apache.org/jira/browse/SPARK-36894
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> There are some variants that are supported:
>  * Providing a schema as a {{str}} object for {{RDD[RowLike]}} objects
>  * Providing a schema as a {{Tuple[str, ...]}} of names
>  * Calling {{toDF}} on an {{RDD}} of atomic values, when a schema of {{str}} or 
> {{AtomicType}} is provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438141#comment-17438141
 ] 

Apache Spark commented on SPARK-36894:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/34478

> RDD.toDF should be synchronized with dispatched variants of 
> SparkSession.createDataFrame
> 
>
> Key: SPARK-36894
> URL: https://issues.apache.org/jira/browse/SPARK-36894
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> There are some variants that are supported:
>  * Providing a schema as a {{str}} object for {{RDD[RowLike]}} objects
>  * Providing a schema as a {{Tuple[str, ...]}} of names
>  * Calling {{toDF}} on an {{RDD}} of atomic values, when a schema of {{str}} or 
> {{AtomicType}} is provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37204) Update Apache Parent POM version to 24 in the pom.xml

2021-11-03 Thread Janardhan Pulivarthi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438131#comment-17438131
 ] 

Janardhan Pulivarthi commented on SPARK-37204:
--

Hi, I am new here. Can I work on this?

> Update Apache Parent POM version to 24 in the pom.xml
> -
>
> Key: SPARK-37204
> URL: https://issues.apache.org/jira/browse/SPARK-37204
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Janardhan Pulivarthi
>Priority: Minor
>
> Some nice things about version 24:
> 1. Deploy SHA-512 for source-release to Remote Repository
> 2. Reproducible builds option
>  
> Resources:
> [1] [https://lists.apache.org/thread/9wk97dwjlcoxlk1onxotfo8k98b2v0sk]
> [2] [https://maven.apache.org/guides/mini/guide-reproducible-builds.html]
> [3] 
> [https://github.com/apache/maven-apache-parent/compare/apache-18...apache-24diff]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37204) Update Apache Parent POM version to 24 in the pom.xml

2021-11-03 Thread Janardhan Pulivarthi (Jira)
Janardhan Pulivarthi created SPARK-37204:


 Summary: Update Apache Parent POM version to 24 in the pom.xml
 Key: SPARK-37204
 URL: https://issues.apache.org/jira/browse/SPARK-37204
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 3.2.0
Reporter: Janardhan Pulivarthi


Some nice things about version 24:

1. Deploy SHA-512 for source-release to Remote Repository
2. Reproducible builds option

 

Resources:

[1] [https://lists.apache.org/thread/9wk97dwjlcoxlk1onxotfo8k98b2v0sk]
[2] [https://maven.apache.org/guides/mini/guide-reproducible-builds.html]
[3] 
[https://github.com/apache/maven-apache-parent/compare/apache-18...apache-24diff]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37204) Update Apache Parent POM version to 24 in the pom.xml

2021-11-03 Thread Janardhan Pulivarthi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Janardhan Pulivarthi updated SPARK-37204:
-
Priority: Minor  (was: Major)

> Update Apache Parent POM version to 24 in the pom.xml
> -
>
> Key: SPARK-37204
> URL: https://issues.apache.org/jira/browse/SPARK-37204
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Janardhan Pulivarthi
>Priority: Minor
>
> Some nice things about version 24:
> 1. Deploy SHA-512 for source-release to Remote Repository
> 2. Reproducible builds option
>  
> Resources:
> [1] [https://lists.apache.org/thread/9wk97dwjlcoxlk1onxotfo8k98b2v0sk]
> [2] [https://maven.apache.org/guides/mini/guide-reproducible-builds.html]
> [3] 
> [https://github.com/apache/maven-apache-parent/compare/apache-18...apache-24diff]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

2021-11-03 Thread Sergey Kotlov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Kotlov updated SPARK-37201:
--
Description: 
In this example, reading unnecessary nested fields still happens.

Data preparation:
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
 v2 and v3 columns aren't needed here, but still exist in the physical plan.
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
 
== Physical Plan ==
... ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>

{code}
If you just remove _filter_ or move _explode_ to second _select_, everything is 
fine:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}
 

*Yet another example: left_anti join after double select:*
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, field1: String, field2: String)
Seq(
  Event(Struct("v1","v2","v3"), "fld1", "fld2")
).toDF().write.mode("overwrite").saveAsTable("table")
val joinDf = Seq("id1").toDF("id")

spark.table("table")
  .select("struct", "field1")
  .select($"struct.v1", $"field1")
  .join(joinDf, $"field1" === joinDf("id"), "left_anti")
  .explain(true)

// ===> ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,field1:string>
{code}
If you just remove the first select or change type of join, reading nested 
fields will be correct:**
{code:java}
// .select("struct", "field1")
===> ReadSchema: struct<struct:struct<v1:string>,field1:string>

.join(joinDf, $"field1" === joinDf("id"), "left")
===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
{code}
PS: The first select might look strange in the context of this example, but in 
a real system, it might be part of a common api, that other parts of the system 
use with their own expressions on top of this api.

  was:
In this example, reading unnecessary nested fields still happens.

Data preparation:

 
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
 

v2 and v3 columns aren't needed here, but still exist in the physical plan.
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
 
== Physical Plan ==
... ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>

{code}
If you just remove _filter_ or move _explode_ to second _select_, everything is 
fine:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}


> Spark SQL reads unnecessary nested fields (filter after explode)
> 
>
> Key: SPARK-37201
> URL: https://issues.apache.org/jira/browse/SPARK-37201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sergey Kotlov
>Priority: Major
>
> In this example, reading unnecessary nested fields still happens.
> Data preparation:
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, array: Seq[String])
> Seq(
>   Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
> ).toDF().write.mode("overwrite").saveAsTable("table")
> {code}
>  v2 and v3 columns aren't needed here, but still exist in the physical plan.
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>  
> == Physical Plan ==
> ... ReadSchema: 
> struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>
> {code}
> If you just remove _filter_ or move _explode_ to second _select_, everything 
> is fine:
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   //.filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> spark.table("table")
>   .select($"struct.v1", $"array")
>   .select($"v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> {code}
>  
> *Yet another example: left_anti join after double 

[jira] [Commented] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438089#comment-17438089
 ] 

Apache Spark commented on SPARK-37077:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34477

> Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
> -
>
> Key: SPARK-37077
> URL: https://issues.apache.org/jira/browse/SPARK-37077
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> During the migration from stubs to inline annotations, variants taking {{RDD}} 
> were incorrectly removed. As a result,
>  
> {code:python}
> from pyspark.sql import SQLContext, SparkSession
> from pyspark import SparkContext
> sc = SparkContext.getOrCreate()
> sqlContext= SQLContext(sc)
> sqlContext.createDataFrame(sc.parallelize([(1, 2)]))
> {code}
> although valid, no longer type checks
> {code}
> main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" 
> matches argument type "RDD[Tuple[int, int]]"
> main.py:7: note: Possible overload variants:
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] 
> = ...) -> DataFrame
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], 
> Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame
> main.py:7: note: def createDataFrame(self, data: DataFrameLike, 
> samplingRatio: Optional[float] = ...) -> DataFrame
> main.py:7: note: <3 more non-matching overloads not shown>
> Found 1 error in 1 file (checked 1 source file)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438087#comment-17438087
 ] 

Apache Spark commented on SPARK-37077:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34477

> Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
> -
>
> Key: SPARK-37077
> URL: https://issues.apache.org/jira/browse/SPARK-37077
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> During migration from stubs to inline annotations, variants taking {{RDD}} 
> were incorrectly removed. As a result
>  
> {code:python}
> from pyspark.sql import SQLContext, SparkSession
> from pyspark import SparkContext
> sc = SparkContext.getOrCreate()
> sqlContext= SQLContext(sc)
> sqlContext.createDataFrame(sc.parallelize([(1, 2)]))
> {code}
> although valid, no longer type checks
> {code}
> main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" 
> matches argument type "RDD[Tuple[int, int]]"
> main.py:7: note: Possible overload variants:
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] 
> = ...) -> DataFrame
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], 
> Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame
> main.py:7: note: def createDataFrame(self, data: DataFrameLike, 
> samplingRatio: Optional[float] = ...) -> DataFrame
> main.py:7: note: <3 more non-matching overloads not shown>
> Found 1 error in 1 file (checked 1 source file)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438085#comment-17438085
 ] 

Apache Spark commented on SPARK-36894:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34477

> RDD.toDF should be synchronized with dispatched variants of 
> SparkSession.createDataFrame
> 
>
> Key: SPARK-36894
> URL: https://issues.apache.org/jira/browse/SPARK-36894
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> There are some variants that are supported:
>  * Providing a schema as a {{str}} object for {{RDD[RowLike]}} objects
>  * Providing a schema as a {{Tuple[str, ...]}} of names
>  * Calling {{toDF}} on {{RDD}} of atomic values, when schema of {{str}} or 
> {{AtomicType}} is provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438086#comment-17438086
 ] 

Apache Spark commented on SPARK-37077:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34477

> Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
> -
>
> Key: SPARK-37077
> URL: https://issues.apache.org/jira/browse/SPARK-37077
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> During migration from stubs to inline annotations, variants taking {{RDD}} 
> were incorrectly removed. As a result
>  
> {code:python}
> from pyspark.sql import SQLContext, SparkSession
> from pyspark import SparkContext
> sc = SparkContext.getOrCreate()
> sqlContext= SQLContext(sc)
> sqlContext.createDataFrame(sc.parallelize([(1, 2)]))
> {code}
> although valid, no longer type checks
> {code}
> main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" 
> matches argument type "RDD[Tuple[int, int]]"
> main.py:7: note: Possible overload variants:
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] 
> = ...) -> DataFrame
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], 
> Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame
> main.py:7: note: def createDataFrame(self, data: DataFrameLike, 
> samplingRatio: Optional[float] = ...) -> DataFrame
> main.py:7: note: <3 more non-matching overloads not shown>
> Found 1 error in 1 file (checked 1 source file)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438084#comment-17438084
 ] 

Apache Spark commented on SPARK-36894:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34477

> RDD.toDF should be synchronized with dispatched variants of 
> SparkSession.createDataFrame
> 
>
> Key: SPARK-36894
> URL: https://issues.apache.org/jira/browse/SPARK-36894
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> There are some variants that are supported:
>  * Providing a schema as a {{str}} object for {{RDD[RowLike]}} objects
>  * Providing a schema as a {{Tuple[str, ...]}} of names
>  * Calling {{toDF}} on {{RDD}} of atomic values, when schema of {{str}} or 
> {{AtomicType}} is provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37030) Maven build failed in windows!

2021-11-03 Thread Shockang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shockang resolved SPARK-37030.
--
Resolution: Done

> Maven build failed in windows!
> --
>
> Key: SPARK-37030
> URL: https://issues.apache.org/jira/browse/SPARK-37030
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
> Environment: OS: Windows 10 Professional
> OS Version: 21H1
> Maven Version: 3.6.3
>  
>Reporter: Shockang
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: image-2021-10-17-22-18-16-616.png
>
>
> I pulled the latest Spark master code on my local windows 10 computer and 
> executed the following command:
> {code:java}
> mvn -DskipTests clean install{code}
> Build failed!
> !image-2021-10-17-22-18-16-616.png!
> {code:java}
> Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run 
> (default) on project spark-core_2.12: An Ant BuildException has occured: 
> Execute failed: java.io.IOException: Cannot run program "bash" (in directory 
> "C:\bigdata\spark\core"): CreateProcess error=2{code}
> It seems that the maven-antrun-plugin cannot run because there is no bash on 
> Windows.
> The following code comes from pom.xml in spark-core module.
> {code:java}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-antrun-plugin</artifactId>
>   <executions>
>     <execution>
>       <phase>generate-resources</phase>
>       <configuration>
>         <!-- Execute the shell script to generate the spark build information. -->
>         <target>
>           <exec executable="bash">
>             <arg value="${project.basedir}/../build/spark-build-info"/>
>             <arg value="${project.build.directory}/extra-resources"/>
>             <arg value="${project.version}"/>
>           </exec>
>         </target>
>       </configuration>
>       <goals>
>         <goal>run</goal>
>       </goals>
>     </execution>
>   </executions>
> </plugin>
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37030) Maven build failed in windows!

2021-11-03 Thread Shockang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438017#comment-17438017
 ] 

Shockang commented on SPARK-37030:
--

[~hyukjin.kwon] Thank you for your suggestion. This problem has been solved.

> Maven build failed in windows!
> --
>
> Key: SPARK-37030
> URL: https://issues.apache.org/jira/browse/SPARK-37030
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
> Environment: OS: Windows 10 Professional
> OS Version: 21H1
> Maven Version: 3.6.3
>  
>Reporter: Shockang
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: image-2021-10-17-22-18-16-616.png
>
>
> I pulled the latest Spark master code on my local windows 10 computer and 
> executed the following command:
> {code:java}
> mvn -DskipTests clean install{code}
> Build failed!
> !image-2021-10-17-22-18-16-616.png!
> {code:java}
> Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run 
> (default) on project spark-core_2.12: An Ant BuildException has occured: 
> Execute failed: java.io.IOException: Cannot run program "bash" (in directory 
> "C:\bigdata\spark\core"): CreateProcess error=2{code}
> It seems that the maven-antrun-plugin cannot run because there is no bash on 
> Windows.
> The following code comes from pom.xml in spark-core module.
> {code:java}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-antrun-plugin</artifactId>
>   <executions>
>     <execution>
>       <phase>generate-resources</phase>
>       <configuration>
>         <!-- Execute the shell script to generate the spark build information. -->
>         <target>
>           <exec executable="bash">
>             <arg value="${project.basedir}/../build/spark-build-info"/>
>             <arg value="${project.build.directory}/extra-resources"/>
>             <arg value="${project.version}"/>
>           </exec>
>         </target>
>       </configuration>
>       <goals>
>         <goal>run</goal>
>       </goals>
>     </execution>
>   </executions>
> </plugin>
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken

2021-11-03 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37077.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34146
[https://github.com/apache/spark/pull/34146]

> Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
> -
>
> Key: SPARK-37077
> URL: https://issues.apache.org/jira/browse/SPARK-37077
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> During migration from stubs to inline annotations, variants taking {{RDD}} 
> were incorrectly removed. As a result
>  
> {code:python}
> from pyspark.sql import SQLContext, SparkSession
> from pyspark import SparkContext
> sc = SparkContext.getOrCreate()
> sqlContext= SQLContext(sc)
> sqlContext.createDataFrame(sc.parallelize([(1, 2)]))
> {code}
> although valid, no longer type checks
> {code}
> main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" 
> matches argument type "RDD[Tuple[int, int]]"
> main.py:7: note: Possible overload variants:
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] 
> = ...) -> DataFrame
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], 
> Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame
> main.py:7: note: def createDataFrame(self, data: DataFrameLike, 
> samplingRatio: Optional[float] = ...) -> DataFrame
> main.py:7: note: <3 more non-matching overloads not shown>
> Found 1 error in 1 file (checked 1 source file)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame

2021-11-03 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-36894.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34146
[https://github.com/apache/spark/pull/34146]

> RDD.toDF should be synchronized with dispatched variants of 
> SparkSession.createDataFrame
> 
>
> Key: SPARK-36894
> URL: https://issues.apache.org/jira/browse/SPARK-36894
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> There are some variants that are supported:
>  * Providing a schema as a {{str}} object for {{RDD[RowLike]}} objects
>  * Providing a schema as a {{Tuple[str, ...]}} of names
>  * Calling {{toDF}} on {{RDD}} of atomic values, when schema of {{str}} or 
> {{AtomicType}} is provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken

2021-11-03 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37077:
--

Assignee: Maciej Szymkiewicz

> Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
> -
>
> Key: SPARK-37077
> URL: https://issues.apache.org/jira/browse/SPARK-37077
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> During migration from stubs to inline annotations, variants taking {{RDD}} 
> were incorrectly removed. As a result
>  
> {code:python}
> from pyspark.sql import SQLContext, SparkSession
> from pyspark import SparkContext
> sc = SparkContext.getOrCreate()
> sqlContext= SQLContext(sc)
> sqlContext.createDataFrame(sc.parallelize([(1, 2)]))
> {code}
> although valid, no longer type checks
> {code}
> main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" 
> matches argument type "RDD[Tuple[int, int]]"
> main.py:7: note: Possible overload variants:
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] 
> = ...) -> DataFrame
> main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] 
> createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], 
> Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame
> main.py:7: note: def createDataFrame(self, data: DataFrameLike, 
> samplingRatio: Optional[float] = ...) -> DataFrame
> main.py:7: note: <3 more non-matching overloads not shown>
> Found 1 error in 1 file (checked 1 source file)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame

2021-11-03 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-36894:
--

Assignee: Maciej Szymkiewicz

> RDD.toDF should be synchronized with dispatched variants of 
> SparkSession.createDataFrame
> 
>
> Key: SPARK-36894
> URL: https://issues.apache.org/jira/browse/SPARK-36894
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
>
> There are some variants that are supported:
>  * Providing a schema as a {{str}} object for {{RDD[RowLike]}} objects
>  * Providing a schema as a {{Tuple[str, ...]}} of names
>  * Calling {{toDF}} on {{RDD}} of atomic values, when schema of {{str}} or 
> {{AtomicType}} is provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37195) Unify v1 and v2 SHOW TBLPROPERTIES tests

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437871#comment-17437871
 ] 

Apache Spark commented on SPARK-37195:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34476

> Unify v1 and v2 SHOW TBLPROPERTIES  tests
> -
>
> Key: SPARK-37195
> URL: https://issues.apache.org/jira/browse/SPARK-37195
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> Unify v1 and v2 SHOW TBLPROPERTIES tests



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37195) Unify v1 and v2 SHOW TBLPROPERTIES tests

2021-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37195:


Assignee: Apache Spark

> Unify v1 and v2 SHOW TBLPROPERTIES  tests
> -
>
> Key: SPARK-37195
> URL: https://issues.apache.org/jira/browse/SPARK-37195
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> Unify v1 and v2 SHOW TBLPROPERTIES tests



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37195) Unify v1 and v2 SHOW TBLPROPERTIES tests

2021-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37195:


Assignee: (was: Apache Spark)

> Unify v1 and v2 SHOW TBLPROPERTIES  tests
> -
>
> Key: SPARK-37195
> URL: https://issues.apache.org/jira/browse/SPARK-37195
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> Unify v1 and v2 SHOW TBLPROPERTIES tests



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37195) Unify v1 and v2 SHOW TBLPROPERTIES tests

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437869#comment-17437869
 ] 

Apache Spark commented on SPARK-37195:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34476

> Unify v1 and v2 SHOW TBLPROPERTIES  tests
> -
>
> Key: SPARK-37195
> URL: https://issues.apache.org/jira/browse/SPARK-37195
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> Unify v1 and v2 SHOW TBLPROPERTIES tests



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2021-11-03 Thread Oscar Bonilla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437830#comment-17437830
 ] 

Oscar Bonilla commented on SPARK-26365:
---

I've changed the priority to Major, to see if someone can pick it up and fix it

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Spark Submit
>Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0
>Reporter: Oscar Bonilla
>Priority: Major
> Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, 
> spark-3.0.0-raise-exception-k8s-failure.patch
>
>
> When launching apps using spark-submit in a kubernetes cluster, if the Spark 
> application fails (returns exit code = 1, for example), spark-submit will 
> still exit gracefully and return exit code = 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.
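
Until this is fixed, one pattern for detecting failures programmatically is to 
drive the submission through the launcher API and watch the reported 
application state instead of trusting the spark-submit exit code. The snippet 
below is only an illustrative sketch: the jar path, main class and master URL 
are placeholders, and whether the handle actually reaches a final state in k8s 
cluster mode is closely related to the problem reported here, so it is not a 
guaranteed workaround.

{code:java}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Launch the application and poll the handle instead of relying on the
// exit code of spark-submit. All concrete values below are placeholders.
val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")
  .setMainClass("com.example.Main")
  .setMaster("k8s://https://kubernetes.example.com:6443")
  .setDeployMode("cluster")
  .startApplication()

while (!handle.getState.isFinal) {
  Thread.sleep(5000)
}

// Propagate a non-zero exit code ourselves if the driver did not finish cleanly.
if (handle.getState != SparkAppHandle.State.FINISHED) {
  sys.exit(1)
}
{code}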



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2021-11-03 Thread Oscar Bonilla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oscar Bonilla updated SPARK-26365:
--
Affects Version/s: 3.0.0
   3.1.0

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Spark Submit
>Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0
>Reporter: Oscar Bonilla
>Priority: Major
> Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, 
> spark-3.0.0-raise-exception-k8s-failure.patch
>
>
> When launching apps using spark-submit in a kubernetes cluster, if the Spark 
> application fails (returns exit code = 1, for example), spark-submit will 
> still exit gracefully and return exit code = 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2021-11-03 Thread Oscar Bonilla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oscar Bonilla updated SPARK-26365:
--
Priority: Major  (was: Minor)

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Spark Submit
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Oscar Bonilla
>Priority: Major
> Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, 
> spark-3.0.0-raise-exception-k8s-failure.patch
>
>
> When launching apps using spark-submit in a kubernetes cluster, if the Spark 
> application fails (returns exit code = 1, for example), spark-submit will 
> still exit gracefully and return exit code = 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2021-11-03 Thread Vivien Brissat (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437825#comment-17437825
 ] 

Vivien Brissat commented on SPARK-26365:


Hi [~oscar.bonilla], this is not fixed: I ran tests on version 3.1 and found 
this Jira issue while looking for a solution to my problem.

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Spark Submit
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Oscar Bonilla
>Priority: Minor
> Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, 
> spark-3.0.0-raise-exception-k8s-failure.patch
>
>
> When launching apps using spark-submit in a kubernetes cluster, if the Spark 
> application fails (returns exit code = 1, for example), spark-submit will 
> still exit gracefully and return exit code = 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437805#comment-17437805
 ] 

Apache Spark commented on SPARK-37203:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34474

> Fix NotSerializableException when observe with percentile_approx
> 
>
> Key: SPARK-37203
> URL: https://issues.apache.org/jira/browse/SPARK-37203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> val namedObservation = Observation("named")
> val df = spark.range(100)
> val observed_df = df.observe(
>namedObservation, percentile_approx($"id", lit(0.5), 
> lit(100)).as("percentile_approx_val"))
> observed_df.collect()
> namedObservation.get
> {code}
> throws exception as follows:
> {code:java}
> 15:16:27.994 ERROR org.apache.spark.util.Utils: Exception encountered
> java.io.NotSerializableException: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1434)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51)
>   at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx

2021-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37203:


Assignee: (was: Apache Spark)

> Fix NotSerializableException when observe with percentile_approx
> 
>
> Key: SPARK-37203
> URL: https://issues.apache.org/jira/browse/SPARK-37203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> val namedObservation = Observation("named")
> val df = spark.range(100)
> val observed_df = df.observe(
>namedObservation, percentile_approx($"id", lit(0.5), 
> lit(100)).as("percentile_approx_val"))
> observed_df.collect()
> namedObservation.get
> {code}
> throws exception as follows:
> {code:java}
> 15:16:27.994 ERROR org.apache.spark.util.Utils: Exception encountered
> java.io.NotSerializableException: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1434)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51)
>   at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx

2021-11-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437803#comment-17437803
 ] 

Apache Spark commented on SPARK-37203:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34474

> Fix NotSerializableException when observe with percentile_approx
> 
>
> Key: SPARK-37203
> URL: https://issues.apache.org/jira/browse/SPARK-37203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> val namedObservation = Observation("named")
> val df = spark.range(100)
> val observed_df = df.observe(
>namedObservation, percentile_approx($"id", lit(0.5), 
> lit(100)).as("percentile_approx_val"))
> observed_df.collect()
> namedObservation.get
> {code}
> throws exception as follows:
> {code:java}
> 15:16:27.994 ERROR org.apache.spark.util.Utils: Exception encountered
> java.io.NotSerializableException: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1434)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51)
>   at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx

2021-11-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37203:


Assignee: Apache Spark

> Fix NotSerializableException when observe with percentile_approx
> 
>
> Key: SPARK-37203
> URL: https://issues.apache.org/jira/browse/SPARK-37203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> val namedObservation = Observation("named")
> val df = spark.range(100)
> val observed_df = df.observe(
>namedObservation, percentile_approx($"id", lit(0.5), 
> lit(100)).as("percentile_approx_val"))
> observed_df.collect()
> namedObservation.get
> {code}
> throws exception as follows:
> {code:java}
> 15:16:27.994 ERROR org.apache.spark.util.Utils: Exception encountered
> java.io.NotSerializableException: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1434)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51)
>   at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx

2021-11-03 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437780#comment-17437780
 ] 

jiaan.geng commented on SPARK-37203:


I'm working on it.

> Fix NotSerializableException when observe with percentile_approx
> 
>
> Key: SPARK-37203
> URL: https://issues.apache.org/jira/browse/SPARK-37203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> val namedObservation = Observation("named")
> val df = spark.range(100)
> val observed_df = df.observe(
>namedObservation, percentile_approx($"id", lit(0.5), 
> lit(100)).as("percentile_approx_val"))
> observed_df.collect()
> namedObservation.get
> {code}
> throws exception as follows:
> {code:java}
> 15:16:27.994 ERROR org.apache.spark.util.Utils: Exception encountered
> java.io.NotSerializableException: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1434)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51)
>   at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx

2021-11-03 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-37203:
--

 Summary: Fix NotSerializableException when observe with 
percentile_approx
 Key: SPARK-37203
 URL: https://issues.apache.org/jira/browse/SPARK-37203
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng



{code:java}
val namedObservation = Observation("named")

val df = spark.range(100)
val observed_df = df.observe(
   namedObservation, percentile_approx($"id", lit(0.5), 
lit(100)).as("percentile_approx_val"))

observed_df.collect()
namedObservation.get
{code}

throws exception as follows:

{code:java}
15:16:27.994 ERROR org.apache.spark.util.Utils: Exception encountered
java.io.NotSerializableException: 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55)
at 
org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1434)
at 
org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51)
at 
java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
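
Until the serialization issue is fixed, a possible workaround (a sketch, not 
the actual fix) is to compute the same metric through a regular aggregation 
instead of {{observe}}, which avoids sending the PercentileDigest buffer back 
through the observed-metrics path:

{code:java}
import org.apache.spark.sql.functions.{lit, percentile_approx}
import spark.implicits._  // assumes an existing SparkSession named `spark`

val df = spark.range(100)

// Same approximate median, but computed as a normal aggregation rather than
// via observe(), so the digest is not Java-serialized in the task result.
val result = df.agg(
  percentile_approx($"id", lit(0.5), lit(100)).as("percentile_approx_val"))

result.show()
{code}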





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37200) Drop index support

2021-11-03 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-37200.

Fix Version/s: 3.3.0
 Assignee: Huaxin Gao
   Resolution: Fixed

> Drop index support
> --
>
> Key: SPARK-37200
> URL: https://issues.apache.org/jira/browse/SPARK-37200
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37179) ANSI mode: Add a config to allow casting between Datetime and Numeric

2021-11-03 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-37179:
---
Description: 
Add a config `spark.sql.ansi.allowCastBetweenDatetimeAndNumeric` to allow 
casting between Datetime and Numeric. The default value of the configuration is 
`false`.
Also, casting double/float type to timestamp should raise exceptions if there 
is overflow or the input is NaN/infinite.

This is for better adoption of ANSI SQL mode:
- As we did some data science, we found that many Spark SQL users are actually 
using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. There are 
also some usages of `Cast(Date as Numeric)`.
- The Spark SQL connector for Tableau is using this feature for DateTime math. 
e.g.
 `CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS 
TIMESTAMP)`

So, having a new configuration can provide users with an alternative choice on 
turning on ANSI mode.

  was:
We should allow the casting between Timestamp and Numeric types:
* As we did some data science, we found that many Spark SQL users are actually 
using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. 
* The Spark SQL connector for Tableau is using this feature for DateTime math. 
e.g.
{code:java}
CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS 
TIMESTAMP)
{code}
* In the current syntax, we specially allow Numeric <=> Boolean and String <=> 
Binary since they are straightforward and frequently used. I suggest we allow 
Timestamp <=> Numeric as well for better ANSI mode adoption.


> ANSI mode: Add a config to allow casting between Datetime and Numeric
> -
>
> Key: SPARK-37179
> URL: https://issues.apache.org/jira/browse/SPARK-37179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Add a config `spark.sql.ansi.allowCastBetweenDatetimeAndNumeric` to allow 
> casting between Datetime and Numeric. The default value of the configuration 
> is `false`.
> Also, casting double/float type to timestamp should raise exceptions if there 
> is overflow or the input is NaN/infinite.
> This is for better adoption of ANSI SQL mode:
> - As we did some data science, we found that many Spark SQL users are 
> actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. 
> There are also some usages of `Cast(Date as Numeric)`.
> - The Spark SQL connector for Tableau is using this feature for DateTime 
> math. e.g.
>  `CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS 
> TIMESTAMP)`
> So, having a new configuration can provide users with an alternative choice 
> on turning on ANSI mode.
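
A minimal sketch of how the proposed flag would be used together with ANSI mode 
(the config name is taken from the description above; the exact name, default 
and behavior may differ in the final change):

{code:java}
// Sketch only: turn on ANSI mode plus the proposed escape hatch for
// datetime <-> numeric casts described in this ticket.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.conf.set("spark.sql.ansi.allowCastBetweenDatetimeAndNumeric", "true")

// With the flag enabled, Tableau-style datetime math like the expression
// quoted above keeps working under ANSI mode.
spark.sql(
  """SELECT CAST(FROM_UNIXTIME(CAST(CAST(ts AS BIGINT) + (1 * 86400) AS BIGINT)) AS TIMESTAMP)
    |FROM VALUES (TIMESTAMP'2021-01-01 00:00:00') AS t(ts)""".stripMargin
).show(truncate = false)
{code}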



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37179) ANSI mode: Add a config to allow casting between Datetime and Numeric

2021-11-03 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-37179:
---
Summary: ANSI mode: Add a config to allow casting between Datetime and 
Numeric  (was: ANSI mode: Allow casting between Timestamp and Numeric)

> ANSI mode: Add a config to allow casting between Datetime and Numeric
> -
>
> Key: SPARK-37179
> URL: https://issues.apache.org/jira/browse/SPARK-37179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> We should allow the casting between Timestamp and Numeric types:
> * As we did some data science, we found that many Spark SQL users are 
> actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. 
> * The Spark SQL connector for Tableau is using this feature for DateTime 
> math. e.g.
> {code:java}
> CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS 
> TIMESTAMP)
> {code}
> * In the current syntax, we specially allow Numeric <=> Boolean and String 
> <=> Binary since they are straightforward and frequently used. I suggest we 
> allow Timestamp <=> Numeric as well for better ANSI mode adoption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric

2021-11-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37179.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34459
[https://github.com/apache/spark/pull/34459]

> ANSI mode: Allow casting between Timestamp and Numeric
> --
>
> Key: SPARK-37179
> URL: https://issues.apache.org/jira/browse/SPARK-37179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> We should allow the casting between Timestamp and Numeric types:
> * As we did some data science, we found that many Spark SQL users are 
> actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. 
> * The Spark SQL connector for Tableau is using this feature for DateTime 
> math. e.g.
> {code:java}
> CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS 
> TIMESTAMP)
> {code}
> * In the current syntax, we specially allow Numeric <=> Boolean and String 
> <=> Binary since they are straightforward and frequently used. I suggest we 
> allow Timestamp <=> Numeric as well for better ANSI mode adoption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org