[jira] [Updated] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
[ https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37023: -- Affects Version/s: (was: 3.2.0) 3.3.0 > Avoid fetching merge status when shuffleMergeEnabled is false for a > shuffleDependency during retry > -- > > Key: SPARK-37023 > URL: https://issues.apache.org/jira/browse/SPARK-37023 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Ye Zhou >Priority: Major > > The assertion below in MapOutputTracker.getMapSizesByExecutorId is not > guaranteed: > {code:java} > assert(mapSizesByExecutorId.enableBatchFetch == true){code} > The reason is that during some stage retry cases, the > shuffleDependency.shuffleMergeEnabled is set to false, but there will be > merge statuses since the Driver has collected the merge statuses for its shuffle > dependency. If this is the case, the current implementation would set > enableBatchFetch to false, since there are merge statuses. > Details can be found here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] > We should improve the implementation here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
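The retry scenario in SPARK-37023 can be sketched without Spark internals. The function names below (`current_batch_fetch`, `proposed_batch_fetch`) are hypothetical simplifications, not the actual MapOutputTracker API; they only illustrate the guard the issue asks for: when the dependency has push-based merge disabled, leftover merge statuses from a previous stage attempt should not disable batch fetch.

```python
# Hypothetical model of the decision described in the issue; not Spark's real API.

def current_batch_fetch(has_merge_statuses: bool) -> bool:
    # Reported behavior: the mere presence of merge statuses disables batch fetch.
    return not has_merge_statuses

def proposed_batch_fetch(shuffle_merge_enabled: bool, has_merge_statuses: bool) -> bool:
    # Proposed guard: consult the dependency first, so a retried stage whose
    # shuffleMergeEnabled was flipped to false ignores stale merge statuses.
    if not shuffle_merge_enabled:
        return True
    return not has_merge_statuses

# Retry case: merge disabled on the dependency, but the driver still holds
# merge statuses collected before the retry.
assert current_batch_fetch(True) is False          # batch fetch wrongly disabled
assert proposed_batch_fetch(False, True) is True   # batch fetch stays enabled
```

The key point is that the decision becomes a function of the dependency's own flag, not just of what the driver happens to have cached.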
[jira] [Updated] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
[ https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37023: -- Affects Version/s: (was: 3.3.0) 3.2.0 > Avoid fetching merge status when shuffleMergeEnabled is false for a > shuffleDependency during retry > -- > > Key: SPARK-37023 > URL: https://issues.apache.org/jira/browse/SPARK-37023 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Ye Zhou >Priority: Major > > The assertion below in MapOutputTracker.getMapSizesByExecutorId is not > guaranteed: > {code:java} > assert(mapSizesByExecutorId.enableBatchFetch == true){code} > The reason is that during some stage retry cases, the > shuffleDependency.shuffleMergeEnabled is set to false, but there will be > merge statuses since the Driver has collected the merge statuses for its shuffle > dependency. If this is the case, the current implementation would set > enableBatchFetch to false, since there are merge statuses. > Details can be found here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] > We should improve the implementation here.
[jira] [Commented] (SPARK-37206) Upgrade Avro to 1.11.0
[ https://issues.apache.org/jira/browse/SPARK-37206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438472#comment-17438472 ] Apache Spark commented on SPARK-37206: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34482 > Upgrade Avro to 1.11.0 > -- > > Key: SPARK-37206 > URL: https://issues.apache.org/jira/browse/SPARK-37206 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > Recently, Avro 1.11.0 was released, which includes a bunch of bug fixes. > https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0
[jira] [Assigned] (SPARK-37206) Upgrade Avro to 1.11.0
[ https://issues.apache.org/jira/browse/SPARK-37206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37206: Assignee: Kousuke Saruta (was: Apache Spark) > Upgrade Avro to 1.11.0 > -- > > Key: SPARK-37206 > URL: https://issues.apache.org/jira/browse/SPARK-37206 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > Recently, Avro 1.11.0 was released, which includes a bunch of bug fixes. > https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0
[jira] [Commented] (SPARK-37206) Upgrade Avro to 1.11.0
[ https://issues.apache.org/jira/browse/SPARK-37206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438473#comment-17438473 ] Apache Spark commented on SPARK-37206: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34482 > Upgrade Avro to 1.11.0 > -- > > Key: SPARK-37206 > URL: https://issues.apache.org/jira/browse/SPARK-37206 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > Recently, Avro 1.11.0 was released, which includes a bunch of bug fixes. > https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0
[jira] [Assigned] (SPARK-37206) Upgrade Avro to 1.11.0
[ https://issues.apache.org/jira/browse/SPARK-37206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37206: Assignee: Apache Spark (was: Kousuke Saruta) > Upgrade Avro to 1.11.0 > -- > > Key: SPARK-37206 > URL: https://issues.apache.org/jira/browse/SPARK-37206 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > Recently, Avro 1.11.0 was released, which includes a bunch of bug fixes. > https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0
[jira] [Created] (SPARK-37206) Upgrade Avro to 1.11.0
Kousuke Saruta created SPARK-37206: -- Summary: Upgrade Avro to 1.11.0 Key: SPARK-37206 URL: https://issues.apache.org/jira/browse/SPARK-37206 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Recently, Avro 1.11.0 was released, which includes a bunch of bug fixes. https://issues.apache.org/jira/issues/?jql=project%3DAVRO%20AND%20fixVersion%3D1.11.0
[jira] [Resolved] (SPARK-37108) Expose make_date expression in R
[ https://issues.apache.org/jira/browse/SPARK-37108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-37108. Fix Version/s: 3.3.0 Assignee: Leona Yoda Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34480 > Expose make_date expression in R > > > Key: SPARK-37108 > URL: https://issues.apache.org/jira/browse/SPARK-37108 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Assignee: Leona Yoda >Priority: Minor > Fix For: 3.3.0 > > > Expose make_date API on SparkR. > > (cf. https://issues.apache.org/jira/browse/SPARK-36554)
[jira] [Resolved] (SPARK-37054) Porting "pandas API on Spark: Internals" to PySpark docs.
[ https://issues.apache.org/jira/browse/SPARK-37054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee resolved SPARK-37054. - Resolution: Won't Do Won't do for now. > Porting "pandas API on Spark: Internals" to PySpark docs. > - > > Key: SPARK-37054 > URL: https://issues.apache.org/jira/browse/SPARK-37054 > Project: Spark > Issue Type: Improvement > Components: docs, PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > We have a > [document|https://docs.google.com/document/d/1PR88p6yMHIeSxkDkSqCxLofkcnP0YtwQ2tETfyAWLQQ/edit?usp=sharing] > for pandas API on Spark internal features, apart from the PySpark official > documents. > > Since the pandas API on Spark was officially released in Spark 3.2, it would be good to > port this internal document to the PySpark official documentation.
[jira] [Commented] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438432#comment-17438432 ] Apache Spark commented on SPARK-36989: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34481 > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using the internal test utilities of, mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non-public API. > > Data tests are useful for a number of reasons: > > * Improving test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks. > > Especially, the last two functions are not fulfilled by simple validation of > the existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has a high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations. > * Code containing {{type: ignore}}. > The biggest risk is that output matchers have to be updated when signatures > change and/or mypy output changes. > An example of a problem detected with data tests can be found in the SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]).
[jira] [Commented] (SPARK-37108) Expose make_date expression in R
[ https://issues.apache.org/jira/browse/SPARK-37108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438389#comment-17438389 ] Apache Spark commented on SPARK-37108: -- User 'yoda-mon' has created a pull request for this issue: https://github.com/apache/spark/pull/34480 > Expose make_date expression in R > > > Key: SPARK-37108 > URL: https://issues.apache.org/jira/browse/SPARK-37108 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Priority: Minor > > Expose make_date API on SparkR. > > (cf. https://issues.apache.org/jira/browse/SPARK-36554)
[jira] [Assigned] (SPARK-37108) Expose make_date expression in R
[ https://issues.apache.org/jira/browse/SPARK-37108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37108: Assignee: Apache Spark > Expose make_date expression in R > > > Key: SPARK-37108 > URL: https://issues.apache.org/jira/browse/SPARK-37108 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Assignee: Apache Spark >Priority: Minor > > Expose make_date API on SparkR. > > (cf. https://issues.apache.org/jira/browse/SPARK-36554)
[jira] [Commented] (SPARK-37108) Expose make_date expression in R
[ https://issues.apache.org/jira/browse/SPARK-37108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438386#comment-17438386 ] Apache Spark commented on SPARK-37108: -- User 'yoda-mon' has created a pull request for this issue: https://github.com/apache/spark/pull/34480 > Expose make_date expression in R > > > Key: SPARK-37108 > URL: https://issues.apache.org/jira/browse/SPARK-37108 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Priority: Minor > > Expose make_date API on SparkR. > > (cf. https://issues.apache.org/jira/browse/SPARK-36554)
[jira] [Assigned] (SPARK-37108) Expose make_date expression in R
[ https://issues.apache.org/jira/browse/SPARK-37108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37108: Assignee: (was: Apache Spark) > Expose make_date expression in R > > > Key: SPARK-37108 > URL: https://issues.apache.org/jira/browse/SPARK-37108 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Priority: Minor > > Expose make_date API on SparkR. > > (cf. https://issues.apache.org/jira/browse/SPARK-36554)
[jira] [Updated] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
[ https://issues.apache.org/jira/browse/SPARK-37205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37205: - Description: {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is not required to statically have config for all the secure HDFS clusters. Currently it only works for MRv2, but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}. (was: {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is not required to statically have config for all the secure HDFS clusters. Currently it only works for MRv2, but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} before invoking {{NMClient.startContainer}}.) > Support mapreduce.job.send-token-conf when starting containers in YARN > -- > > Key: SPARK-37205 > URL: https://issues.apache.org/jira/browse/SPARK-37205 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see > [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is > not required to statically have config for all the secure HDFS clusters. > Currently it only works for MRv2, but it'd be nice if Spark could also use this > feature. I think we only need to pass the config to > {{LaunchContainerContext}} in {{Client.createContainerLaunchContext}}.
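The idea can be sketched abstractly. The names below (`build_launch_context`, `tokens_conf`) are invented for illustration and are not the real Hadoop/YARN API: the point is only that the client attaches the token-related configuration to the launch context it already builds for each container, so the RM does not need static config for every secure HDFS cluster.

```python
from typing import Dict, List, Optional

# Hypothetical sketch only: names and structure are invented, not Hadoop's API.
def build_launch_context(commands: List[str],
                         tokens_conf: Optional[Dict[str, str]] = None) -> Dict:
    """Model of a client building a container launch context and attaching
    per-job token configuration (analogue of mapreduce.job.send-token-conf)."""
    ctx: Dict = {"commands": list(commands)}
    if tokens_conf:
        # Ship the config needed to renew delegation tokens along with the job,
        # so the RM need not know about every secure cluster statically.
        ctx["tokens_conf"] = dict(tokens_conf)
    return ctx
```

With this shape, supporting the feature is a matter of populating one more field while the launch context is assembled, which is why the reporter expects the change to be confined to `Client.createContainerLaunchContext`.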
[jira] [Created] (SPARK-37205) Support mapreduce.job.send-token-conf when starting containers in YARN
Chao Sun created SPARK-37205: Summary: Support mapreduce.job.send-token-conf when starting containers in YARN Key: SPARK-37205 URL: https://issues.apache.org/jira/browse/SPARK-37205 Project: Spark Issue Type: New Feature Components: YARN Affects Versions: 3.3.0 Reporter: Chao Sun {{mapreduce.job.send-token-conf}} is a useful feature in Hadoop (see [YARN-5910|https://issues.apache.org/jira/browse/YARN-5910]) with which the RM is not required to statically have config for all the secure HDFS clusters. Currently it only works for MRv2, but it'd be nice if Spark could also use this feature. I think we only need to pass the config to {{LaunchContainerContext}} before invoking {{NMClient.startContainer}}.
[jira] [Resolved] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36566. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34460 [https://github.com/apache/spark/pull/34460] > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Trivial > Fix For: 3.3.0 > > > Adding the appName as a label to the executor pods could simplify debugging.
[jira] [Assigned] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36566: - Assignee: Yikun Jiang (was: Apache Spark) > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Assignee: Yikun Jiang >Priority: Trivial > Fix For: 3.3.0 > > > Adding the appName as a label to the executor pods could simplify debugging.
[jira] [Assigned] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36566: - Assignee: Apache Spark > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging.
[jira] [Updated] (SPARK-37060) Report driver status does not handle response from backup masters
[ https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohamadreza Rostami updated SPARK-37060: Priority: Critical (was: Major) > Report driver status does not handle response from backup masters > - > > Key: SPARK-37060 > URL: https://issues.apache.org/jira/browse/SPARK-37060 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Mohamadreza Rostami >Priority: Critical > > After an improvement in SPARK-31486, the contributor used the > 'asyncSendToMasterAndForwardReply' method instead of > 'activeMasterEndpoint.askSync' to get the status of the driver. Since the > driver's status is only available on the active master and the > 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we > have to handle the responses from the backup masters in the client, which was > not considered in the SPARK-31486 change. So drivers running in > cluster mode on a multi-master cluster are affected by this bug. I > have created a patch for this bug and will soon send a pull request.
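The bug described above can be modeled concisely: because the status request is forwarded to every master, the client may receive a reply from a standby master before (or instead of) the active one, and must not treat that reply as the final answer. The reply classes below are invented stand-ins, not Spark's real RPC messages; they only illustrate the reply-filtering the reporter describes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Invented stand-ins for the RPC replies; not Spark's real message classes.
@dataclass
class DriverStatusResponse:
    found: bool
    state: Optional[str]

@dataclass
class MasterInStandby:
    """Reply a backup (standby) master might produce instead of a real status."""

def merge_replies(replies: Sequence[object]) -> Optional[DriverStatusResponse]:
    """Keep the first real driver status; ignore replies from standby masters."""
    for reply in replies:
        if isinstance(reply, DriverStatusResponse):
            return reply
    return None  # no active master answered
```

The client-side fix amounts to exactly this: iterate over all replies and only act on the one that carries a real driver status.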
[jira] [Updated] (SPARK-37202) Temp view didn't collect temp function that registered with catalog API
[ https://issues.apache.org/jira/browse/SPARK-37202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-37202: --- Issue Type: Bug (was: Task) > Temp view didn't collect temp function that registered with catalog API > --- > > Key: SPARK-37202 > URL: https://issues.apache.org/jira/browse/SPARK-37202 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Linhong Liu >Priority: Major >
[jira] [Commented] (SPARK-37149) Improve error messages for arithmetic overflow under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-37149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438194#comment-17438194 ] Apache Spark commented on SPARK-37149: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/34479 > Improve error messages for arithmetic overflow under ANSI mode > -- > > Key: SPARK-37149 > URL: https://issues.apache.org/jira/browse/SPARK-37149 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.3.0 > > > Improve error messages for arithmetic overflow exceptions. We can instruct > users to 1) turn off ANSI mode or 2) use `try_` functions if applicable.
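The two remedies the description mentions can be illustrated with overflow-checked 64-bit addition. This sketch mirrors the semantics involved (ANSI mode raises on overflow; a `try_`-style function returns NULL/None instead); the bounds check and the error wording are ours, not Spark code.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

def ansi_add(a: int, b: int) -> int:
    """ANSI-mode behavior: raise on overflow, with a message suggesting remedies."""
    result = a + b
    if not (INT64_MIN <= result <= INT64_MAX):
        # Wording is illustrative of the proposed improvement, not Spark's exact message.
        raise ArithmeticError(
            "long overflow. Use 'try_add' to tolerate overflow and return NULL "
            "instead, or turn off ANSI mode."
        )
    return result

def try_add(a: int, b: int):
    """try_-style behavior: None (SQL NULL) instead of an error on overflow."""
    try:
        return ansi_add(a, b)
    except ArithmeticError:
        return None
```

An improved error message like the one above is actionable because it names both escape hatches at the point of failure.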
[jira] [Commented] (SPARK-37149) Improve error messages for arithmetic overflow under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-37149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438192#comment-17438192 ] Apache Spark commented on SPARK-37149: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/34479 > Improve error messages for arithmetic overflow under ANSI mode > -- > > Key: SPARK-37149 > URL: https://issues.apache.org/jira/browse/SPARK-37149 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.3.0 > > > Improve error messages for arithmetic overflow exceptions. We can instruct > users to 1) turn off ANSI mode or 2) use `try_` functions if applicable.
[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438174#comment-17438174 ] Naresh commented on SPARK-26365: Yes. It's not fixed in 3.x yet. I am using Spark 3.2 and still see the issue. > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Spark Submit >Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0 >Reporter: Oscar Bonilla >Priority: Major > Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, > spark-3.0.0-raise-exception-k8s-failure.patch > > > When launching apps using spark-submit in a kubernetes cluster, if the Spark > application fails (returns exit code = 1 for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know if there's been a problem > with the Spark application.
[jira] [Updated] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)
[ https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Kotlov updated SPARK-37201: -- Description: In this example, reading unnecessary nested fields still happens. Data preparation: {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, array: Seq[String]) Seq( Event(Struct("v1","v2","v3"), Seq("cx1", "cx2")) ).toDF().write.mode("overwrite").saveAsTable("table") {code} v2 and v3 columns aren't needed here, but still exist in the physical plan. {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) == Physical Plan == ... ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>> {code} If you just remove _filter_ or move _explode_ to second _select_, everything is fine: {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) //.filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> spark.table("table") .select($"struct.v1", $"array") .select($"v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> {code} *Yet another example: left_anti join after double select:* {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, field1: String, field2: String) Seq( Event(Struct("v1","v2","v3"), "fld1", "fld2") ).toDF().write.mode("overwrite").saveAsTable("table") val joinDf = Seq("id1").toDF("id") spark.table("table") .select("struct", "field1") .select($"struct.v1", $"field1") .join(joinDf, $"field1" === joinDf("id"), "left_anti") .explain(true) // ===> ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,field1:string> {code} Instead of the first select, it can be other types of manipulations with the original df, for example {color:#00875a}.withColumn("field3", lit("f3")){color} or {color:#00875a}.drop("field2"){color}, which will also lead to reading unnecessary nested fields from _struct_. But if you just remove the first select or change the type of join, reading of nested fields will be correct: {code:java} // .select("struct", "field1") ===> ReadSchema: struct<struct:struct<v1:string>,field1:string> .join(joinDf, $"field1" === joinDf("id"), "left") ===> ReadSchema: struct<struct:struct<v1:string>,field1:string> {code} PS: The first select might look strange in the context of this example, but in a real system, it might be part of a common api, that other parts of the system use with their own expressions on top of this api. was: In this example, reading unnecessary nested fields still happens. Data preparation: {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, array: Seq[String]) Seq( Event(Struct("v1","v2","v3"), Seq("cx1", "cx2")) ).toDF().write.mode("overwrite").saveAsTable("table") {code} v2 and v3 columns aren't needed here, but still exist in the physical plan. {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) == Physical Plan == ... ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>> {code} If you just remove _filter_ or move _explode_ to second _select_, everything is fine: {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) //.filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> spark.table("table") .select($"struct.v1", $"array") .select($"v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> {code} *Yet another example: left_anti join after double select:* {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, field1: String, field2: String) Seq( Event(Struct("v1","v2","v3"), "fld1", "fld2") ).toDF().write.mode("overwrite").saveAsTable("table") val joinDf = Seq("id1").toDF("id") spark.table("table") .select("struct", "field1") .select($"struct.v1", $"field1") .join(joinDf, $"field1" === joinDf("id"), "left_anti") .explain(true) // ===> ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,field1:string> {code} Instead of the first select, it can be other types of manipulations with the original df, for example {{.withColumn("field3", lit("f3"))}} or {{.drop("field2")}}, which will also lead to reading unnecessary nested fields from _struct_. But if you just remove the first select or change the type of join, reading of nested fields will be correct: {code:java} // .select("struct", "field1") ===> ReadSchema: struct<struct:struct<v1:string>,field1:string> .join(joinDf, $"field1" === joinDf("id"), "left") ===> ReadSchema: struct<struct:struct<v1:string>,field1:string> {code} PS: The first select might look strange in the context of this example, but in a real system, it might be part of a common api, that other parts of the system use with their own expressions on top of this api. > Spark SQL
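What the optimizer should produce in the failing cases above can be expressed as a tiny schema-pruning function. This is a toy model (dict-based schema, an invented `prune` helper), not Spark's SchemaPruning rule; it just computes the expected pruned read schema for the examples in the report.

```python
# Toy model of nested-field pruning. A schema maps a column name to None for a
# flat column, or to a list of nested field names for a struct column.
def prune(schema, needed):
    """Keep only the top-level columns and nested struct fields that are referenced.

    `needed` holds references like "field1" or "struct.v1".
    """
    pruned = {}
    for col, fields in schema.items():
        if fields is None:
            if col in needed:
                pruned[col] = None
        else:
            kept = [f for f in fields if f"{col}.{f}" in needed]
            if kept:
                pruned[col] = kept
    return pruned

# The filter-after-explode query only touches struct.v1 and array, so the
# expected read schema keeps just those leaves (v2 and v3 are dropped):
event_schema = {"struct": ["v1", "v2", "v3"], "array": None}
expected = prune(event_schema, {"struct.v1", "array"})
```

The bug reports amount to Spark failing to reach this pruned schema whenever a filter on the exploded column, or a preceding select, sits between the nested field reference and the scan.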
[jira] [Commented] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
[ https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438147#comment-17438147 ] Apache Spark commented on SPARK-37077: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/34478 > Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken > - > > Key: SPARK-37077 > URL: https://issues.apache.org/jira/browse/SPARK-37077 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > During migration from stubs to inline annotations, variants taking {{RDD}} > where incorrectly remove. As a result > > {code:python} > from pyspark.sql import SQLContext, SparkSession > from pyspark import SparkContext > sc = SparkContext.getOrCreate() > sqlContext= SQLContext(sc) > sqlContext.createDataFrame(sc.parallelize([(1, 2)])) > {code} > although valid, no longer type checks > {code} > main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" > matches argument type "RDD[Tuple[int, int]]" > main.py:7: note: Possible overload variants: > main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] > createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] > = ...) -> DataFrame > main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] > createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], > Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame > main.py:7: note: def createDataFrame(self, data: DataFrameLike, > samplingRatio: Optional[float] = ...) 
-> DataFrame > main.py:7: note: <3 more non-matching overloads not shown> > Found 1 error in 1 file (checked 1 source file) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
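The mypy output above is the classic signature of a dropped {{@overload}} variant: once the {{RDD}} overload is gone, no remaining signature accepts an {{RDD}} argument. A minimal, self-contained Python sketch of the pattern (the classes below are stand-ins for illustration, not the real PySpark implementations):

```python
from typing import Iterable, Optional, overload

class RDD:  # stand-in for pyspark.rdd.RDD
    def __init__(self, data):
        self.data = list(data)

class DataFrame:  # stand-in for pyspark.sql.DataFrame
    def __init__(self, rows):
        self.rows = rows

class SQLContext:
    # With both overloads declared, mypy accepts either an Iterable of rows
    # or an RDD. Deleting the second stub reproduces the reported error.
    @overload
    def createDataFrame(self, data: Iterable[tuple],
                        samplingRatio: Optional[float] = ...) -> DataFrame: ...
    @overload
    def createDataFrame(self, data: RDD,
                        samplingRatio: Optional[float] = ...) -> DataFrame: ...

    def createDataFrame(self, data, samplingRatio=None):
        # Single runtime implementation behind the typing-only overloads.
        rows = data.data if isinstance(data, RDD) else list(data)
        return DataFrame(rows)

df = SQLContext().createDataFrame(RDD([(1, 2)]))
print(df.rows)  # [(1, 2)]
```

At runtime the {{@overload}}-decorated stubs are discarded; they exist only so type checkers can match each call site against a specific signature.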
[jira] [Updated] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)
[ https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Kotlov updated SPARK-37201: -- Description: In this example, reading unnecessary nested fields still happens. Data preparation: {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, array: Seq[String]) Seq( Event(Struct("v1","v2","v3"), Seq("cx1", "cx2")) ).toDF().write.mode("overwrite").saveAsTable("table") {code} v2 and v3 columns aren't needed here, but still exist in the physical plan. {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) == Physical Plan == ... ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>> {code} If you just remove _filter_ or move _explode_ to the second _select_, everything is fine: {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) //.filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> spark.table("table") .select($"struct.v1", $"array") .select($"v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> {code} *Yet another example: left_anti join after double select:* {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, field1: String, field2: String) Seq( Event(Struct("v1","v2","v3"), "fld1", "fld2") ).toDF().write.mode("overwrite").saveAsTable("table") val joinDf = Seq("id1").toDF("id") spark.table("table") .select("struct", "field1") .select($"struct.v1", $"field1") .join(joinDf, $"field1" === joinDf("id"), "left_anti") .explain(true) // ===> ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,field1:string> {code} Instead of the first select, it can be other types of manipulations with the original df, for example {{.withColumn("field3", lit("f3"))}} or .drop("field2"), which will also lead to reading unnecessary nested fields from _struct_. But if you just remove the first select or change the type of join, reading nested fields will be correct: {code:java} // .select("struct", "field1") ===> ReadSchema: struct<struct:struct<v1:string>,field1:string> .join(joinDf, $"field1" === joinDf("id"), "left") ===> ReadSchema: struct<struct:struct<v1:string>,field1:string> {code} PS: The first select might look strange in the context of this example, but in a real system, it might be part of a common API that other parts of the system use with their own expressions on top of it. was: In this example, reading unnecessary nested fields still happens. Data preparation: {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, array: Seq[String]) Seq( Event(Struct("v1","v2","v3"), Seq("cx1", "cx2")) ).toDF().write.mode("overwrite").saveAsTable("table") {code} v2 and v3 columns aren't needed here, but still exist in the physical plan. {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) == Physical Plan == ... ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>> {code} If you just remove _filter_ or move _explode_ to the second _select_, everything is fine: {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) //.filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> spark.table("table") .select($"struct.v1", $"array") .select($"v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> {code} *Yet another example: left_anti join after double select:* {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, field1: String, field2: String) Seq( Event(Struct("v1","v2","v3"), "fld1", "fld2") ).toDF().write.mode("overwrite").saveAsTable("table") val joinDf = Seq("id1").toDF("id") spark.table("table") .select("struct", "field1") .select($"struct.v1", $"field1") .join(joinDf, $"field1" === joinDf("id"), "left_anti") .explain(true) // ===> ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,field1:string> {code} If you just remove the first select or change the type of join, reading nested fields will be correct: {code:java} // .select("struct", "field1") ===> ReadSchema: struct<struct:struct<v1:string>,field1:string> .join(joinDf, $"field1" === joinDf("id"), "left") ===> ReadSchema: struct<struct:struct<v1:string>,field1:string> {code} PS: The first select might look strange in the context of this example, but in a real system, it might be part of a common API that other parts of the system use with their own expressions on top of it. > Spark SQL reads unnecessary nested fields (filter after explode) > > > Key: SPARK-37201 > URL: https://issues.apache.org/jira/browse/SPARK-37201 > Project: Spark > Issue
[jira] [Commented] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438142#comment-17438142 ] Apache Spark commented on SPARK-36894: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/34478 > RDD.toDF should be synchronized with dispatched variants of > SparkSession.createDataFrame > > > Key: SPARK-36894 > URL: https://issues.apache.org/jira/browse/SPARK-36894 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 3.3.0 > > > There are some variants that are supported: > * Providing a schema as a {{str}} object for {{RDD[RowLike]}} objects > * Providing a schema as a {{Tuple[str, ...]}} names > * Calling {{toDF}} on {{RDD}} of atomic values, when schema of {{str}} or > {{AtomicType}} is provided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
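The variant list above amounts to runtime dispatch on the type of the schema argument: a DDL-style string, a sequence of column names, or nothing at all. A rough Python sketch of that dispatch idea (a hypothetical helper for illustration, not the actual PySpark code paths):

```python
# Hypothetical sketch of schema-argument dispatch, loosely modelled on the
# variants listed above; to_df is not a real PySpark function.
def to_df(rows, schema=None):
    if schema is None:
        # No schema: synthesize positional column names, like _1, _2, ...
        names = [f"_{i + 1}" for i in range(len(rows[0]))]
    elif isinstance(schema, str):
        # "a int, b string"-style DDL string: keep only the column names here
        names = [part.strip().split()[0] for part in schema.split(",")]
    elif isinstance(schema, (list, tuple)):
        # Sequence of column names
        names = list(schema)
    else:
        raise TypeError(f"unsupported schema type: {type(schema).__name__}")
    # Represent each "row" as a name -> value mapping
    return [dict(zip(names, row)) for row in rows]

print(to_df([(1, "x")], "a int, b string"))  # [{'a': 1, 'b': 'x'}]
```

Keeping {{RDD.toDF}} and {{SparkSession.createDataFrame}} in sync means both entry points must accept the same set of schema shapes and route them through the same logic.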
[jira] [Commented] (SPARK-37204) Update Apache Parent POM version to 24 in the pom.xml
[ https://issues.apache.org/jira/browse/SPARK-37204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438131#comment-17438131 ] Janardhan Pulivarthi commented on SPARK-37204: -- Hi, I am new here. Can I work on this? > Update Apache Parent POM version to 24 in the pom.xml > - > > Key: SPARK-37204 > URL: https://issues.apache.org/jira/browse/SPARK-37204 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: Janardhan Pulivarthi >Priority: Minor > > Some nice things about version 24: > 1. Deploy SHA-512 for source-release to Remote Repository > 2. Reproducible builds option > > Resources: > [1] [https://lists.apache.org/thread/9wk97dwjlcoxlk1onxotfo8k98b2v0sk] > [2] [https://maven.apache.org/guides/mini/guide-reproducible-builds.html] > [3] > [https://github.com/apache/maven-apache-parent/compare/apache-18...apache-24] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37204) Update Apache Parent POM version to 24 in the pom.xml
Janardhan Pulivarthi created SPARK-37204: Summary: Update Apache Parent POM version to 24 in the pom.xml Key: SPARK-37204 URL: https://issues.apache.org/jira/browse/SPARK-37204 Project: Spark Issue Type: Task Components: Build Affects Versions: 3.2.0 Reporter: Janardhan Pulivarthi Some nice things about version 24: 1. Deploy SHA-512 for source-release to Remote Repository 2. Reproducible builds option Resources: [1] [https://lists.apache.org/thread/9wk97dwjlcoxlk1onxotfo8k98b2v0sk] [2] [https://maven.apache.org/guides/mini/guide-reproducible-builds.html] [3] [https://github.com/apache/maven-apache-parent/compare/apache-18...apache-24] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37204) Update Apache Parent POM version to 24 in the pom.xml
[ https://issues.apache.org/jira/browse/SPARK-37204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Janardhan Pulivarthi updated SPARK-37204: - Priority: Minor (was: Major) > Update Apache Parent POM version to 24 in the pom.xml > - > > Key: SPARK-37204 > URL: https://issues.apache.org/jira/browse/SPARK-37204 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: Janardhan Pulivarthi >Priority: Minor > > Some nice things about version 24: > 1. Deploy SHA-512 for source-release to Remote Repository > 2. Reproducible builds option > > Resources: > [1] [https://lists.apache.org/thread/9wk97dwjlcoxlk1onxotfo8k98b2v0sk] > [2] [https://maven.apache.org/guides/mini/guide-reproducible-builds.html] > [3] > [https://github.com/apache/maven-apache-parent/compare/apache-18...apache-24] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)
[ https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Kotlov updated SPARK-37201: -- Description: In this example, reading unnecessary nested fields still happens. Data preparation: {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, array: Seq[String]) Seq( Event(Struct("v1","v2","v3"), Seq("cx1", "cx2")) ).toDF().write.mode("overwrite").saveAsTable("table") {code} v2 and v3 columns aren't needed here, but still exist in the physical plan. {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) == Physical Plan == ... ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>> {code} If you just remove _filter_ or move _explode_ to the second _select_, everything is fine: {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) //.filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> spark.table("table") .select($"struct.v1", $"array") .select($"v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> {code} *Yet another example: left_anti join after double select:* {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, field1: String, field2: String) Seq( Event(Struct("v1","v2","v3"), "fld1", "fld2") ).toDF().write.mode("overwrite").saveAsTable("table") val joinDf = Seq("id1").toDF("id") spark.table("table") .select("struct", "field1") .select($"struct.v1", $"field1") .join(joinDf, $"field1" === joinDf("id"), "left_anti") .explain(true) // ===> ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,field1:string> {code} If you just remove the first select or change the type of join, reading nested fields will be correct: {code:java} // .select("struct", "field1") ===> ReadSchema: struct<struct:struct<v1:string>,field1:string> .join(joinDf, $"field1" === joinDf("id"), "left") ===> ReadSchema: struct<struct:struct<v1:string>,field1:string> {code} PS: The first select might look strange in the context of this example, but in a real system, it might be part of a common API that other parts of the system use with their own expressions on top of it. was: In this example, reading unnecessary nested fields still happens. Data preparation: {code:java} case class Struct(v1: String, v2: String, v3: String) case class Event(struct: Struct, array: Seq[String]) Seq( Event(Struct("v1","v2","v3"), Seq("cx1", "cx2")) ).toDF().write.mode("overwrite").saveAsTable("table") {code} v2 and v3 columns aren't needed here, but still exist in the physical plan. {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) == Physical Plan == ... ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>> {code} If you just remove _filter_ or move _explode_ to the second _select_, everything is fine: {code:java} spark.table("table") .select($"struct.v1", explode($"array").as("el")) //.filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> spark.table("table") .select($"struct.v1", $"array") .select($"v1", explode($"array").as("el")) .filter($"el" === "cx1") .explain(true) // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> {code} > Spark SQL reads unnecessary nested fields (filter after explode) > > > Key: SPARK-37201 > URL: https://issues.apache.org/jira/browse/SPARK-37201 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Sergey Kotlov >Priority: Major > > In this example, reading unnecessary nested fields still happens. > Data preparation: > {code:java} > case class Struct(v1: String, v2: String, v3: String) > case class Event(struct: Struct, array: Seq[String]) > Seq( > Event(Struct("v1","v2","v3"), Seq("cx1", "cx2")) > ).toDF().write.mode("overwrite").saveAsTable("table") > {code} > v2 and v3 columns aren't needed here, but still exist in the physical plan. > {code:java} > spark.table("table") > .select($"struct.v1", explode($"array").as("el")) > .filter($"el" === "cx1") > .explain(true) > > == Physical Plan == > ... ReadSchema: > struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>> > {code} > If you just remove _filter_ or move _explode_ to the second _select_, everything > is fine: > {code:java} > spark.table("table") > .select($"struct.v1", explode($"array").as("el")) > //.filter($"el" === "cx1") > .explain(true) > > // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> > spark.table("table") > .select($"struct.v1", $"array") > .select($"v1", explode($"array").as("el")) > .filter($"el" === "cx1") > .explain(true) > > // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>> > {code} > > *Yet another example: left_anti join after double
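What the reporter expects here is Spark's nested schema pruning: the optimizer collects the nested field paths that the query actually references and narrows the read schema to just those leaves. The bug is that this pruning does not kick in for certain plan shapes (a filter after explode, or a left_anti join after a double select). A toy Python sketch of the pruning step itself (illustrative only, not Spark's optimizer code; the dict-based schema is an assumption of this sketch):

```python
# Prune a struct schema (dict of field -> sub-schema dict, or None for a
# leaf) down to only the dotted field paths an expression tree references.
def prune(schema, referenced):
    pruned = {}
    for path in referenced:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue  # unknown field: nothing to keep
        sub = schema[head]
        if rest and isinstance(sub, dict):
            # Recurse into the struct with the remainder of the path
            child = prune(sub, [rest])
            pruned.setdefault(head, {}).update(child)
        else:
            # Leaf (or whole-field) reference: keep the field as-is
            pruned[head] = sub
    return pruned

full = {"struct": {"v1": None, "v2": None, "v3": None}, "array": None}
print(prune(full, ["struct.v1", "array"]))
# {'struct': {'v1': None}, 'array': None}
```

In the working cases above, Spark effectively performs this narrowing ({{struct}} shrinks to just {{v1}}); in the broken cases it falls back to reading the full {{struct}}.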
[jira] [Commented] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
[ https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438089#comment-17438089 ] Apache Spark commented on SPARK-37077: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34477 > Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken > - > > Key: SPARK-37077 > URL: https://issues.apache.org/jira/browse/SPARK-37077 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > During migration from stubs to inline annotations, variants taking {{RDD}} > were incorrectly removed. As a result > > {code:python} > from pyspark.sql import SQLContext, SparkSession > from pyspark import SparkContext > sc = SparkContext.getOrCreate() > sqlContext = SQLContext(sc) > sqlContext.createDataFrame(sc.parallelize([(1, 2)])) > {code} > although valid, no longer type checks > {code} > main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" > matches argument type "RDD[Tuple[int, int]]" > main.py:7: note: Possible overload variants: > main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] > createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] > = ...) -> DataFrame > main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] > createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], > Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame > main.py:7: note: def createDataFrame(self, data: DataFrameLike, > samplingRatio: Optional[float] = ...) 
-> DataFrame > main.py:7: note: <3 more non-matching overloads not shown> > Found 1 error in 1 file (checked 1 source file) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438085#comment-17438085 ] Apache Spark commented on SPARK-36894: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34477 > RDD.toDF should be synchronized with dispatched variants of > SparkSession.createDataFrame > > > Key: SPARK-36894 > URL: https://issues.apache.org/jira/browse/SPARK-36894 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 3.3.0 > > > There are some variants that are supported: > * Providing a schema as a {{str}} object for {{RDD[RowLike]}} objects > * Providing a schema as a {{Tuple[str, ...]}} names > * Calling {{toDF}} on {{RDD}} of atomic values, when schema of {{str}} or > {{AtomicType}} is provided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37030) Maven build failed in windows!
[ https://issues.apache.org/jira/browse/SPARK-37030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shockang resolved SPARK-37030. -- Resolution: Done > Maven build failed in windows! > -- > > Key: SPARK-37030 > URL: https://issues.apache.org/jira/browse/SPARK-37030 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 > Environment: OS: Windows 10 Professional > OS Version: 21H1 > Maven Version: 3.6.3 > >Reporter: Shockang >Priority: Minor > Fix For: 3.2.0 > > Attachments: image-2021-10-17-22-18-16-616.png > > > I pulled the latest Spark master code on my local Windows 10 computer and > executed the following command: > {code:java} > mvn -DskipTests clean install{code} > Build failed! > !image-2021-10-17-22-18-16-616.png! > {code:java} > Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run > (default) on project spark-core_2.12: An Ant BuildException has occured: > Execute failed: java.io.IOException: Cannot run program "bash" (in directory > "C:\bigdata\spark\core"): CreateProcess error=2{code} > It seems that the plugin maven-antrun-plugin cannot run because Windows has > no bash. > The following code comes from pom.xml in the spark-core module. > {code:xml} > <plugin> > <groupId>org.apache.maven.plugins</groupId> > <artifactId>maven-antrun-plugin</artifactId> > <executions> > <execution> > <phase>generate-resources</phase> > <configuration> <!-- the Ant target that shells out to bash was stripped by the mail archive --> </configuration> > <goals> > <goal>run</goal> > </goals> > </execution> > </executions> > </plugin> > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37030) Maven build failed in windows!
[ https://issues.apache.org/jira/browse/SPARK-37030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438017#comment-17438017 ] Shockang commented on SPARK-37030: -- [~hyukjin.kwon] Thank you for your suggestion. This problem has been solved. > Maven build failed in windows! > -- > > Key: SPARK-37030 > URL: https://issues.apache.org/jira/browse/SPARK-37030 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 > Environment: OS: Windows 10 Professional > OS Version: 21H1 > Maven Version: 3.6.3 > >Reporter: Shockang >Priority: Minor > Fix For: 3.2.0 > > Attachments: image-2021-10-17-22-18-16-616.png > > > I pulled the latest Spark master code on my local Windows 10 computer and > executed the following command: > {code:java} > mvn -DskipTests clean install{code} > Build failed! > !image-2021-10-17-22-18-16-616.png! > {code:java} > Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run > (default) on project spark-core_2.12: An Ant BuildException has occured: > Execute failed: java.io.IOException: Cannot run program "bash" (in directory > "C:\bigdata\spark\core"): CreateProcess error=2{code} > It seems that the plugin maven-antrun-plugin cannot run because Windows has > no bash. > The following code comes from pom.xml in the spark-core module. > {code:xml} > <plugin> > <groupId>org.apache.maven.plugins</groupId> > <artifactId>maven-antrun-plugin</artifactId> > <executions> > <execution> > <phase>generate-resources</phase> > <configuration> <!-- the Ant target that shells out to bash was stripped by the mail archive --> </configuration> > <goals> > <goal>run</goal> > </goals> > </execution> > </executions> > </plugin> > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
[ https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-37077. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34146 [https://github.com/apache/spark/pull/34146] > Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken > - > > Key: SPARK-37077 > URL: https://issues.apache.org/jira/browse/SPARK-37077 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > During migration from stubs to inline annotations, variants taking {{RDD}} > were incorrectly removed. As a result > > {code:python} > from pyspark.sql import SQLContext, SparkSession > from pyspark import SparkContext > sc = SparkContext.getOrCreate() > sqlContext = SQLContext(sc) > sqlContext.createDataFrame(sc.parallelize([(1, 2)])) > {code} > although valid, no longer type checks: > {code} > main.py:7: error: No overload variant of "createDataFrame" of "SQLContext" > matches argument type "RDD[Tuple[int, int]]" > main.py:7: note: Possible overload variants: > main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] > createDataFrame(self, data: Iterable[RowLike], samplingRatio: Optional[float] > = ...) -> DataFrame > main.py:7: note: def [RowLike in (List[Any], Tuple[Any, ...], Row)] > createDataFrame(self, data: Iterable[RowLike], schema: Union[List[str], > Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame > main.py:7: note: def createDataFrame(self, data: DataFrameLike, > samplingRatio: Optional[float] = ...) -> DataFrame > main.py:7: note: <3 more non-matching overloads not shown> > Found 1 error in 1 file (checked 1 source file) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
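The regression is a property of Python's {{typing.overload}}: deleting one {{@overload}} stub makes calls that match only that variant fail under mypy, even though the runtime implementation is unchanged. A self-contained sketch of the mechanism (class and function names here are illustrative stand-ins, not PySpark's):

```python
from typing import Iterable, List, Tuple, Union, overload

class FakeRDD:
    """Illustrative stand-in for pyspark.RDD (not the real class)."""
    def __init__(self, data: Iterable[Tuple[int, int]]):
        self.data = list(data)

@overload
def create_dataframe(data: Iterable[tuple]) -> List[tuple]: ...
@overload
def create_dataframe(data: FakeRDD) -> List[tuple]: ...
# ^ Dropping this second stub reproduces the bug: mypy then rejects
#   create_dataframe(FakeRDD(...)) with "No overload variant matches",
#   although the call still works at runtime.

def create_dataframe(data: Union[Iterable[tuple], FakeRDD]) -> List[tuple]:
    # Runtime dispatch is independent of the overload stubs.
    return data.data if isinstance(data, FakeRDD) else list(data)
```

With both stubs present, {{create_dataframe(FakeRDD([(1, 2)]))}} type checks; with only the first, mypy reports an error of the same shape as the one quoted above.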
[jira] [Resolved] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-36894. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34146 [https://github.com/apache/spark/pull/34146] > RDD.toDF should be synchronized with dispatched variants of > SparkSession.createDataFrame > > > Key: SPARK-36894 > URL: https://issues.apache.org/jira/browse/SPARK-36894 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 3.3.0 > > > There are some variants that are supported: > * Providing a schema as a {{str}} object for {{RDD[RowLike]}} objects > * Providing column names as a {{Tuple[str, ...]}} > * Calling {{toDF}} on an {{RDD}} of atomic values, when a schema of {{str}} or > {{AtomicType}} is provided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
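The variants listed above amount to dispatching on the type of the schema argument. A minimal sketch of that dispatch in plain Python (not PySpark's actual implementation; {{AtomicType}} here is a stand-in for the pyspark.sql.types classes):

```python
class AtomicType:
    """Stand-in for pyspark.sql.types atomic types such as IntegerType."""
    def __init__(self, name: str):
        self.name = name

def resolve_schema(schema):
    """Normalize the schema forms a toDF/createDataFrame-like API accepts."""
    if isinstance(schema, str):
        return ("ddl", schema)            # DDL string, e.g. "a INT, b INT"
    if isinstance(schema, tuple) and all(isinstance(n, str) for n in schema):
        return ("names", list(schema))    # column names only, types inferred
    if isinstance(schema, AtomicType):
        return ("atomic", schema.name)    # single-column schema of atomic values
    raise TypeError(f"unsupported schema specification: {schema!r}")
```

Keeping {{toDF}} and {{createDataFrame}} synchronized then means both entry points accept the same set of branches.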
[jira] [Assigned] (SPARK-37077) Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
[ https://issues.apache.org/jira/browse/SPARK-37077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-37077: -- Assignee: Maciej Szymkiewicz > Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
[jira] [Assigned] (SPARK-36894) RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-36894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-36894: -- Assignee: Maciej Szymkiewicz > RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame
[jira] [Commented] (SPARK-37195) Unify v1 and v2 SHOW TBLPROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-37195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437871#comment-17437871 ] Apache Spark commented on SPARK-37195: -- User 'Peng-Lei' has created a pull request for this issue: https://github.com/apache/spark/pull/34476 > Unify v1 and v2 SHOW TBLPROPERTIES tests > - > > Key: SPARK-37195 > URL: https://issues.apache.org/jira/browse/SPARK-37195 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > Unify v1 and v2 SHOW TBLPROPERTIES tests -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
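Unifying v1 and v2 tests typically means moving the shared checks into a base suite that each catalog version subclasses. A sketch of that pattern using Python's unittest (names are illustrative; Spark's actual unified suites are Scala traits):

```python
import unittest

class ShowTblPropertiesSuiteBase:
    """Checks shared by both catalog versions; subclasses supply the
    catalog under test via make_catalog()."""
    def make_catalog(self):
        raise NotImplementedError

    def test_show_returns_set_property(self):
        catalog = self.make_catalog()
        catalog.set_property("t", "key", "value")
        assert catalog.show_properties("t") == {"key": "value"}

class InMemoryCatalog:
    """Toy catalog standing in for a v1 or v2 implementation."""
    def __init__(self):
        self._props = {}
    def set_property(self, table, key, value):
        self._props.setdefault(table, {})[key] = value
    def show_properties(self, table):
        return dict(self._props.get(table, {}))

class V2SuiteLike(ShowTblPropertiesSuiteBase, unittest.TestCase):
    # A second subclass wired to the other catalog version would reuse
    # the same shared tests unchanged.
    def make_catalog(self):
        return InMemoryCatalog()
```

The payoff is that each behavioural check is written once and any divergence between the two catalog implementations shows up as a failure in exactly one subclass.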
[jira] [Assigned] (SPARK-37195) Unify v1 and v2 SHOW TBLPROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-37195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37195: Assignee: Apache Spark > Unify v1 and v2 SHOW TBLPROPERTIES tests
[jira] [Assigned] (SPARK-37195) Unify v1 and v2 SHOW TBLPROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-37195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37195: Assignee: (was: Apache Spark) > Unify v1 and v2 SHOW TBLPROPERTIES tests
[jira] [Commented] (SPARK-37195) Unify v1 and v2 SHOW TBLPROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-37195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437869#comment-17437869 ] Apache Spark commented on SPARK-37195: -- User 'Peng-Lei' has created a pull request for this issue: https://github.com/apache/spark/pull/34476 > Unify v1 and v2 SHOW TBLPROPERTIES tests
[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437830#comment-17437830 ] Oscar Bonilla commented on SPARK-26365: --- I've changed the priority to Major to see if someone can pick it up and fix it. > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Spark Submit >Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0 >Reporter: Oscar Bonilla >Priority: Major > Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, > spark-3.0.0-raise-exception-k8s-failure.patch > > > When launching apps using spark-submit in a kubernetes cluster, if the Spark > application fails (returns exit code = 1, for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know if there's been a problem > with the Spark application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
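The desired behaviour is that spark-submit's own exit code tracks the driver pod's terminal phase instead of always being 0. A sketch of that mapping (illustrative only, not Spark's actual launcher code; the phase names are Kubernetes pod phases):

```python
TERMINAL_PHASES = {"Succeeded", "Failed"}

def exit_code_for_phase(phase: str) -> int:
    """Non-zero for anything other than a clean completion."""
    return 0 if phase == "Succeeded" else 1

def wait_for_driver(phases) -> int:
    """Consume driver pod phases (as a watch/poll loop would observe
    them) until a terminal phase appears, then return the exit code
    spark-submit should propagate to its caller."""
    for phase in phases:
        if phase in TERMINAL_PHASES:
            return exit_code_for_phase(phase)
    return 1  # watch ended without a terminal phase: treat as failure
```

With this shape, CI systems and schedulers that launch spark-submit can rely on its exit code instead of scraping logs or querying the cluster separately.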
[jira] [Updated] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oscar Bonilla updated SPARK-26365: -- Affects Version/s: 3.0.0 3.1.0 > spark-submit for k8s cluster doesn't propagate exit code
[jira] [Updated] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oscar Bonilla updated SPARK-26365: -- Priority: Major (was: Minor) > spark-submit for k8s cluster doesn't propagate exit code
[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437825#comment-17437825 ] Vivien Brissat commented on SPARK-26365: Hi [~oscar.bonilla], this is still not fixed, since I made tests in version 3.1 and found this Jira issue when I looked for a solution to my problem. > spark-submit for k8s cluster doesn't propagate exit code
[jira] [Commented] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-37203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437805#comment-17437805 ] Apache Spark commented on SPARK-37203: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/34474 > Fix NotSerializableException when observe with percentile_approx > > > Key: SPARK-37203 > URL: https://issues.apache.org/jira/browse/SPARK-37203 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > {code:java} > val namedObservation = Observation("named") > val df = spark.range(100) > val observed_df = df.observe( >namedObservation, percentile_approx($"id", lit(0.5), > lit(100)).as("percentile_approx_val")) > observed_df.collect() > namedObservation.get > {code} > throws exception as follows: > {code:java} > 15:16:27.994 ERROR org.apache.spark.util.Utils: Exception encountered > java.io.NotSerializableException: > org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1434) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
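The stack trace shows Java serialization of a {{DirectTaskResult}} failing because the result still carries the raw {{PercentileDigest}} buffer. The same failure mode is easy to reproduce with Python's pickle (an analogy, not Spark code): serialization fails where the result is shipped, far from where the buffer was created, and the fix is to convert the buffer into a serializable value before it travels.

```python
import pickle

class PercentileDigestLike:
    """Stand-in for the non-serializable aggregation buffer."""
    def __reduce__(self):
        raise TypeError("PercentileDigestLike is not serializable")

class TaskResult:
    """Carrying the raw buffer makes the whole result unserializable."""
    def __init__(self, value):
        self.value = value

raw = TaskResult(PercentileDigestLike())
# pickle.dumps(raw) raises TypeError: the error surfaces only at
# serialization time, inside the result-shipping path.
finalized = TaskResult(49.0)  # a plain final value pickles fine
```

The analogue of the fix is evaluating the digest to its final percentile value (a plain number) before placing it in the task result.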
[jira] [Assigned] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-37203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37203: Assignee: (was: Apache Spark) > Fix NotSerializableException when observe with percentile_approx
[jira] [Commented] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-37203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437803#comment-17437803 ] Apache Spark commented on SPARK-37203: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/34474 > Fix NotSerializableException when observe with percentile_approx
[jira] [Assigned] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-37203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37203: Assignee: Apache Spark > Fix NotSerializableException when observe with percentile_approx
[jira] [Commented] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-37203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437780#comment-17437780 ] jiaan.geng commented on SPARK-37203: I'm working on it. > Fix NotSerializableException when observe with percentile_approx
[jira] [Created] (SPARK-37203) Fix NotSerializableException when observe with percentile_approx
jiaan.geng created SPARK-37203: -- Summary: Fix NotSerializableException when observe with percentile_approx Key: SPARK-37203 URL: https://issues.apache.org/jira/browse/SPARK-37203 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: jiaan.geng {code:java} val namedObservation = Observation("named") val df = spark.range(100) val observed_df = df.observe( namedObservation, percentile_approx($"id", lit(0.5), lit(100)).as("percentile_approx_val")) observed_df.collect() namedObservation.get {code} throws exception as follows: {code:java} 15:16:27.994 ERROR org.apache.spark.util.Utils: Exception encountered java.io.NotSerializableException: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1434) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
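The failure mode in the stack traces above is plain Java serialization: `ObjectOutputStream` walks the task-result object graph and throws as soon as it reaches a class that does not implement `Serializable` (here, `ApproximatePercentile$PercentileDigest` carried inside the task result). A minimal, self-contained sketch of that mechanism follows; `Digest` and `TaskResult` are hypothetical stand-ins for `PercentileDigest` and `DirectTaskResult`, not Spark code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
    // Hypothetical stand-in for PercentileDigest: no Serializable marker.
    static class Digest {
        long count;
    }

    // Serializable wrapper holding a non-Serializable field, mirroring how the
    // task result carries the aggregation buffer back to the driver.
    static class TaskResult implements Serializable {
        final Digest digest = new Digest();
    }

    /** Returns true if Java serialization of {@code o} succeeds. */
    static boolean serializes(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            // The graph walk hit a non-Serializable object, as in the bug report.
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(new TaskResult())); // fails: Digest blocks it
        System.out.println(serializes("plain string"));   // succeeds: String is Serializable
    }
}
```

The wrapper being `Serializable` is not enough; every object reachable from it must be serializable too, which is why the exception surfaces only when the aggregation buffer is shipped with the task result.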
[jira] [Resolved] (SPARK-37200) Drop index support
[ https://issues.apache.org/jira/browse/SPARK-37200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-37200. Fix Version/s: 3.3.0 Assignee: Huaxin Gao Resolution: Fixed > Drop index support > -- > > Key: SPARK-37200 > URL: https://issues.apache.org/jira/browse/SPARK-37200 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37179) ANSI mode: Add a config to allow casting between Datetime and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-37179: --- Description: Add a config `spark.sql.ansi.allowCastBetweenDatetimeAndNumeric` to allow casting between Datetime and Numeric. The default value of the configuration is `false`. Also, casting double/float type to timestamp should raise exceptions if there is overflow or the input is NaN/infinite. This is for better adoption of ANSI SQL mode: - As we did some data science, we found that many Spark SQL users are actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. There are also some usages of `Cast(Date as Numeric)`. - The Spark SQL connector for Tableau is using this feature for DateTime math. e.g. `CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS TIMESTAMP)` So, having a new configuration can provide users with an alternative choice on turning on ANSI mode. was: We should allow the casting between Timestamp and Numeric types: * As we did some data science, we found that many Spark SQL users are actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. * The Spark SQL connector for Tableau is using this feature for DateTime math. e.g. {code:java} CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS TIMESTAMP) {code} * In the current syntax, we specifically allow Numeric <=> Boolean and String <=> Binary since they are straightforward and frequently used. I suggest we allow Timestamp <=> Numeric as well for better ANSI mode adoption. 
> ANSI mode: Add a config to allow casting between Datetime and Numeric > - > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > > Add a config `spark.sql.ansi.allowCastBetweenDatetimeAndNumeric` to allow > casting between Datetime and Numeric. The default value of the configuration > is `false`. > Also, casting double/float type to timestamp should raise exceptions if there > is overflow or the input is NaN/infinite. > This is for better adoption of ANSI SQL mode: > - As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > There are also some usages of `Cast(Date as Numeric)`. > - The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > `CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP)` > So, having a new configuration can provide users with an alternative choice > on turning on ANSI mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
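The semantics the config gates, and the DateTime math in the Tableau expression above, can be illustrated outside Spark: casting a timestamp to a numeric yields seconds since the epoch, the reverse cast interprets a numeric as epoch seconds, and `%2 * 86400` adds days in that domain. A minimal sketch in plain Java follows; this is an illustration of the described behaviour, not Spark's implementation, and the NaN/infinite check mirrors the stricter handling the description asks for:

```java
import java.time.Instant;

public class CastDemo {
    // CAST(Timestamp AS Numeric): seconds since the Unix epoch.
    static long timestampToNumeric(Instant ts) {
        return ts.getEpochSecond();
    }

    // CAST(Numeric AS Timestamp): interpret the value as epoch seconds.
    static Instant numericToTimestamp(long seconds) {
        return Instant.ofEpochSecond(seconds);
    }

    // Casting a double raises on NaN/infinite input, as the description
    // requires for the double/float-to-timestamp case.
    static Instant doubleToTimestamp(double seconds) {
        if (Double.isNaN(seconds) || Double.isInfinite(seconds)) {
            throw new ArithmeticException("invalid timestamp value: " + seconds);
        }
        return Instant.ofEpochSecond((long) seconds);
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2021-11-01T00:00:00Z");
        long n = timestampToNumeric(ts);
        // The Tableau pattern: add (%2 * 86400) seconds, i.e. whole days,
        // then cast the numeric result back to a timestamp.
        Instant plusOneDay = numericToTimestamp(n + 86400);
        System.out.println(plusOneDay); // 2021-11-02T00:00:00Z
    }
}
```

Round-tripping through the numeric domain loses sub-second precision in this sketch, which is one reason such casts are worth gating behind an explicit configuration.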
[jira] [Updated] (SPARK-37179) ANSI mode: Add a config to allow casting between Datetime and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-37179: --- Summary: ANSI mode: Add a config to allow casting between Datetime and Numeric (was: ANSI mode: Allow casting between Timestamp and Numeric) > ANSI mode: Add a config to allow casting between Datetime and Numeric > - > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > > We should allow the casting between Timestamp and Numeric types: > * As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specifically allow Numeric <=> Boolean and String > <=> Binary since they are straightforward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37179. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34459 [https://github.com/apache/spark/pull/34459] > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > > We should allow the casting between Timestamp and Numeric types: > * As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specifically allow Numeric <=> Boolean and String > <=> Binary since they are straightforward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org