[jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
[ https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296471#comment-17296471 ] Pankaj Bhootra commented on SPARK-34648: [~marmbrus] - I understand you were involved in discussions on SPARK-9347. Will you be able to help with this?
> Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
>
> Key: SPARK-34648
> URL: https://issues.apache.org/jira/browse/SPARK-34648
> Project: Spark
> Issue Type: Question
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Pankaj Bhootra
> Priority: Major
>
> Hello Team
> I am new to Spark and this question may be a duplicate of the issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
> We have a large dataset partitioned by calendar date, and within each date partition we store the data as *parquet* files in 128 parts.
> We are trying to run aggregation on this dataset for 366 dates at a time with Spark SQL on Spark version 2.3.0, hence our Spark job is reading 366*128=46848 partitions, all of which are parquet files. There are currently no *_metadata* or *_common_metadata* files available for this dataset.
> The problem we are facing is that when we run *spark.read.parquet* on the above 46848 partitions, our data reads are extremely slow. It takes a long time to run even a simple map task (no shuffling) without any aggregation or group by.
> I read through the above issue and I think I generally understand the ideas around the *_common_metadata* file. But that issue was raised for Spark 1.3.1, and for Spark 2.3.0 I have not found any documentation related to this metadata file so far.
> I would like to clarify:
> # What is the latest best practice for reading a large number of parquet files efficiently?
> # Does this involve using any additional options with spark.read.parquet? How would that work?
> # Are there other possible reasons for slow data reads apart from reading metadata for every part? We are basically trying to migrate our existing Spark pipeline from CSV files to parquet, but from my hands-on experience so far, parquet's read time seems slower than CSV's. This seems contradictory to the popular opinion that parquet performs better in terms of both computation and storage.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
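[Editor's note] A scan of 366 dates x 128 part-files pays a per-file cost (listing plus a parquet footer read) 46848 times, which is usually where the slowdown lives. The sketch below is illustrative, not a definitive fix: the warehouse path and the `spark` session are hypothetical, while the two config names are real Spark SQL options whose usefulness depends on the storage layer.

```python
# Sketch: tuning knobs that commonly matter for many-file parquet scans.
# The runnable part below is plain Python; the Spark call is shown in comments.

num_dates = 366
parts_per_date = 128
num_files = num_dates * parts_per_date  # one footer read per file
assert num_files == 46848

tuning = {
    # Do not reconcile schemas across all 46848 footers; trust one file's schema.
    "spark.sql.parquet.mergeSchema": "false",
    # Above this many input paths, Spark 2.x performs file listing as a
    # distributed job instead of serially on the driver (default: 32).
    "spark.sql.sources.parallelPartitionDiscovery.threshold": "32",
}

# With a live SparkSession named `spark`, the read would look roughly like:
#   df = (spark.read
#             .option("mergeSchema", "false")
#             .parquet("/warehouse/events"))   # hypothetical root path
```

Reading the partition root once (rather than 46848 individual paths) lets partition discovery and the distributed listing path do the work.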
[jira] [Created] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
Pankaj Bhootra created SPARK-34648: -- Summary: Reading Parquet Files in Spark Extremely Slow for Large Number of Files? Key: SPARK-34648 URL: https://issues.apache.org/jira/browse/SPARK-34648 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.3.0 Reporter: Pankaj Bhootra
[jira] [Assigned] (SPARK-34647) Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes
[ https://issues.apache.org/jira/browse/SPARK-34647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34647: Assignee: Apache Spark
> Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes
>
> Key: SPARK-34647
> URL: https://issues.apache.org/jira/browse/SPARK-34647
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: Dongjoon Hyun
> Assignee: Apache Spark
> Priority: Major
[jira] [Commented] (SPARK-34647) Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes
[ https://issues.apache.org/jira/browse/SPARK-34647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296443#comment-17296443 ] Apache Spark commented on SPARK-34647: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/31762
[jira] [Assigned] (SPARK-34647) Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes
[ https://issues.apache.org/jira/browse/SPARK-34647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34647: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-34647) Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes
Dongjoon Hyun created SPARK-34647: - Summary: Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes Key: SPARK-34647 URL: https://issues.apache.org/jira/browse/SPARK-34647 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-30303) Percentile will hit match err when percentage is null
[ https://issues.apache.org/jira/browse/SPARK-30303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-30303. - Fix Version/s: 3.0.0 Resolution: Fixed
> Percentile will hit match err when percentage is null
>
> Key: SPARK-30303
> URL: https://issues.apache.org/jira/browse/SPARK-30303
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Kent Yao
> Priority: Major
> Fix For: 3.0.0
>
> {code:java}
> spark-sql> select percentile(1, null);
> 19/12/19 16:35:04 ERROR SparkSQLDriver: Failed in [select percentile(1, null)]
> scala.MatchError: null
> spark-sql> select percentile(1, array(0.0, null)); -- null is illegal but treat as zero
> {code}
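[Editor's note] The `scala.MatchError: null` above means the percentage argument evaluated to null and fell through an exhaustive-looking pattern match. As a rough illustration of the fix's intent only (this is not Spark's code, and the function name is invented), a null-rejecting validation of the percentage argument might look like:

```python
def validate_percentage(p):
    # Reject a null percentage up front instead of letting it reach an
    # unhandled case (the analogue of Scala's "scala.MatchError: null").
    if p is None:
        raise ValueError("percentage must not be null")
    if isinstance(p, list):
        # percentile(col, array(...)) form: validate every element, so that
        # array(0.0, null) is rejected instead of silently treated as zero.
        return [validate_percentage(x) for x in p]
    if not 0.0 <= p <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    return p
```

With this shape, `select percentile(1, null)` maps to a clear error message rather than an internal MatchError.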
[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark
[ https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-33539: - Description: In the SPIP: Standardize Exception Messages in Spark, there are three major improvements proposed: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. A general rule of thumb for grouping exceptions: * AnalysisException => QueryCompilationErrors * SparkException, RuntimeException (UnsupportedOperationException, IllegalStateException...) => QueryExecutionErrors * ParseException => QueryParsingErrors (A special case for {{Command}}: since commands are executed eagerly, users can immediately see the errors even if the exceptions are thrown in {{SparkPlan.execute}}. In this case, we should group the exceptions in commands as QueryCompilationErrors.) Here is an example PR that groups all `AnalysisException` in Analyzer into QueryCompilationErrors: [SPARK-32670|https://github.com/apache/spark/pull/29497] Please see the [SPIP|https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] for more details. Exceptions per component: ||component||exception|| |sql|714| |core|334| |mllib|161| |streaming|43| |resource-managers|42| |external|35| |examples|20| |mllib-local|10| |graphx|6| |repl|5| |hadoop-cloud|1|
was: In the SPIP: Standardize Exception Messages in Spark, there are three major improvements proposed: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. A general rule of thumb for grouping exceptions: * AnalysisException => QueryCompilationErrors * SparkException, RuntimeException (UnsupportedOperationException, IllegalStateException...) => QueryExecutionErrors * ParseException => QueryParsingErrors (A special case for `Command`: since commands are executed eagerly, users can immediately see the errors even if the exceptions are thrown in `SparkPlan.execute`. In this case, we should group the exceptions in commands as QueryCompilationErrors.) Here is an example PR that groups all `AnalysisException` in Analyzer into QueryCompilationErrors: [SPARK-32670|https://github.com/apache/spark/pull/29497] Please see the [SPIP|https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] for more details. Exceptions per component: ||component||exception|| |sql|714| |core|334| |mllib|161| |streaming|43| |resource-managers|42| |external|35| |examples|20| |mllib-local|10| |graphx|6| |repl|5| |hadoop-cloud|1|
> Standardize exception messages in Spark
>
> Key: SPARK-33539
> URL: https://issues.apache.org/jira/browse/SPARK-33539
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 3.1.0
> Reporter: Allison Wang
> Priority: Major
[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark
[ https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-33539: - Description: In the SPIP: Standardize Exception Messages in Spark, there are three major improvements proposed: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. A general rule of thumb for grouping exceptions: * AnalysisException => QueryCompilationErrors * SparkException, RuntimeException (UnsupportedOperationException, IllegalStateException...) => QueryExecutionErrors * ParseException => QueryParsingErrors (A special case for `Command`: since commands are executed eagerly, users can immediately see the errors even if the exceptions are thrown in `SparkPlan.execute`. In this case, we should group the exceptions in commands as QueryCompilationErrors.) Here is an example PR that groups all `AnalysisException` in Analyzer into QueryCompilationErrors: [SPARK-32670|https://github.com/apache/spark/pull/29497] Please see the [SPIP|https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] for more details. Exceptions per component: ||component||exception|| |sql|714| |core|334| |mllib|161| |streaming|43| |resource-managers|42| |external|35| |examples|20| |mllib-local|10| |graphx|6| |repl|5| |hadoop-cloud|1|
was: In the SPIP: Standardize Exception Messages in Spark, there are three major improvements proposed: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. A general rule of thumb for grouping exceptions: * AnalysisException => QueryCompilationErrors * SparkException, RuntimeException (UnsupportedOperationException, IllegalStateException...) => QueryExecutionErrors * ParseException => QueryParsingErrors Here is an example PR that groups all `AnalysisException` in Analyzer into QueryCompilationErrors: [SPARK-32670|https://github.com/apache/spark/pull/29497] Please see the [SPIP|https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] for more details. Exceptions per component: ||component||exception|| |sql|714| |core|334| |mllib|161| |streaming|43| |resource-managers|42| |external|35| |examples|20| |mllib-local|10| |graphx|6| |repl|5| |hadoop-cloud|1|
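[Editor's note] The grouping rule of thumb in the description above amounts to a lookup table with one special case. The sketch below is illustrative only (it is not Spark code; the function name `error_file` is invented); the exception and file names are taken verbatim from the description.

```python
# Illustrative mapping of exception types to the dedicated error files
# proposed in the SPIP.
GROUPING = {
    "AnalysisException": "QueryCompilationErrors",
    "SparkException": "QueryExecutionErrors",
    "UnsupportedOperationException": "QueryExecutionErrors",
    "IllegalStateException": "QueryExecutionErrors",
    "ParseException": "QueryParsingErrors",
}

def error_file(exception_name, thrown_from_command=False):
    # Special case from the description: commands execute eagerly, so even
    # exceptions raised in SparkPlan.execute surface at "compile time" from
    # the user's point of view and belong in QueryCompilationErrors.
    if thrown_from_command:
        return "QueryCompilationErrors"
    return GROUPING[exception_name]
```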
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296393#comment-17296393 ] Dongjoon Hyun commented on SPARK-34645: --- Thank you for investigation. I'm looking forward to testing with it. > [K8S] Driver pod stuck in Running state after job completes > --- > > Key: SPARK-34645 > URL: https://issues.apache.org/jira/browse/SPARK-34645 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.2 > Environment: Kubernetes: > {code:java} > Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", > GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", > BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", > Platform:"linux/amd64"} > Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", > GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", > BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", > Platform:"linux/amd64"} > {code} >Reporter: Andy Grove >Priority: Major > > I am running automated benchmarks in k8s, using spark-submit in cluster mode, > so the driver runs in a pod. > When running with Spark 3.0.1 and 3.1.1 everything works as expected and I > see the Spark context being shut down after the job completes. > However, when running with Spark 3.0.2 I do not see the context get shut down > and the driver pod is stuck in the Running state indefinitely. > This is the output I see after job completion with 3.0.1 and 3.1.1 and this > output does not appear with 3.0.2. With 3.0.2 there is no output at all after > the job completes. 
> {code:java} > 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from > shutdown hook > 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped > Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040} > 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at > http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040 > 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting > down all executors > 2021-03-05 20:09:24,600 INFO > k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each > executor to shut down > 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes > client has been closed (this is expected if the application is shutting down.) > 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared > 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped > 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster > stopped > 2021-03-05 20:09:24,752 INFO > scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped > SparkContext > 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called > 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory > /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a > 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory > /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814 > 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system > metrics system... > 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics > system stopped. > 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics > system shutdown complete. 
> {code}
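[Editor's note] The first log line, "Invoking stop() from shutdown hook", is the key signal: Spark stops the context from a JVM shutdown hook, and when that hook never fires, the driver process (and therefore the pod) stays alive. A minimal Python analogue of the mechanism, for illustration only (this is not Spark code; `atexit` stands in for the JVM's ShutdownHookManager):

```python
import atexit

events = []

def stop_context():
    # Stand-in for SparkContext.stop(), which Spark registers as a
    # JVM shutdown hook.
    events.append("SparkContext stopped")

atexit.register(stop_context)

# If the hook runs at process exit, the context stops, the driver JVM
# terminates, and Kubernetes marks the pod Completed. The 3.0.2 symptom
# is the hook never firing, so the pod stays Running indefinitely.
atexit.unregister(stop_context)  # demo only: invoke manually instead
stop_context()
assert events == ["SparkContext stopped"]
```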
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296391#comment-17296391 ] Andy Grove commented on SPARK-34645: I can reproduce this with Spark 3.1.1 + JDK 11 as well (and works fine with JDK 8). I will see if I can make a repro case.
[jira] [Resolved] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service
[ https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang resolved SPARK-34601. Resolution: Won't Fix
> Do not delete shuffle file on executor lost event when using remote shuffle service
>
> Key: SPARK-34601
> URL: https://issues.apache.org/jira/browse/SPARK-34601
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Affects Versions: 3.2.0
> Reporter: BoYang
> Priority: Major
> Labels: shuffle
>
> There are multiple efforts going on around disaggregated/remote shuffle services (e.g. [LinkedIn shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], [Facebook shuffle service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service], [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a remote shuffle service is not the Spark External Shuffle Service; it could be a third-party shuffle solution that users enable by setting spark.shuffle.manager. In those systems, shuffle data is stored on a different server than the executor, so Spark should not mark shuffle data as lost when the executor is lost. We could add a Spark configuration to control this behavior. By default, Spark would still mark shuffle files as lost; for disaggregated/remote shuffle services, people could set the configuration to not mark shuffle files as lost.
[jira] [Commented] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service
[ https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296390#comment-17296390 ] BoYang commented on SPARK-34601: It looks like Spark already checks and matches the executor id when it tries to remove map output. I will close this ticket.
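[Editor's note] The proposal amounts to making executor loss a no-op for shuffle bookkeeping when shuffle data lives off-executor. A hypothetical sketch of that switch follows; the config name "spark.shuffle.remote.preserveOnExecutorLoss" is invented for illustration and is not a real Spark setting.

```python
# Hypothetical: decide which map outputs survive an executor-lost event.
def surviving_map_outputs(map_outputs, lost_executor, conf):
    """map_outputs: {map_id: executor_id}; conf: plain dict of settings."""
    if conf.get("spark.shuffle.remote.preserveOnExecutorLoss") == "true":
        # Remote shuffle service: data lives off the executor, keep everything.
        return dict(map_outputs)
    # Default behavior: drop outputs that lived on the lost executor.
    return {m: ex for m, ex in map_outputs.items() if ex != lost_executor}

outputs = {0: "exec-1", 1: "exec-2"}
assert surviving_map_outputs(outputs, "exec-1", {}) == {1: "exec-2"}
```

As the closing comment notes, Spark's existing executor-id matching when removing map outputs already gives remote shuffle implementations a way to sidestep this, which is why the ticket was resolved as Won't Fix.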
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296385#comment-17296385 ] Dongjoon Hyun commented on SPARK-34645: --- BTW, since Apache Spark 3.0.2 was released before Apache Spark 3.1.1, Apache Spark 3.1.1 contains all changes in Apache Spark 3.0.2.
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296374#comment-17296374 ]

Dongjoon Hyun commented on SPARK-34645:
---------------------------------------

Could you provide a reproducer for this?

> [K8S] Driver pod stuck in Running state after job completes
> -----------------------------------------------------------
>
>                 Key: SPARK-34645
>                 URL: https://issues.apache.org/jira/browse/SPARK-34645
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.0.2
>        Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
> {code}
>            Reporter: Andy Grove
>            Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this output does not appear with 3.0.2. With 3.0.2 there is no output at all after the job completes.
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296370#comment-17296370 ]

Dongjoon Hyun commented on SPARK-34645:
---------------------------------------

The result above, and my local K8s IT integration test, are using JDK 11.

{code:java}
$ docker run -it --rm $ECR:3.0.2 java --version | tail -n3
openjdk 11.0.10 2021-01-19
OpenJDK Runtime Environment 18.9 (build 11.0.10+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.10+9, mixed mode, sharing)
{code}
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296368#comment-17296368 ]

Dongjoon Hyun commented on SPARK-34645:
---------------------------------------

First of all, both Spark 3.0.1 and 3.0.2 ship the same K8s library, so I guess we can exclude a K8s server/client version compatibility issue.

{code:java}
spark-release:$ ls -al spark-3.0.2-bin-hadoop3.2/jars/kubernetes-*
-rw-r--r--  1 dongjoon  staff    775174 Feb 15 22:32 spark-3.0.2-bin-hadoop3.2/jars/kubernetes-client-4.9.2.jar
-rw-r--r--  1 dongjoon  staff  11908731 Feb 15 22:32 spark-3.0.2-bin-hadoop3.2/jars/kubernetes-model-4.9.2.jar
-rw-r--r--  1 dongjoon  staff      3954 Feb 15 22:32 spark-3.0.2-bin-hadoop3.2/jars/kubernetes-model-common-4.9.2.jar

spark-release:$ ls -al spark-3.0.1-bin-hadoop3.2/jars/kubernetes-*
-rw-r--r--  1 dongjoon  staff    775174 Aug 28  2020 spark-3.0.1-bin-hadoop3.2/jars/kubernetes-client-4.9.2.jar
-rw-r--r--  1 dongjoon  staff  11908731 Aug 28  2020 spark-3.0.1-bin-hadoop3.2/jars/kubernetes-model-4.9.2.jar
-rw-r--r--  1 dongjoon  staff      3954 Aug 28  2020 spark-3.0.1-bin-hadoop3.2/jars/kubernetes-model-common-4.9.2.jar
{code}

The following is the output I get from Apache Spark 3.0.2 on EKS v1.18 as of today. It works for me. Also, Apache Spark 3.0.2 was tested on EKS and Minikube during the RC period, and the K8s IT test coverage on `branch-3.0` has been healthy.

{code:java}
Pi is roughly 3.141517357075868
21/03/05 22:17:19 INFO SparkUI: Stopped Spark web UI at http://pi-81b8dc7804756ea4-driver-svc.default.svc:4040
21/03/05 22:17:19 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
21/03/05 22:17:19 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
21/03/05 22:17:19 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
21/03/05 22:17:20 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/03/05 22:17:20 INFO MemoryStore: MemoryStore cleared
21/03/05 22:17:20 INFO BlockManager: BlockManager stopped
21/03/05 22:17:20 INFO BlockManagerMaster: BlockManagerMaster stopped
21/03/05 22:17:20 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/03/05 22:17:20 INFO SparkContext: Successfully stopped SparkContext
21/03/05 22:17:20 INFO ShutdownHookManager: Shutdown hook called
21/03/05 22:17:20 INFO ShutdownHookManager: Deleting directory /tmp/spark-053d86cb-d3ed-4823-986b-f9dd65dcce6a
21/03/05 22:17:20 INFO ShutdownHookManager: Deleting directory /var/data/spark-eb14ad59-0e8b-431f-a138-0d5138376194/spark-e5844bae-114c-402d-83ba-c9b41dda0bbf
{code}

So, could you describe more about your situation, [~andygrove] and [~tgraves]?
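Beyond eyeballing the `kubernetes-*` jars as above, the same check can be generalized to the whole jar list of two releases. A minimal sketch (the sample jar names below are illustrative stand-ins for `os.listdir("spark-X-bin-hadoop3.2/jars")` on real distributions):

```python
# Hypothetical sketch: compare the bundled jar lists of two unpacked Spark
# releases to spot dependency drift beyond the kubernetes-* artifacts.
def jar_diff(old_jars, new_jars):
    """Return (removed, added) jar names between two releases."""
    old, new = set(old_jars), set(new_jars)
    return sorted(old - new), sorted(new - old)

# Sample data standing in for real jars/ directory listings.
old = ["kubernetes-client-4.9.2.jar", "kubernetes-model-4.9.2.jar"]
new = ["kubernetes-client-4.9.2.jar", "kubernetes-model-4.9.2.jar",
       "example-extra-1.0.jar"]

removed, added = jar_diff(old, new)
print(removed, added)  # [] ['example-extra-1.0.jar']
```

An empty diff, as with the kubernetes-* jars here, supports ruling out a dependency change between 3.0.1 and 3.0.2.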
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296358#comment-17296358 ]

Andy Grove commented on SPARK-34645:
------------------------------------

I just confirmed that this only happens with JDK 11 and works fine with JDK 8.
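One generic way a driver process can stay in Running state after the job finishes is a lingering non-daemon thread that blocks process exit. Whether that is the actual mechanism behind this JDK 11-only hang is not established in this thread; the following is only a toy Python analogue of the pattern, not Spark's shutdown path:

```python
import threading
import time

# Toy analogue (hypothetical, not Spark's actual shutdown code): a non-daemon
# worker thread keeps the process alive until it is explicitly signalled,
# mirroring how a leftover non-daemon JVM thread can leave a driver pod
# "Running" after SparkContext has stopped.
stop = threading.Event()

def worker():
    while not stop.is_set():
        time.sleep(0.01)

t = threading.Thread(target=worker, daemon=False)
t.start()

# Without this signal the interpreter would wait on `t` indefinitely at exit.
stop.set()
t.join(timeout=1.0)
print(t.is_alive())  # False: the worker observed the stop flag and exited
```

A thread dump (`jstack <driver pid>`) of a stuck driver would show whether any such non-daemon thread is still alive.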
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296356#comment-17296356 ]

Dongjoon Hyun commented on SPARK-34645:
---------------------------------------

So, it happens only at Apache Spark 3.0.2? Let me take a look, [~tgraves]. Thank you for reporting, [~andygrove].
[jira] [Updated] (SPARK-34646) TreeNode bind issue for duplicate column name.
[ https://issues.apache.org/jira/browse/SPARK-34646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loc nguyen updated SPARK-34646:
-------------------------------
    Description:

I received a Spark {{TreeNodeException}} when executing a union of two data frames. The error occurs when I assign the union result to a DataFrame that will be returned by a function; however, I am able to assign the union result to a DataFrame that will not be returned. I have examined the schema of all the data frames participating in the code: the PT_Id column is duplicated, and that duplicate results in the failed attribute lookup.

{{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 (TID 5557)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: PT_ID#140575
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78)
 at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
 at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190)
 at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403)
 at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
 at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
 at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
 at scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
 at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
 at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191)
 at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Couldn't find PT_Id#140575 in [Name#34180,PT_Id#34181,PT_Id#127]
 at
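The "Couldn't find PT_Id#140575 in [Name#34180,PT_Id#34181,PT_Id#127]" failure reflects that Catalyst binds attributes by expression ID, not by name, so two columns can share the name "PT_Id" and still fail to bind. A toy Python analogue of that resolution step (hypothetical names; Spark's real logic lives in BoundAttribute.scala):

```python
# Toy analogue of Catalyst's BindReferences: attributes are (name, expr_id)
# pairs and resolution matches on expr_id alone. Duplicate names are fine;
# an expr_id missing from the child output is what triggers the error.
def bind_reference(attr, child_output):
    """Return the ordinal of `attr` in `child_output`, matched by expr_id."""
    name, expr_id = attr
    for ordinal, (_, candidate_id) in enumerate(child_output):
        if candidate_id == expr_id:
            return ordinal
    raise RuntimeError(f"Couldn't find {name}#{expr_id} in {child_output}")

# Child output with a duplicated column name, as in the reported schema.
child = [("Name", 34180), ("PT_Id", 34181), ("PT_Id", 127)]

print(bind_reference(("PT_Id", 34181), child))  # 1: resolved by expr_id

try:
    # The name "PT_Id" matches twice, but expr_id 140575 matches never.
    bind_reference(("PT_Id", 140575), child)
except RuntimeError as e:
    print(e)
```

This is why deduplicating or aliasing columns before the union can make the error disappear: it leaves only attributes whose expression IDs survive into the child output.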
[jira] [Updated] (SPARK-34646) TreeNode bind issue for duplicate column name.
[ https://issues.apache.org/jira/browse/SPARK-34646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loc nguyen updated SPARK-34646: --- Description: I received a Spark {{TreeNodeException}} executing a union of two data frames. When I assign the union results to a DataFrame that will be returned from by a function, this error occurs. However, I am able to assign the union results to a DataFrame that will not be returned. I have examined schema for all the data frames participating in the code. The PT_Id is being duplicated. The PT_Id is duplicated and results in the failed search. {{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 (TID 5557) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: PT_ID#140575 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190) at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191) at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) at scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46) at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.RuntimeException: Couldn't find PT_Id#140575 in [Name#34180,PT_Id#34181,PT_Id#127] at
[jira] [Updated] (SPARK-34646) TreeNode bind issue for duplicate column name.
[ https://issues.apache.org/jira/browse/SPARK-34646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loc nguyen updated SPARK-34646: --- Summary: TreeNode bind issue for duplicate column name. (was: TreeNode bind issue) > TreeNode bind issue for duplicate column name. > -- > > Key: SPARK-34646 > URL: https://issues.apache.org/jira/browse/SPARK-34646 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.3 > Environment: Spark 2.4.3, Scala 2.11.8, Hadoop 3.2.1 >Reporter: loc nguyen >Priority: Major > Labels: spark > > I received a Spark {{TreeNodeException}} executing a union of two data > frames. When I assign the union results to a DataFrame that will be returned > by a function, this error occurs. However, I am able to assign the union > results to a DataFrame that will not be returned. I have examined the schemas > of all the data frames involved in the code. The PT_Id column is duplicated, > and the duplicate results in the failed search. 
> > > {{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 > (TID 5557) > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: DP_Acct_Identifier#140575 at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261) > at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190) > at 
org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191) > at > scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) > at > scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46) > at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at
[jira] [Updated] (SPARK-34646) TreeNode bind issue
[ https://issues.apache.org/jira/browse/SPARK-34646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loc nguyen updated SPARK-34646: --- Description: I received a Spark {{TreeNodeException}} executing a union of two data frames. When I assign the union results to a DataFrame that will be returned by a function, this error occurs. However, I am able to assign the union results to a DataFrame that will not be returned. I have examined the schemas of all the data frames involved in the code. The PT_Id column is duplicated; I am not sure why it is duplicated and causes the failed search. {{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 (TID 5557) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: DP_Acct_Identifier#140575 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190) at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191) at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) at scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46) at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.RuntimeException: Couldn't find PT_Id#140575 in
[jira] [Created] (SPARK-34646) TreeNode bind issue
loc nguyen created SPARK-34646: -- Summary: TreeNode bind issue Key: SPARK-34646 URL: https://issues.apache.org/jira/browse/SPARK-34646 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.4.3 Environment: Spark 2.4.3, Scala 2.11.8, Hadoop 3.2.1 Reporter: loc nguyen I received a Spark {{TreeNodeException}} executing a union of two data frames. When I assign the union results to a DataFrame that will be returned by a function, this error occurs. However, I am able to assign the union results to a DataFrame that will not be returned. I have examined the schemas of all the data frames involved in the code. The PT_Id column is duplicated; I am not sure why it is duplicated and causes the failed search. {{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 (TID 5557) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: DP_Acct_Identifier#140575 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190) at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191) at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) at scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46) at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190) at 
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at
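The failure mode in the SPARK-34646 reports above — a predicate that still references an attribute instance (`PT_Id#140575`) while the child's output carries the same column name under different ids — can be sketched outside Spark. The following Python model of Catalyst-style attribute binding is an illustration only: the `Attribute` class and `bind_reference` helper are hypothetical stand-ins, not Spark APIs, and the names/ids are taken from the error log.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribute:
    name: str       # column name, e.g. "PT_Id"
    expr_id: int    # unique id assigned per attribute instance

def bind_reference(ref: Attribute, child_output: list) -> int:
    """Return the ordinal of `ref` in `child_output`, matching by expr_id.

    Matching by name alone would be ambiguous once a column is duplicated,
    which is why the planner tracks ids -- and why a stale id fails hard.
    """
    for i, attr in enumerate(child_output):
        if attr.expr_id == ref.expr_id:
            return i
    listing = ",".join(f"{a.name}#{a.expr_id}" for a in child_output)
    raise RuntimeError(f"Couldn't find {ref.name}#{ref.expr_id} in [{listing}]")

# Child output as printed in the report: PT_Id appears twice.
child = [Attribute("Name", 34180), Attribute("PT_Id", 34181), Attribute("PT_Id", 127)]

# The predicate references PT_Id under a fresh id -> binding fails,
# reproducing the shape of the "Couldn't find ..." message in the log.
try:
    bind_reference(Attribute("PT_Id", 140575), child)
except RuntimeError as e:
    print(e)

# If the clashing column is renamed so only one attribute carries the
# expected identity, binding succeeds.
child_fixed = [Attribute("Name", 34180), Attribute("PT_Id", 140575), Attribute("PT_Id_dup", 127)]
print(bind_reference(Attribute("PT_Id", 140575), child_fixed))  # → 1
```

In PySpark terms, the commonly reported workaround for this class of error is to rename or alias the clashing column on one side (e.g. with `withColumnRenamed` or an explicit `select` of aliased columns) before the union, so that each output attribute is unambiguous.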
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296340#comment-17296340 ] Thomas Graves commented on SPARK-34645: --- [~dongjoon] [~holden], have you seen this at all? > [K8S] Driver pod stuck in Running state after job completes > --- > > Key: SPARK-34645 > URL: https://issues.apache.org/jira/browse/SPARK-34645 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.2 > Environment: Kubernetes: > {code:java} > Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", > GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", > BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", > Platform:"linux/amd64"} > Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", > GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", > BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", > Platform:"linux/amd64"} > {code} >Reporter: Andy Grove >Priority: Major > > I am running automated benchmarks in k8s, using spark-submit in cluster mode, > so the driver runs in a pod. > When running with Spark 3.0.1 and 3.1.1 everything works as expected and I > see the Spark context being shut down after the job completes. > However, when running with Spark 3.0.2 I do not see the context get shut down > and the driver pod is stuck in the Running state indefinitely. > This is the output I see after job completion with 3.0.1 and 3.1.1; it does > not appear with 3.0.2, where there is no output at all after the job > completes. 
> {code:java} > 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from > shutdown hook > 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped > Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040} > 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at > http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040 > 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting > down all executors > 2021-03-05 20:09:24,600 INFO > k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each > executor to shut down > 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes > client has been closed (this is expected if the application is shutting down.) > 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared > 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped > 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster > stopped > 2021-03-05 20:09:24,752 INFO > scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped > SparkContext > 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called > 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory > /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a > 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory > /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814 > 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system > metrics system... > 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics > system stopped. > 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics > system shutdown complete. 
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
Andy Grove created SPARK-34645: -- Summary: [K8S] Driver pod stuck in Running state after job completes Key: SPARK-34645 URL: https://issues.apache.org/jira/browse/SPARK-34645 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.2 Environment: Kubernetes: {code:java} Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"} {code} Reporter: Andy Grove I am running automated benchmarks in k8s, using spark-submit in cluster mode, so the driver runs in a pod. When running with Spark 3.0.1 and 3.1.1 everything works as expected and I see the Spark context being shut down after the job completes. However, when running with Spark 3.0.2 I do not see the context get shut down and the driver pod is stuck in the Running state indefinitely. This is the output I see after job completion with 3.0.1 and 3.1.1 and this output does not appear with 3.0.2. With 3.0.2 there is no output at all after the job completes. 
{code:java} 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from shutdown hook 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040} 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting down all executors 2021-03-05 20:09:24,600 INFO k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.) 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster stopped 2021-03-05 20:09:24,752 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped SparkContext 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system... 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped. 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete. 
{code}
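Since the 3.0.2 regression reported above means the `Invoking stop() from shutdown hook` sequence at the top of the log never runs, one defensive pattern is to stop the context explicitly in the job's own finally block rather than relying on the JVM shutdown hook. The sketch below is a minimal model of that pattern; `FakeSparkContext` is a hypothetical stand-in for the real SparkSession/SparkContext, used so the shape of the pattern is visible without a cluster:

```python
class FakeSparkContext:
    """Hypothetical stand-in for pyspark's context object.

    In a real driver this would be SparkSession.builder.getOrCreate(),
    and .stop() would tear down executors and let the driver JVM exit.
    """
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

def run_benchmark(sc):
    try:
        # ... job body: read input, run queries, write results ...
        return "done"
    finally:
        # Stop explicitly so the driver can exit and the pod completes,
        # instead of relying on the stop() shutdown hook shown in the log.
        sc.stop()

sc = FakeSparkContext()
print(run_benchmark(sc), sc.stopped)  # → done True
```

This does not fix the underlying 3.0.2 regression; it only ensures the driver attempts a clean shutdown even when the shutdown hook never fires.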
[jira] [Assigned] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
[ https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34295: Assignee: L. C. Hsieh (was: Apache Spark) > Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude > > > Key: SPARK-34295 > URL: https://issues.apache.org/jira/browse/SPARK-34295 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > MapReduce jobs can instruct YARN to skip renewal of tokens obtained from > certain hosts by specifying the hosts with the configuration > mapreduce.job.hdfs-servers.token-renewal.exclude=<host1>,<host2>,... > But Spark seems to lack a similar option, so job submission fails if YARN > fails to renew the DelegationToken for any of the remote HDFS clusters. The > failure in DT renewal can happen for many reasons, e.g. the remote HDFS does > not trust the Kerberos identity of YARN.
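For comparison, this is roughly what the requested configuration looks like. The MapReduce property is real; the Spark line is illustrative only, since the eventual Spark option name is not settled in this thread, and the host values are placeholders:

```properties
# MapReduce: skip delegation-token renewal for these filesystems
mapreduce.job.hdfs-servers.token-renewal.exclude=host1.example.com,host2.example.com

# Hypothetical Spark analogue requested by this issue (name illustrative):
# spark.yarn.kerberos.renewal.excludeHadoopFileSystems=hdfs://remote-ns1,hdfs://remote-ns2
```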
[jira] [Assigned] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
[ https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34295: Assignee: Apache Spark (was: L. C. Hsieh) > Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude > > > Key: SPARK-34295 > URL: https://issues.apache.org/jira/browse/SPARK-34295 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > MapReduce jobs can instruct YARN to skip renewal of tokens obtained from > certain hosts by specifying the hosts with the configuration > mapreduce.job.hdfs-servers.token-renewal.exclude=<host1>,<host2>,... > But Spark seems to lack a similar option, so job submission fails if YARN > fails to renew the DelegationToken for any of the remote HDFS clusters. The > failure in DT renewal can happen for many reasons, e.g. the remote HDFS does > not trust the Kerberos identity of YARN.
[jira] [Commented] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
[ https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296324#comment-17296324 ] Apache Spark commented on SPARK-34295: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/31761
[jira] [Commented] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
[ https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296327#comment-17296327 ] Apache Spark commented on SPARK-34295: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/31761
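The exclude-list idea above can be sketched as a submit-time Spark configuration, mirroring the MapReduce property. This is a minimal illustrative sketch: the configuration key name and host URIs below are assumptions (the actual name was settled in the linked pull request), not a confirmed Spark API.

```python
# Hypothetical sketch: build spark-submit arguments carrying an exclude list
# analogous to mapreduce.job.hdfs-servers.token-renewal.exclude.
# EXCLUDE_KEY is an assumed name, not a verified Spark configuration key.
EXCLUDE_KEY = "spark.kerberos.renewal.excludeHadoopFileSystems"

def renewal_exclude_args(filesystems):
    """Return spark-submit --conf arguments that would exclude the given
    filesystems from delegation-token renewal."""
    return ["--conf", f"{EXCLUDE_KEY}={','.join(filesystems)}"]

args = renewal_exclude_args(["hdfs://remote-nn1:8020", "hdfs://remote-nn2:8020"])
print(" ".join(args))
```

With such an option, YARN would skip DelegationToken renewal for the listed remote clusters instead of failing the whole submission.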
[jira] [Updated] (SPARK-34644) UDF returning array followed by explode returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-34644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gavrilescu Laurentiu updated SPARK-34644: - Description: *Applying a UDF followed by explode appears to call the UDF twice.* Imagine the following scenario: 1. you have a dataframe with some string columns 2. you have an expensive function that computes a score from a string input 3. you want to get all the distinct values from all the columns together with their scores, using an executor-level cache that holds the scores already computed for strings, to minimize invocations of the expensive function. Consider the following code to reproduce it:
{code:java}
case class RowWithStrings(c1: String, c2: String, c3: String)
case class ValueScore(value: String, score: Double)

object Bug {
  val columns: List[String] = List("c1", "c2", "c3")

  def score(input: String): Double = {
    // insert expensive function here
    input.toDouble
  }

  def main(args: Array[String]) {
    lazy val sparkSession: SparkSession = {
      val sparkSession = SparkSession.builder.master("local[4]").getOrCreate()
      sparkSession
    }

    // some cache over the expensive operation
    val cache: TrieMap[String, Double] = TrieMap[String, Double]()

    // get scores for all columns in the row
    val body = (row: Row) => {
      val arr = ArrayBuffer[ValueScore]()
      columns foreach { column =>
        val value = row.getAs[String](column)
        if (!cache.contains(value)) {
          val computedScore = score(value)
          arr += ValueScore(value, computedScore)
          cache(value) = computedScore
        }
      }
      arr
    }
    val basicUdf = udf(body)

    val values = (1 to 5) map { idx =>
      // repeated values
      RowWithStrings(idx.toString, idx.toString, idx.toString)
    }

    import sparkSession.implicits._
    val df = values.toDF("c1", "c2", "c3").persist()
    val allCols = df.columns.map(col)

    df.select(basicUdf(struct(allCols: _*)).as("valuesScore"))
      .select(explode(col("valuesScore")))
      .distinct()
      .show()
  }
}
{code}
this shows:
{code:java}
+---+
|col|
+---+
+---+
{code}
When adding persist before explode, the result is correct:
{code:java}
df.select(basicUdf(struct(allCols: _*)).as("valuesScore"))
  .persist()
  .select(explode(col("valuesScore")))
  .distinct()
  .show()
{code}
=>
{code:java}
+--------+
|     col|
+--------+
|{2, 2.0}|
|{4, 4.0}|
|{3, 3.0}|
|{5, 5.0}|
|{1, 1.0}|
+--------+
{code}
This is not reproducible with version 3.0.2.
[jira] [Created] (SPARK-34644) UDF returning array followed by explode returns wrong results
Gavrilescu Laurentiu created SPARK-34644: Summary: UDF returning array followed by explode returns wrong results Key: SPARK-34644 URL: https://issues.apache.org/jira/browse/SPARK-34644 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Reporter: Gavrilescu Laurentiu *Applying a UDF followed by explode appears to call the UDF twice.*
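A pure-Python sketch (no Spark required) of the failure mode the reporter describes: if the plan ends up evaluating the UDF body more than once, the side-effecting executor cache is already warm on the second evaluation, so that evaluation returns an empty array, which is what explode() then sees. This only illustrates the reported symptom, not Spark internals.

```python
# Simulates the cache-inside-UDF pattern from the reproduction above.
# On a second evaluation over the same rows, every value is already cached,
# so the returned buffer is empty -- matching the empty show() output.
cache = {}

def body(row_values):
    arr = []
    for value in row_values:
        if value not in cache:
            score = float(value)   # stand-in for the expensive score()
            arr.append((value, score))
            cache[value] = score
    return arr

first_eval = body(["1", "2", "3", "4", "5"])
second_eval = body(["1", "2", "3", "4", "5"])  # re-evaluation, as in the bug

assert len(first_eval) == 5
assert second_eval == []  # everything already cached -> empty result
```

This is also why persisting right after the UDF fixes the output: the UDF result is materialized once, so a second evaluation never happens.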
[jira] [Updated] (SPARK-34392) Invalid ID for offset-based ZoneId since Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-34392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-34392: - Fix Version/s: 3.0.3 > Invalid ID for offset-based ZoneId since Spark 3.0 > -- > > Key: SPARK-34392 > URL: https://issues.apache.org/jira/browse/SPARK-34392 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Yuming Wang >Assignee: karl wang >Priority: Major > Fix For: 3.2.0, 3.1.2, 3.0.3 > > > How to reproduce this issue: > {code:sql} > select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00"); > {code} > Spark 2.4: > {noformat} > spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00"); > 2020-02-07 08:00:00 > Time taken: 0.089 seconds, Fetched 1 row(s) > {noformat} > Spark 3.x: > {noformat} > spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00"); > 21/02/07 01:24:32 ERROR SparkSQLDriver: Failed in [select > to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")] > java.time.DateTimeException: Invalid ID for offset-based ZoneId: GMT+8:00 > at java.time.ZoneId.ofWithPrefix(ZoneId.java:437) > at java.time.ZoneId.of(ZoneId.java:407) > at java.time.ZoneId.of(ZoneId.java:359) > at java.time.ZoneId.of(ZoneId.java:315) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:53) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.toUTCTime(DateTimeUtils.scala:814) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
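The stack trace above fails in java.time.ZoneId.ofWithPrefix, which accepts offset suffixes only in the forms ZoneOffset allows (e.g. two-digit hours, GMT+08:00), while the legacy java.util.TimeZone parser used by Spark 2.4 tolerated GMT+8:00. An illustrative, non-Spark helper that zero-pads such IDs; the function name and accepted patterns are assumptions for demonstration:

```python
# Illustrative only: normalize a "GMT+8:00"-style zone ID to the
# "GMT+08:00" form that java.time.ZoneId accepts.
import re

def normalize_gmt_offset(zone: str) -> str:
    m = re.fullmatch(r"(GMT|UTC|UT)([+-])(\d{1,2})(:\d{2})?", zone)
    if not m:
        return zone  # region-based IDs like Asia/Shanghai pass through
    prefix, sign, hours, minutes = m.groups()
    return f"{prefix}{sign}{int(hours):02d}{minutes or ':00'}"

print(normalize_gmt_offset("GMT+8:00"))  # GMT+08:00
```

So a user hitting this on an unpatched 3.x build can work around it by writing `to_utc_timestamp("2020-02-07 16:00:00", "GMT+08:00")`.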
[jira] [Updated] (SPARK-34563) Checkpointing a union with another checkpoint fails
[ https://issues.apache.org/jira/browse/SPARK-34563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Kamprath updated SPARK-34563: - Affects Version/s: 3.1.1 > Checkpointing a union with another checkpoint fails > --- > > Key: SPARK-34563 > URL: https://issues.apache.org/jira/browse/SPARK-34563 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.2, 3.1.1 > Environment: I am running Spark 3.0.2 in standalone cluster mode, > built for Hadoop 2.7, and Scala 2.12.12. I am using QFS 2.2.2 (Quantcast File > System) as the underlying DFS. The nodes run on Debian Stretch, and Java is > openjdk version "1.8.0_275". >Reporter: Michael Kamprath >Priority: Major > > I have some PySpark code that periodically checkpoints a data frame that I > am building in pieces by union-ing those pieces together as they are > constructed. (Py)Spark fails on the second checkpoint, which would be a union > of a new piece of the desired data frame with a previously checkpointed > piece. Some simplified PySpark code that will trigger this problem is:
> {code:java}
> RANGE_STEP = 1
> PARTITIONS = 5
> COUNT_UNIONS = 20
> df = spark.range(1, RANGE_STEP+1, numPartitions=PARTITIONS)
> for i in range(1, COUNT_UNIONS+1):
>     print('Processing i = {0}'.format(i))
>     new_df = spark.range(RANGE_STEP*i + 1, RANGE_STEP*(i+1) + 1, numPartitions=PARTITIONS)
>     df = df.union(new_df).checkpoint()
>
> df.count()
> {code}
> When this code gets to the checkpoint on the second loop iteration (i=2), the job fails with an error:
> {code:java}
> Py4JJavaError: An error occurred while calling o119.checkpoint.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 > in stage 10.0 failed 4 times, most recent failure: Lost task 9.3 in stage > 10.0 (TID 264, 10.20.30.13, executor 0): > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9062 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:296) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804) > at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227) > at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227) > at > org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2154) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > 
at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973) > at
[jira] [Commented] (SPARK-34563) Checkpointing a union with another checkpoint fails
[ https://issues.apache.org/jira/browse/SPARK-34563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296250#comment-17296250 ] Michael Kamprath commented on SPARK-34563: -- I just tested this under Spark 3.1.1 keep everything else in my set up the same, and it fails at the same point. However, the exception thrown looks slightly different: {code:java} Py4JJavaError Traceback (most recent call last) in 8 print('Processing i = {0}'.format(i)) 9 new_df = spark.range(RANGE_STEP*i + 1, RANGE_STEP*(i+1) + 1, numPartitions=PARTITIONS) ---> 10 df = df.union(new_df).checkpoint() 11 12 df.count() /usr/spark-3.1.1/python/pyspark/sql/dataframe.py in checkpoint(self, eager) 544 This API is experimental. 545 """ --> 546 jdf = self._jdf.checkpoint(eager) 547 return DataFrame(jdf, self.sql_ctx) 548 /usr/spark-3.1.1/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args) 1303 answer = self.gateway_client.send_command(command) 1304 return_value = get_return_value( -> 1305 answer, self.gateway_client, self.target_id, self.name) 1306 1307 for temp_arg in temp_args: /usr/spark-3.1.1/python/pyspark/sql/utils.py in deco(*a, **kw) 109 def deco(*a, **kw): 110 try: --> 111 return f(*a, **kw) 112 except py4j.protocol.Py4JJavaError as e: 113 converted = convert_exception(e.java_exception) /usr/spark-3.1.1/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 326 raise Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 format(target_id, ".", name), value) 329 else: 330 raise Py4JError( Py4JJavaError: An error occurred while calling o65.checkpoint. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 2.0 failed 4 times, most recent failure: Lost task 8.3 in stage 2.0 (TID 50) (10.20.30.17 executor 3): java.lang.IndexOutOfBoundsException: Index: 61, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:659) at java.util.ArrayList.get(ArrayList.java:435) at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:811) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:296) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253) at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201) at
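Both stack traces above fail while Kryo is deserializing the checkpointed partitions ("Encountered unregistered class ID" on 3.0.2, an IndexOutOfBoundsException in MapReferenceResolver on 3.1.1). One diagnostic step, sketched below, is to rerun the reproduction with Spark's JavaSerializer to isolate whether Kryo is the trigger. `spark.serializer` is a standard Spark configuration key, but treating it as a workaround here is an assumption, not a confirmed fix for this ticket.

```python
# Build spark-submit flags for a diagnostic rerun using Java serialization
# instead of Kryo. Only the diagnostic intent is hypothetical; the config
# key itself is a documented Spark setting.
def java_serializer_args():
    return [
        "--conf",
        "spark.serializer=org.apache.spark.serializer.JavaSerializer",
    ]

print(" ".join(java_serializer_args()))
```

If the job succeeds with JavaSerializer, that points the investigation at Kryo (de)serialization of the checkpointed union RDD rather than the checkpoint logic itself.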
[jira] [Assigned] (SPARK-34643) Use CRAN URL in canonical form
[ https://issues.apache.org/jira/browse/SPARK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34643: - Assignee: Hyukjin Kwon > Use CRAN URL in canonical form > -- > > Key: SPARK-34643 > URL: https://issues.apache.org/jira/browse/SPARK-34643 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Hyukjin Kwon >Priority: Major >
[jira] [Resolved] (SPARK-34643) Use CRAN URL in canonical form
[ https://issues.apache.org/jira/browse/SPARK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34643. --- Fix Version/s: 3.1.2 3.2.0 Resolution: Fixed Issue resolved by pull request 31759 [https://github.com/apache/spark/pull/31759] > Use CRAN URL in canonical form > -- > > Key: SPARK-34643 > URL: https://issues.apache.org/jira/browse/SPARK-34643 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.2.0, 3.1.2 > >
[jira] [Resolved] (SPARK-34624) Filter non-jar dependencies from ivy/maven coordinates
[ https://issues.apache.org/jira/browse/SPARK-34624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34624. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31741 [https://github.com/apache/spark/pull/31741] > Filter non-jar dependencies from ivy/maven coordinates > -- > > Key: SPARK-34624 > URL: https://issues.apache.org/jira/browse/SPARK-34624 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: Shardul Mahadik >Assignee: Shardul Mahadik >Priority: Major > Fix For: 3.2.0 > > > Some maven artifacts define non-jar dependencies. One such example is > {{hive-exec}}'s dependency on the {{pom}} of {{apache-curator}} > https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.8/hive-exec-2.3.8.pom > Today trying to depend on such an artifact using {{--packages}} will print an > error but continue without including the non-jar dependency. > {code} > 1/03/04 09:46:49 ERROR SparkContext: Failed to add > file:/Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar > to Spark environment > java.io.FileNotFoundException: Jar > /Users/shardul/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar not > found > at > org.apache.spark.SparkContext.addLocalJarFile$1(SparkContext.scala:1935) > at org.apache.spark.SparkContext.addJar(SparkContext.scala:1990) > at org.apache.spark.SparkContext.$anonfun$new$12(SparkContext.scala:501) > at > org.apache.spark.SparkContext.$anonfun$new$12$adapted(SparkContext.scala:501) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > {code} > Doing the same using {{spark.sql("ADD JAR > ivy://org.apache.hive:hive-exec:2.3.8?exclude=org.pentaho:pentaho-aggdesigner-algorithm")}} > will cause a failure > {code} > ADD JAR /Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar > /Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar does > not exist > == > END 
HIVE FAILURE OUTPUT > == > org.apache.spark.sql.execution.QueryExecutionException: > /Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar does > not exist > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:800) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:787) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:947) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.$anonfun$addJar$1(HiveSessionStateBuilder.scala:130) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.$anonfun$addJar$1$adapted(HiveSessionStateBuilder.scala:129) > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129) > at > org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703) > at org.apache.spark.sql.Dataset.(Dataset.scala:228) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) >
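The fix described above amounts to filtering the artifacts Ivy resolves by type before they are handed to SparkContext.addJar, so a `pom` dependency such as apache-curator's is never treated as a jar. The sketch below is illustrative pure Python, not Spark's internal API; the function name and the set of jar-like types are assumptions.

```python
# Illustrative sketch of the filtering idea: keep only jar-like resolved
# artifacts. JAR_TYPES is an assumed set, not Spark's actual list.
JAR_TYPES = {"jar", "bundle", "maven-plugin"}

def filter_jar_artifacts(artifacts):
    """Given resolved artifacts as (filename, ivy_type) pairs, return the
    filenames of jar-like artifacts only."""
    return [name for name, ivy_type in artifacts if ivy_type in JAR_TYPES]

resolved = [
    ("org.apache.hive_hive-exec-2.3.8.jar", "jar"),
    ("org.apache.curator_apache-curator-2.7.1.pom", "pom"),  # the non-jar dep
]
print(filter_jar_artifacts(resolved))
```

With the non-jar artifact filtered out, neither the `--packages` warning nor the `ADD JAR ... does not exist` failure above would occur.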
[jira] [Assigned] (SPARK-34624) Filter non-jar dependencies from ivy/maven coordinates
[ https://issues.apache.org/jira/browse/SPARK-34624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34624: - Assignee: Shardul Mahadik
[jira] [Commented] (SPARK-34641) ARM CI failed due to download hive-exec failed
[ https://issues.apache.org/jira/browse/SPARK-34641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296239#comment-17296239 ] Shane Knapp commented on SPARK-34641: - looks like hive-exec downloaded fine in my recent build: [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/565/console]
{code:java}
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/apache/hive/hive-exec/2.3.8/hive-exec-2.3.8.pom
Progress (1): 31 kB
Downloaded from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/apache/hive/hive-exec/2.3.8/hive-exec-2.3.8.pom (31 kB at 22 kB/s)
{code}
we'll wait and see how this build turns out. > ARM CI failed due to download hive-exec failed > -- > > Key: SPARK-34641 > URL: https://issues.apache.org/jira/browse/SPARK-34641 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.1.2 >Reporter: Yikun Jiang >Priority: Major > > The download failure (org.apache.hive#hive-exec;3.0.0!hive-exec.jar) > happened in recent spark-master-test-maven-arm builds. > But it is not reproducible, and all Hive tests pass in my local ARM and x86 > environments; the Jenkins x86 job [2] is also unstable for other reasons, so > it looks like we don't have a valid reference. > [1] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/] > [2] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/] >
[jira] [Commented] (SPARK-34641) ARM CI failed due to download hive-exec failed
[ https://issues.apache.org/jira/browse/SPARK-34641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296235#comment-17296235 ] Shane Knapp commented on SPARK-34641: - no clue, tbh. i'll try and take a closer look at the build logs later, but for now i wiped the ivy and maven caches on the VM and triggered a fresh build. > ARM CI failed due to download hive-exec failed > -- > > Key: SPARK-34641 > URL: https://issues.apache.org/jira/browse/SPARK-34641 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.1.2 >Reporter: Yikun Jiang >Priority: Major > > The download failed (org.apache.hive#hive-exec;3.0.0!hive-exec.jar) was > happened in recent spark-master-test-maven-arm build tests. > But it's not reproduced and all hive test passed in my arm and x86 local env. > and the jenkins x86 test like [2] is also unstable due to other reason, so > looks like we can't find a valid reference. > [1] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/] > [2] > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34643) Use CRAN URL in canonical form
[ https://issues.apache.org/jira/browse/SPARK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34643: Assignee: Apache Spark > Use CRAN URL in canonical form > -- > > Key: SPARK-34643 > URL: https://issues.apache.org/jira/browse/SPARK-34643 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34643) Use CRAN URL in canonical form
[ https://issues.apache.org/jira/browse/SPARK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34643: Assignee: (was: Apache Spark) > Use CRAN URL in canonical form > -- > > Key: SPARK-34643 > URL: https://issues.apache.org/jira/browse/SPARK-34643 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34643) Use CRAN URL in canonical form
[ https://issues.apache.org/jira/browse/SPARK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296227#comment-17296227 ] Apache Spark commented on SPARK-34643: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/31759 > Use CRAN URL in canonical form > -- > > Key: SPARK-34643 > URL: https://issues.apache.org/jira/browse/SPARK-34643 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34642) TypeError in Pyspark Linear Regression docs
[ https://issues.apache.org/jira/browse/SPARK-34642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296226#comment-17296226 ] Apache Spark commented on SPARK-34642: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/31760 > TypeError in Pyspark Linear Regression docs > --- > > Key: SPARK-34642 > URL: https://issues.apache.org/jira/browse/SPARK-34642 > Project: Spark > Issue Type: Bug > Components: Documentation, ML, PySpark >Affects Versions: 3.1.1 >Reporter: Sean R. Owen >Priority: Minor > > The documentation for Linear Regression in Pyspark includes an example, which > contains a TypeError: > https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html?highlight=linearregression#pyspark.ml.regression.LinearRegression > {code} > >>> abs(model.transform(test1).head().newPrediction - 1.0) < 0.001 > True > >>> lr.setParams("vector") > Traceback (most recent call last): > ... > TypeError: Method setParams forces keyword arguments. > {code} > I'm pretty sure we don't intend this, and it can be resolved by just changing > the function call to use keyword args. > (HT Brooke Wenig for flagging this) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34642) TypeError in Pyspark Linear Regression docs
[ https://issues.apache.org/jira/browse/SPARK-34642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34642: Assignee: Apache Spark > TypeError in Pyspark Linear Regression docs > --- > > Key: SPARK-34642 > URL: https://issues.apache.org/jira/browse/SPARK-34642 > Project: Spark > Issue Type: Bug > Components: Documentation, ML, PySpark >Affects Versions: 3.1.1 >Reporter: Sean R. Owen >Assignee: Apache Spark >Priority: Minor > > The documentation for Linear Regression in Pyspark includes an example, which > contains a TypeError: > https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html?highlight=linearregression#pyspark.ml.regression.LinearRegression > {code} > >>> abs(model.transform(test1).head().newPrediction - 1.0) < 0.001 > True > >>> lr.setParams("vector") > Traceback (most recent call last): > ... > TypeError: Method setParams forces keyword arguments. > {code} > I'm pretty sure we don't intend this, and it can be resolved by just changing > the function call to use keyword args. > (HT Brooke Wenig for flagging this) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34642) TypeError in Pyspark Linear Regression docs
[ https://issues.apache.org/jira/browse/SPARK-34642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34642: Assignee: (was: Apache Spark) > TypeError in Pyspark Linear Regression docs > --- > > Key: SPARK-34642 > URL: https://issues.apache.org/jira/browse/SPARK-34642 > Project: Spark > Issue Type: Bug > Components: Documentation, ML, PySpark >Affects Versions: 3.1.1 >Reporter: Sean R. Owen >Priority: Minor > > The documentation for Linear Regression in Pyspark includes an example, which > contains a TypeError: > https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html?highlight=linearregression#pyspark.ml.regression.LinearRegression > {code} > >>> abs(model.transform(test1).head().newPrediction - 1.0) < 0.001 > True > >>> lr.setParams("vector") > Traceback (most recent call last): > ... > TypeError: Method setParams forces keyword arguments. > {code} > I'm pretty sure we don't intend this, and it can be resolved by just changing > the function call to use keyword args. > (HT Brooke Wenig for flagging this) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34592) Mark indeterminate RDD in Web UI
[ https://issues.apache.org/jira/browse/SPARK-34592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-34592. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31707 [https://github.com/apache/spark/pull/31707] > Mark indeterminate RDD in Web UI > > > Key: SPARK-34592 > URL: https://issues.apache.org/jira/browse/SPARK-34592 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > It is somehow hard to track which part is indeterminate in a graph of RDDs. > In some cases we may need to track indeterminate RDDs. For example, > indeterminate map stage fails and Spark is unable to fallback some parent > stages. If Web UI can show up indeterminate RDD like cached RDD, it could be > useful to track it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34643) Use CRAN URL in canonical form
Dongjoon Hyun created SPARK-34643: - Summary: Use CRAN URL in canonical form Key: SPARK-34643 URL: https://issues.apache.org/jira/browse/SPARK-34643 Project: Spark Issue Type: Bug Components: R Affects Versions: 3.2.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34642) TypeError in Pyspark Linear Regression docs
Sean R. Owen created SPARK-34642: Summary: TypeError in Pyspark Linear Regression docs Key: SPARK-34642 URL: https://issues.apache.org/jira/browse/SPARK-34642 Project: Spark Issue Type: Bug Components: Documentation, ML, PySpark Affects Versions: 3.1.1 Reporter: Sean R. Owen The documentation for Linear Regression in Pyspark includes an example, which contains a TypeError: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html?highlight=linearregression#pyspark.ml.regression.LinearRegression {code} >>> abs(model.transform(test1).head().newPrediction - 1.0) < 0.001 True >>> lr.setParams("vector") Traceback (most recent call last): ... TypeError: Method setParams forces keyword arguments. {code} I'm pretty sure we don't intend this, and it can be resolved by just changing the function call to use keyword args. (HT Brooke Wenig for flagging this) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
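For context on the docs bug above: `setParams` rejects positional arguments, which is why `lr.setParams("vector")` in the documented example raises, and passing keyword arguments is the documented fix. A minimal pure-Python sketch of that failure mode and its fix, using keyword-only parameters (illustrative only; PySpark enforces this through its own argument checking, not necessarily this syntax):

```python
# Sketch of a method that "forces keyword arguments": everything after the
# bare '*' can only be passed by keyword, so a positional call raises TypeError.
def set_params(*, featuresCol="features", labelCol="label"):
    return {"featuresCol": featuresCol, "labelCol": labelCol}

# Positional call fails, analogous to lr.setParams("vector") in the docs:
try:
    set_params("vector")
except TypeError as e:
    print("TypeError:", e)

# The fix proposed in the ticket: pass the parameter by keyword.
print(set_params(featuresCol="vector"))
# {'featuresCol': 'vector', 'labelCol': 'label'}
```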
[jira] [Updated] (SPARK-34636) sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly.
[ https://issues.apache.org/jira/browse/SPARK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-34636: --- Description: UnresolvedReference, AttributeReference and Alias which take qualifier don't quote qualified names properly. One instance is reported in SPARK-34626. Other instances are like as follows. {code} UnresolvedAttribute("a`b"::"c.d"::Nil).sql a`b.`c.d` // expected: `a``b`.`c.d` AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql c.d.`a.b` // expected: `c.d`.`a.b` Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = "d.e"::Nil).sql `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c` {code} was: UnresolvedReference, AttributeReference and Alias which take qualifier don't quote qualified names properly. One instance is reported in SPARK-34626. Other instances are like as follows. {code} UnresolvedAttribute("a`b"::"c.d"::Nil).sql a`b.`c.d` // expected: `a``b`.`c.d` AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil) c.d.`a.b` // expected: `c.d`.`a.b` Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = "d.e"::Nil).sql `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c` {code} > sql method in UnresolvedAttribute, AttributeReference and Alias don't quote > qualified names properly. > - > > Key: SPARK-34636 > URL: https://issues.apache.org/jira/browse/SPARK-34636 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > UnresolvedReference, AttributeReference and Alias which take qualifier don't > quote qualified names properly. > One instance is reported in SPARK-34626. > Other instances are like as follows. 
> {code} > UnresolvedAttribute("a`b"::"c.d"::Nil).sql > a`b.`c.d` // expected: `a``b`.`c.d` > AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql > c.d.`a.b` // expected: `c.d`.`a.b` > Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = > "d.e"::Nil).sql > `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c` > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
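All of the expected outputs in the ticket follow a single quoting rule: wrap each part of the qualified name in backticks and double any backtick embedded in a part. A small hypothetical helper (not Spark's actual Scala implementation) makes the rule concrete:

```python
def quote_part(part: str) -> str:
    # Wrap one name part in backticks, doubling any embedded backtick.
    return "`" + part.replace("`", "``") + "`"

def quote_qualified(parts) -> str:
    # Quote each part of the qualified name independently, then join with '.'.
    return ".".join(quote_part(p) for p in parts)

print(quote_qualified(["a`b", "c.d"]))  # `a``b`.`c.d`
print(quote_qualified(["c.d", "a.b"]))  # `c.d`.`a.b`
```

This matches the "expected" lines in the ticket: the reported bug is that `sql` emits the qualifier parts unquoted (e.g. `c.d.` instead of `` `c.d`. ``).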
[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling
[ https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296125#comment-17296125 ] Itay Bittan commented on SPARK-33288: - [~tgraves] you are right! I used my compiled version (3.0.1) instead of the official 3.1.1! thanks! > Support k8s cluster manager with stage level scheduling > --- > > Key: SPARK-33288 > URL: https://issues.apache.org/jira/browse/SPARK-33288 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.1.0 > > > Kubernetes supports dynamic allocation via the > {{spark.dynamicAllocation.shuffleTracking.enabled}} > {{config, we can add support for stage level scheduling when that is turned > on. }} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34641) ARM CI failed due to download hive-exec failed
[ https://issues.apache.org/jira/browse/SPARK-34641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296072#comment-17296072 ] Yikun Jiang commented on SPARK-34641: - [~shaneknapp] Do you have any idea on it? > ARM CI failed due to download hive-exec failed > -- > > Key: SPARK-34641 > URL: https://issues.apache.org/jira/browse/SPARK-34641 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.1.2 >Reporter: Yikun Jiang >Priority: Major > > The download failed (org.apache.hive#hive-exec;3.0.0!hive-exec.jar) was > happened in recent spark-master-test-maven-arm build tests. > But it's not reproduced and all hive test passed in my arm and x86 local env. > and the jenkins x86 test like [2] is also unstable due to other reason, so > looks like we can't find a valid reference. > [1] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/] > [2] > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34641) ARM CI failed due to download hive-exec failed
Yikun Jiang created SPARK-34641: --- Summary: ARM CI failed due to download hive-exec failed Key: SPARK-34641 URL: https://issues.apache.org/jira/browse/SPARK-34641 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.1.2 Reporter: Yikun Jiang The download failure (org.apache.hive#hive-exec;3.0.0!hive-exec.jar) occurred in recent spark-master-test-maven-arm build tests. But it is not reproducible: all Hive tests passed in my ARM and x86 local environments, and the Jenkins x86 tests like [2] are also unstable for other reasons, so it looks like we can't find a valid reference. [1] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/] [2] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling
[ https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296060#comment-17296060 ] Thomas Graves commented on SPARK-33288: --- It looks like the version you are trying to launch against is incompatible. ie your backend is using spark 3.0 and your client is using spark 3.1.1. Use the same version for both. > Support k8s cluster manager with stage level scheduling > --- > > Key: SPARK-33288 > URL: https://issues.apache.org/jira/browse/SPARK-33288 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.1.0 > > > Kubernetes supports dynamic allocation via the > {{spark.dynamicAllocation.shuffleTracking.enabled}} > {{config, we can add support for stage level scheduling when that is turned > on. }} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling
[ https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296027#comment-17296027 ] Itay Bittan commented on SPARK-33288: - Hi! I just upgraded from Spark 3.0.1 to 3.1.1 and I'm having an issue with the resourceProfileId. I saw that this [PR|https://github.com/apache/spark/pull/30204/files] added this argument, but I didn't find any documentation about it. I'm running a simple Spark application in client mode via Jupyter (as the driver pod) + pyspark. I also asked on [SO|https://stackoverflow.com/questions/66482218/spark-executors-fails-on-kubernetes-resourceprofileid-is-missing]. I'd appreciate any clue. > Support k8s cluster manager with stage level scheduling > --- > > Key: SPARK-33288 > URL: https://issues.apache.org/jira/browse/SPARK-33288 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.1.0 > > > Kubernetes supports dynamic allocation via the > {{spark.dynamicAllocation.shuffleTracking.enabled}} > {{config, we can add support for stage level scheduling when that is turned > on. }} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34640) unable to access grouping column after groupBy
Jiri Humpolicek created SPARK-34640: --- Summary: unable to access grouping column after groupBy Key: SPARK-34640 URL: https://issues.apache.org/jira/browse/SPARK-34640 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Reporter: Jiri Humpolicek When I group by a nested column, I am unable to reference it after the groupBy operation. Example: 1) Preparing a dataframe with a nested column:
{code:scala}
case class Sub(a2: String)
case class Top(a1: String, s: Sub)

val s = Seq(
  Top("r1", Sub("s1")),
  Top("r2", Sub("s3"))
)
val df = s.toDF
df.printSchema
// root
//  |-- a1: string (nullable = true)
//  |-- s: struct (nullable = true)
//  |    |-- a2: string (nullable = true)
{code}
2) try to access the grouping column after groupBy:
{code:scala}
df.groupBy($"s.a2").count.select('a2)
// org.apache.spark.sql.AnalysisException: cannot resolve '`a2`' given input columns: [count, s.a2];
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly
[ https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-34634: --- Affects Version/s: 2.4.0 2.4.1 2.4.2 2.4.3 2.4.4 2.4.5 2.4.6 2.4.7 > Self-join with script transformation failed to resolve attribute correctly > -- > > Key: SPARK-34634 > URL: https://issues.apache.org/jira/browse/SPARK-34634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, > 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Minor > > To reproduce, > {code:java} > // code placeholder > create temporary view t as select * from values 0, 1, 2 as t(a); > WITH temp AS ( > SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t > ) > SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b > {code} > > Spark will throw AnalysisException > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34639) always remove unnecessary Alias in Analyzer.resolveExpression
[ https://issues.apache.org/jira/browse/SPARK-34639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295901#comment-17295901 ] Apache Spark commented on SPARK-34639: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/31758 > always remove unnecessary Alias in Analyzer.resolveExpression > - > > Key: SPARK-34639 > URL: https://issues.apache.org/jira/browse/SPARK-34639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34639) always remove unnecessary Alias in Analyzer.resolveExpression
[ https://issues.apache.org/jira/browse/SPARK-34639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34639: Assignee: (was: Apache Spark) > always remove unnecessary Alias in Analyzer.resolveExpression > - > > Key: SPARK-34639 > URL: https://issues.apache.org/jira/browse/SPARK-34639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34639) always remove unnecessary Alias in Analyzer.resolveExpression
[ https://issues.apache.org/jira/browse/SPARK-34639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34639: Assignee: Apache Spark > always remove unnecessary Alias in Analyzer.resolveExpression > - > > Key: SPARK-34639 > URL: https://issues.apache.org/jira/browse/SPARK-34639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33602) Group exception messages in execution/datasources
[ https://issues.apache.org/jira/browse/SPARK-33602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295900#comment-17295900 ] Apache Spark commented on SPARK-33602: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/31757 > Group exception messages in execution/datasources > - > > Key: SPARK-33602 > URL: https://issues.apache.org/jira/browse/SPARK-33602 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major >
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources'
> || Filename || Count ||
> | DataSource.scala | 9 |
> | DataSourceStrategy.scala | 1 |
> | DataSourceUtils.scala | 2 |
> | FileFormat.scala | 1 |
> | FileFormatWriter.scala | 3 |
> | FileScanRDD.scala | 2 |
> | InsertIntoHadoopFsRelationCommand.scala | 2 |
> | PartitioningAwareFileIndex.scala | 1 |
> | PartitioningUtils.scala | 3 |
> | RecordReaderIterator.scala | 1 |
> | rules.scala | 4 |
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile'
> || Filename || Count ||
> | BinaryFileFormat.scala | 2 |
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc'
> || Filename || Count ||
> | JDBCOptions.scala | 2 |
> | JdbcUtils.scala | 6 |
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc'
> || Filename || Count ||
> | OrcDeserializer.scala | 1 |
> | OrcFilters.scala | 1 |
> | OrcSerializer.scala | 1 |
> | OrcUtils.scala | 2 |
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet'
> || Filename || Count ||
> | ParquetFileFormat.scala | 2 |
> | ParquetReadSupport.scala | 1 |
> | ParquetSchemaConverter.scala | 6 |
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34639) always remove unnecessary Alias in Analyzer.resolveExpression
Wenchen Fan created SPARK-34639: --- Summary: always remove unnecessary Alias in Analyzer.resolveExpression Key: SPARK-34639 URL: https://issues.apache.org/jira/browse/SPARK-34639 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29721) Spark SQL reads unnecessary nested fields after using explode
[ https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295859#comment-17295859 ] Jiri Humpolicek commented on SPARK-29721: - here it is [SPARK-34638|https://issues.apache.org/jira/browse/SPARK-34638] > Spark SQL reads unnecessary nested fields after using explode > - > > Key: SPARK-29721 > URL: https://issues.apache.org/jira/browse/SPARK-29721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kai Kang >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > This is a follow up for SPARK-4502. SPARK-4502 correctly addressed column > pruning for nested structures. However, when explode() is called on a nested > field, all columns for that nested structure is still fetched from data > source. > We are working on a project to create a parquet store for a big pre-joined > table between two tables that has one-to-many relationship, and this is a > blocking issue for us. > > The following code illustrates the issue. > Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>   "items": [
>     {"itemId": 1, "itemData": "a"},
>     {"itemId": 2, "itemData": "b"}
>   ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
> > Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)
Jiri Humpolicek created SPARK-34638: --- Summary: Spark SQL reads unnecessary nested fields (another type of pruning case) Key: SPARK-34638 URL: https://issues.apache.org/jira/browse/SPARK-34638 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.1 Reporter: Jiri Humpolicek Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721] I found another nested fields pruning case. Example: 1) Loading data
{code:scala}
val jsonStr = """{
  "items": [
    {"itemId": 1, "itemData": "a"},
    {"itemId": 2, "itemData": "b"}
  ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
2) read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select(explode($"items").as('item)).select($"item.itemId").explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org