[jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-05 Thread Pankaj Bhootra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296471#comment-17296471
 ] 

Pankaj Bhootra commented on SPARK-34648:


[~marmbrus] - I understand you were involved in discussions on SPARK-9347. Will 
you be able to help with this?

> Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
> 
>
> Key: SPARK-34648
> URL: https://issues.apache.org/jira/browse/SPARK-34648
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Pankaj Bhootra
>Priority: Major
>
> Hello Team,
> I am new to Spark, and this question may be a duplicate of the issue 
> highlighted here: https://issues.apache.org/jira/browse/SPARK-9347 
> We have a large dataset partitioned by calendar date, and within each date 
> partition we store the data as *parquet* files in 128 parts.
> We are trying to run an aggregation on this dataset for 366 dates at a time 
> with Spark SQL on Spark version 2.3.0, so our Spark job reads 366*128=46848 
> partitions, all of which are parquet files. There is currently no *_metadata* 
> or *_common_metadata* file available for this dataset.
> The problem we are facing is that when we run *spark.read.parquet* on the 
> above 46848 partitions, our data reads are extremely slow. Even a simple map 
> task (no shuffling, aggregation or group by) takes a long time to run.
> I read through the above issue and I think I generally understand the idea 
> behind the *_common_metadata* file. However, that issue was raised for Spark 
> 1.3.1, and I have not found any documentation about this metadata file for 
> Spark 2.3.0 so far.
> I would like to clarify:
>  # What is the current best practice for reading a large number of parquet 
> files efficiently?
>  # Does this involve using any additional options with spark.read.parquet? 
> How would that work?
>  # Are there other possible reasons for slow data reads apart from reading 
> metadata for every part? We are trying to migrate our existing Spark pipeline 
> from CSV files to parquet, but from my hands-on experience so far, parquet's 
> read time seems slower than CSV's. This contradicts the popular opinion that 
> parquet performs better in terms of both computation and storage.






[jira] [Created] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-05 Thread Pankaj Bhootra (Jira)
Pankaj Bhootra created SPARK-34648:
--

 Summary: Reading Parquet Files in Spark Extremely Slow for Large 
Number of Files?
 Key: SPARK-34648
 URL: https://issues.apache.org/jira/browse/SPARK-34648
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.3.0
Reporter: Pankaj Bhootra


Hello Team,

I am new to Spark, and this question may be a duplicate of the issue 
highlighted here: https://issues.apache.org/jira/browse/SPARK-9347 

We have a large dataset partitioned by calendar date, and within each date 
partition we store the data as *parquet* files in 128 parts.

We are trying to run an aggregation on this dataset for 366 dates at a time 
with Spark SQL on Spark version 2.3.0, so our Spark job reads 366*128=46848 
partitions, all of which are parquet files. There is currently no *_metadata* 
or *_common_metadata* file available for this dataset.

The problem we are facing is that when we run *spark.read.parquet* on the 
above 46848 partitions, our data reads are extremely slow. Even a simple map 
task (no shuffling, aggregation or group by) takes a long time to run.

I read through the above issue and I think I generally understand the idea 
behind the *_common_metadata* file. However, that issue was raised for Spark 
1.3.1, and I have not found any documentation about this metadata file for 
Spark 2.3.0 so far.

I would like to clarify:
 # What is the current best practice for reading a large number of parquet 
files efficiently?
 # Does this involve using any additional options with spark.read.parquet? How 
would that work? (A sketch follows this list.)
 # Are there other possible reasons for slow data reads apart from reading 
metadata for every part? We are trying to migrate our existing Spark pipeline 
from CSV files to parquet, but from my hands-on experience so far, parquet's 
read time seems slower than CSV's. This contradicts the popular opinion that 
parquet performs better in terms of both computation and storage.
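
For question 2, a minimal sketch (the option and configuration names are real 
Spark settings; the paths and values are illustrative assumptions) of reading a 
date-partitioned parquet dataset while skipping per-file schema merging:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-read-example").getOrCreate()

// Illustrative layout: s3://bucket/dataset/date=2018-01-01/part-*.parquet
// When the number of input paths exceeds
// spark.sql.sources.parallelPartitionDiscovery.threshold (default 32),
// Spark distributes the file listing across the cluster.
val df = spark.read
  .option("basePath", "s3://bucket/dataset")   // keep `date` as a partition column
  .option("mergeSchema", "false")              // do not merge schemas across all 46848 footers
  .parquet("s3://bucket/dataset/date=2018-*")  // or pass the 366 date paths explicitly

df.filter("some_column IS NOT NULL").count()
{code}

Whether this helps depends on where the time actually goes (file listing, 
footer reads, or task scheduling), which the driver logs and the SQL tab should 
show.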






[jira] [Assigned] (SPARK-34647) Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes

2021-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34647:


Assignee: Apache Spark

> Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes
> ---
>
> Key: SPARK-34647
> URL: https://issues.apache.org/jira/browse/SPARK-34647
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-34647) Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes

2021-03-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296443#comment-17296443
 ] 

Apache Spark commented on SPARK-34647:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31762

> Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes
> ---
>
> Key: SPARK-34647
> URL: https://issues.apache.org/jira/browse/SPARK-34647
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-34647) Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes

2021-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34647:


Assignee: (was: Apache Spark)

> Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes
> ---
>
> Key: SPARK-34647
> URL: https://issues.apache.org/jira/browse/SPARK-34647
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-34647) Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes

2021-03-05 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-34647:
-

 Summary: Upgrade ZSTD-JNI to 1.4.8-7 and use NoFinalizer classes
 Key: SPARK-34647
 URL: https://issues.apache.org/jira/browse/SPARK-34647
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun
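
As context, a minimal sketch (an assumption based on the zstd-jni streaming 
API, not the actual Spark patch) of what using the NoFinalizer classes looks 
like: {{ZstdInputStreamNoFinalizer}}/{{ZstdOutputStreamNoFinalizer}} skip the 
GC finalizer, so the caller must close them explicitly to release native 
memory.

{code:scala}
import java.io._
import com.github.luben.zstd.{ZstdInputStreamNoFinalizer, ZstdOutputStreamNoFinalizer}

// Round-trip a byte array through zstd using the NoFinalizer stream classes.
// Closing in a finally block replaces the safety net the finalizer provided.
def roundTrip(bytes: Array[Byte]): Array[Byte] = {
  val compressed = new ByteArrayOutputStream()
  val zout = new ZstdOutputStreamNoFinalizer(new BufferedOutputStream(compressed))
  try zout.write(bytes) finally zout.close()

  val zin = new ZstdInputStreamNoFinalizer(
    new BufferedInputStream(new ByteArrayInputStream(compressed.toByteArray)))
  try {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](8 * 1024)
    var n = zin.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = zin.read(buf) }
    out.toByteArray
  } finally zin.close()
}
{code}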









[jira] [Resolved] (SPARK-30303) Percentile will hit match err when percentage is null

2021-03-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-30303.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Percentile will hit match err when percentage is null
> -
>
> Key: SPARK-30303
> URL: https://issues.apache.org/jira/browse/SPARK-30303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> spark-sql> select percentile(1, null);
> 19/12/19 16:35:04 ERROR SparkSQLDriver: Failed in [select percentile(1, null)]
> scala.MatchError: null
> spark-sql> select percentile(1, array(0.0, null)); -- null is illegal but 
> treated as zero
> {code}
>  
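
A minimal sketch (an illustration of the failure mode, not the actual 
Percentile implementation) of why a null percentage surfaces as 
{{scala.MatchError: null}} — a pattern match over numeric types with no null 
branch:

{code:scala}
// Illustration only: matching a boxed null against numeric cases throws
// scala.MatchError, which is what happens when the percentage expression
// evaluates to null and the match has no null branch.
def toDoubleValue(d: Any): Double = d match {
  case bd: java.math.BigDecimal => bd.doubleValue()
  case n: Number                => n.doubleValue()
}

// toDoubleValue(0.5)   // 0.5
// toDoubleValue(null)  // throws scala.MatchError: null
{code}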






[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark

2021-03-05 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-33539:
-
Description: 
In the SPIP: Standardize Exception Messages in Spark, there are three major 
improvements proposed:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This can help with auditing error messages and subsequent 
tasks to establish a guideline and improve message quality in the future. 

A general rule of thumb for grouping exceptions: 
 * AnalysisException => QueryCompilationErrors
 * SparkException, RuntimeException(UnsupportedOperationException, 
IllegalStateException...) => QueryExecutionErrors
 * ParseException => QueryParsingErrors

(A special case for {{Command}}: since commands are executed eagerly, users can 
immediately see the errors even if the exceptions are thrown in 
{{SparkPlan.execute}}. In this case, we should group the exceptions in commands 
as QueryCompilationErrors.)

Here is an example PR that groups all `AnalysisException`s thrown in the Analyzer into 
QueryCompilationErrors: [SPARK-32670|https://github.com/apache/spark/pull/29497]

Please see the 
[SPIP|https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.

Exceptions per component:
||component||exception||
|sql|714|
|core|334|
|mllib|161|
|streaming|43|
|resource-managers|42|
|external|35|
|examples|20|
|mllib-local|10|
|graphx|6|
|repl|5|
|hadoop-cloud|1|

  was:
In the SPIP: Standardize Exception Messages in Spark, there are three major 
improvements proposed:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This can help with auditing error messages and subsequent 
tasks to establish a guideline and improve message quality in the future. 

A general rule of thumb for grouping exceptions: 
 * AnalysisException => QueryCompilationErrors
 * SparkException, RuntimeException(UnsupportedOperationException, 
IllegalStateException...) => QueryExecutionErrors
 * ParseException => QueryParsingErrors

(A special case for `Command`: since commands are executed eagerly, users can 
immediately see the errors even if the exceptions are thrown in 
`SparkPlan.execute`. In this case, we should group the exceptions in commands 
as QueryCompilationErrors.)

Here is an example PR that groups all `AnalysisException`s thrown in the Analyzer into 
QueryCompilationErrors: [SPARK-32670|https://github.com/apache/spark/pull/29497]

Please see the 
[SPIP|https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.

Exceptions per component:
||component||exception||
|sql|714|
|core|334|
|mllib|161|
|streaming|43|
|resource-managers|42|
|external|35|
|examples|20|
|mllib-local|10|
|graphx|6|
|repl|5|
|hadoop-cloud|1|


> Standardize exception messages in Spark
> ---
>
> Key: SPARK-33539
> URL: https://issues.apache.org/jira/browse/SPARK-33539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Priority: Major
>
> In the SPIP: Standardize Exception Messages in Spark, there are three major 
> improvements proposed:
>  # Group error messages in dedicated files.
>  # Establish an error message guideline for developers.
>  # Improve error message quality.
> The first step is to centralize error messages for each component into its 
> own dedicated file(s). This can help with auditing error messages and 
> subsequent tasks to establish a guideline and improve message quality in the 
> future. 
> A general rule of thumb for grouping exceptions: 
>  * AnalysisException => QueryCompilationErrors
>  * SparkException, RuntimeException(UnsupportedOperationException, 
> IllegalStateException...) => QueryExecutionErrors
>  * ParseException => QueryParsingErrors
> (A special case for {{Command}}: since commands are executed eagerly, users 
> can immediately see the errors even if the exceptions are thrown in 
> {{SparkPlan.execute}}. In this case, we should group the exceptions in 
> commands as QueryCompilationErrors.)
> Here is an example PR that groups all `AnalysisException`s thrown in the 
> Analyzer into QueryCompilationErrors: 
> [SPARK-32670|https://github.com/apache/spark/pull/29497]
> Please see the 
> [SPIP|https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
>  for more details.
> Exceptions per component:
> ||component||exception||
> |sql|714|
> |core|334|
> |mllib|161|
> 

[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark

2021-03-05 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-33539:
-
Description: 
In the SPIP: Standardize Exception Messages in Spark, there are three major 
improvements proposed:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This can help with auditing error messages and subsequent 
tasks to establish a guideline and improve message quality in the future. 

A general rule of thumb for grouping exceptions: 
 * AnalysisException => QueryCompilationErrors
 * SparkException, RuntimeException(UnsupportedOperationException, 
IllegalStateException...) => QueryExecutionErrors
 * ParseException => QueryParsingErrors

(A special case for `Command`: since commands are executed eagerly, users can 
immediately see the errors even if the exceptions are thrown in 
`SparkPlan.execute`. In this case, we should group the exceptions in commands 
as QueryCompilationErrors.)

Here is an example PR that groups all `AnalysisException`s thrown in the Analyzer into 
QueryCompilationErrors: [SPARK-32670|https://github.com/apache/spark/pull/29497]

Please see the 
[SPIP|https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.

Exceptions per component:
||component||exception||
|sql|714|
|core|334|
|mllib|161|
|streaming|43|
|resource-managers|42|
|external|35|
|examples|20|
|mllib-local|10|
|graphx|6|
|repl|5|
|hadoop-cloud|1|

  was:
In the SPIP: Standardize Exception Messages in Spark, there are three major 
improvements proposed:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This can help with auditing error messages and subsequent 
tasks to establish a guideline and improve message quality in the future. 

A general rule of thumb for grouping exceptions:
 * AnalysisException => QueryCompilationErrors
 * SparkException, RuntimeException(UnsupportedOperationException, 
IllegalStateException...) => QueryExecutionErrors
 * ParseException => QueryParsingErrors

Here is an example PR that groups all `AnalysisException`s thrown in the Analyzer into 
QueryCompilationErrors: [SPARK-32670|https://github.com/apache/spark/pull/29497]

Please see the 
[SPIP|https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.

Exceptions per component:
||component||exception||
|sql|714|
|core|334|
|mllib|161|
|streaming|43|
|resource-managers|42|
|external|35|
|examples|20|
|mllib-local|10|
|graphx|6|
|repl|5|
|hadoop-cloud|1|


> Standardize exception messages in Spark
> ---
>
> Key: SPARK-33539
> URL: https://issues.apache.org/jira/browse/SPARK-33539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Priority: Major
>
> In the SPIP: Standardize Exception Messages in Spark, there are three major 
> improvements proposed:
>  # Group error messages in dedicated files.
>  # Establish an error message guideline for developers.
>  # Improve error message quality.
> The first step is to centralize error messages for each component into its 
> own dedicated file(s). This can help with auditing error messages and 
> subsequent tasks to establish a guideline and improve message quality in the 
> future. 
> A general rule of thumb for grouping exceptions: 
>  * AnalysisException => QueryCompilationErrors
>  * SparkException, RuntimeException(UnsupportedOperationException, 
> IllegalStateException...) => QueryExecutionErrors
>  * ParseException => QueryParsingErrors
> (A special case for `Command`: since commands are executed eagerly, users can 
> immediately see the errors even if the exceptions are thrown in 
> `SparkPlan.execute`. In this case, we should group the exceptions in commands 
> as QueryCompilationErrors.)
> Here is an example PR that groups all `AnalysisException`s thrown in the 
> Analyzer into QueryCompilationErrors: 
> [SPARK-32670|https://github.com/apache/spark/pull/29497]
> Please see the 
> [SPIP|https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
>  for more details.
> Exceptions per component:
> ||component||exception||
> |sql|714|
> |core|334|
> |mllib|161|
> |streaming|43|
> |resource-managers|42|
> |external|35|
> |examples|20|
> |mllib-local|10|
> |graphx|6|
> |repl|5|
> |hadoop-cloud|1|
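
A minimal sketch (hypothetical object and message; the real methods live in 
Spark's own error-class objects) of the grouping pattern the SPIP describes — 
the Analyzer calls a named error method instead of constructing the 
{{AnalysisException}} inline:

{code:scala}
// Sketch placed in Spark's own namespace, since AnalysisException's
// constructor is not public outside the sql package.
package org.apache.spark.sql.errors

import org.apache.spark.sql.AnalysisException

// Hypothetical, simplified version of the pattern: compilation-time error
// messages are defined in one object so they can be audited together.
object QueryCompilationErrorsSketch {
  def unresolvedColumnError(name: String, candidates: Seq[String]): AnalysisException =
    new AnalysisException(
      s"Cannot resolve column '$name'. Candidates: ${candidates.mkString(", ")}")
}

// Call site (e.g. inside the Analyzer):
// throw QueryCompilationErrorsSketch.unresolvedColumnError("col_a", Seq("colA", "colB"))
{code}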




[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296393#comment-17296393
 ] 

Dongjoon Hyun commented on SPARK-34645:
---

Thank you for the investigation. I'm looking forward to testing with it.

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system shutdown complete.
>  {code}
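
As a point of reference, a minimal sketch (hypothetical names; an assumption 
about the reporter's setup, not their actual benchmark code) of a cluster-mode 
driver that calls {{spark.stop()}} explicitly instead of relying only on the 
shutdown hook, which can help narrow down whether the 3.0.2 hang happens before 
or inside the shutdown path:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical benchmark driver: submitted with spark-submit in cluster mode,
// so main() runs inside the driver pod.
object BenchmarkRunner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("benchmark-runner").getOrCreate()
    try {
      // ... benchmark queries would run here ...
      spark.range(0L, 1000000L).selectExpr("sum(id)").show()
    } finally {
      // Stopping explicitly (rather than waiting for the shutdown hook) makes
      // it easier to see whether SparkContext.stop() itself completes on 3.0.2.
      spark.stop()
    }
  }
}
{code}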






[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296391#comment-17296391
 ] 

Andy Grove commented on SPARK-34645:


I can reproduce this with Spark 3.1.1 + JDK 11 as well (and it works fine with 
JDK 8).

I will see if I can make a repro case.

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system shutdown complete.
>  {code}






[jira] [Resolved] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service

2021-03-05 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang resolved SPARK-34601.

Resolution: Won't Fix

> Do not delete shuffle file on executor lost event when using remote shuffle 
> service
> ---
>
> Key: SPARK-34601
> URL: https://issues.apache.org/jira/browse/SPARK-34601
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: BoYang
>Priority: Major
>  Labels: shuffle
>
> There is ongoing work on multiple disaggregated/remote shuffle services 
> (e.g. [LinkedIn 
> shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
> [Facebook shuffle 
> service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
>  [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a 
> remote shuffle service is not the Spark External Shuffle Service; it could be 
> a third-party shuffle solution that users enable by setting 
> spark.shuffle.manager. In those systems, shuffle data is stored on servers 
> other than the executors, so Spark should not mark shuffle data as lost when 
> an executor is lost. We could add a Spark configuration to control this 
> behavior: by default Spark would still mark shuffle files as lost, and for 
> disaggregated/remote shuffle services users could set the configuration to 
> not mark them as lost.
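
A minimal sketch (the class name is hypothetical; {{spark.shuffle.manager}} 
itself is a real Spark setting) of how such a third-party remote shuffle 
implementation is plugged in:

{code:scala}
import org.apache.spark.sql.SparkSession

// "org.example.shuffle.RemoteShuffleManager" stands in for any third-party
// implementation of org.apache.spark.shuffle.ShuffleManager.
val spark = SparkSession.builder()
  .appName("remote-shuffle-example")
  .config("spark.shuffle.manager", "org.example.shuffle.RemoteShuffleManager")
  .getOrCreate()
{code}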






[jira] [Commented] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service

2021-03-05 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296390#comment-17296390
 ] 

BoYang commented on SPARK-34601:


It looks like Spark already checks and matches the executor id when it tries to 
remove the map output. I will close this ticket.

> Do not delete shuffle file on executor lost event when using remote shuffle 
> service
> ---
>
> Key: SPARK-34601
> URL: https://issues.apache.org/jira/browse/SPARK-34601
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: BoYang
>Priority: Major
>  Labels: shuffle
>
> There is ongoing work on multiple disaggregated/remote shuffle services 
> (e.g. [LinkedIn 
> shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
> [Facebook shuffle 
> service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
>  [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a 
> remote shuffle service is not the Spark External Shuffle Service; it could be 
> a third-party shuffle solution that users enable by setting 
> spark.shuffle.manager. In those systems, shuffle data is stored on servers 
> other than the executors, so Spark should not mark shuffle data as lost when 
> an executor is lost. We could add a Spark configuration to control this 
> behavior: by default Spark would still mark shuffle files as lost, and for 
> disaggregated/remote shuffle services users could set the configuration to 
> not mark them as lost.






[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296385#comment-17296385
 ] 

Dongjoon Hyun commented on SPARK-34645:
---

BTW, since Apache Spark 3.0.2 was released before Apache Spark 3.1.1, Apache 
Spark 3.1.1 contains all the changes in Apache Spark 3.0.2.

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system shutdown complete.
>  {code}






[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296374#comment-17296374
 ] 

Dongjoon Hyun commented on SPARK-34645:
---

Could you provide a reproducer for this?

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system shutdown complete.
>  {code}






[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296370#comment-17296370
 ] 

Dongjoon Hyun commented on SPARK-34645:
---

The result above and my local K8s IT (integration test) runs are using JDK 11.
{code:java}
$ docker run -it --rm $ECR:3.0.2 java --version | tail -n3
openjdk 11.0.10 2021-01-19
OpenJDK Runtime Environment 18.9 (build 11.0.10+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.10+9, mixed mode, sharing) {code}

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system shutdown complete.
>  {code}






[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296368#comment-17296368
 ] 

Dongjoon Hyun commented on SPARK-34645:
---

First of all, Spark 3.0.1 and 3.0.2 ship the same K8s client library, so I 
guess we can exclude a K8s server/client version compatibility issue.
{code:java}
spark-release:$ ls -al spark-3.0.2-bin-hadoop3.2/jars/kubernetes-*
-rw-r--r--  1 dongjoon  staff775174 Feb 15 22:32 
spark-3.0.2-bin-hadoop3.2/jars/kubernetes-client-4.9.2.jar
-rw-r--r--  1 dongjoon  staff  11908731 Feb 15 22:32 
spark-3.0.2-bin-hadoop3.2/jars/kubernetes-model-4.9.2.jar
-rw-r--r--  1 dongjoon  staff  3954 Feb 15 22:32 
spark-3.0.2-bin-hadoop3.2/jars/kubernetes-model-common-4.9.2.jar
spark-release:$ ls -al spark-3.0.1-bin-hadoop3.2/jars/kubernetes-*
-rw-r--r--  1 dongjoon  staff775174 Aug 28  2020 
spark-3.0.1-bin-hadoop3.2/jars/kubernetes-client-4.9.2.jar
-rw-r--r--  1 dongjoon  staff  11908731 Aug 28  2020 
spark-3.0.1-bin-hadoop3.2/jars/kubernetes-model-4.9.2.jar
-rw-r--r--  1 dongjoon  staff  3954 Aug 28  2020 
spark-3.0.1-bin-hadoop3.2/jars/kubernetes-model-common-4.9.2.jar {code}
The following is the output I get from Apache Spark 3.0.2 on EKS v1.18 as of 
today, and it works for me. Also, Apache Spark 3.0.2 was tested on EKS and 
Minikube during the RC period, and the K8s IT coverage on `branch-3.0` has been 
healthy.
{code:java}
Pi is roughly 3.141517357075868
21/03/05 22:17:19 INFO SparkUI: Stopped Spark web UI at 
http://pi-81b8dc7804756ea4-driver-svc.default.svc:4040
21/03/05 22:17:19 INFO KubernetesClusterSchedulerBackend: Shutting down all 
executors
21/03/05 22:17:19 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
executor to shut down
21/03/05 22:17:19 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
been closed (this is expected if the application is shutting down.)
21/03/05 22:17:20 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
21/03/05 22:17:20 INFO MemoryStore: MemoryStore cleared
21/03/05 22:17:20 INFO BlockManager: BlockManager stopped
21/03/05 22:17:20 INFO BlockManagerMaster: BlockManagerMaster stopped
21/03/05 22:17:20 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
21/03/05 22:17:20 INFO SparkContext: Successfully stopped SparkContext
21/03/05 22:17:20 INFO ShutdownHookManager: Shutdown hook called
21/03/05 22:17:20 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-053d86cb-d3ed-4823-986b-f9dd65dcce6a
21/03/05 22:17:20 INFO ShutdownHookManager: Deleting directory 
/var/data/spark-eb14ad59-0e8b-431f-a138-0d5138376194/spark-e5844bae-114c-402d-83ba-c9b41dda0bbf
 {code}
 

So, could you describe your situation in more detail, [~andygrove] and [~tgraves]?

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking 

[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296358#comment-17296358
 ] 

Andy Grove commented on SPARK-34645:


I just confirmed that this only happens with JDK 11 and works fine with JDK 8.

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system shutdown complete.
>  {code}






[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296356#comment-17296356
 ] 

Dongjoon Hyun commented on SPARK-34645:
---

So, it happens only with Apache Spark 3.0.2? Let me take a look, [~tgraves]. 
Thank you for reporting, [~andygrove].

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system shutdown complete.
>  {code}






[jira] [Updated] (SPARK-34646) TreeNode bind issue for duplicate column name.

2021-03-05 Thread loc nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loc nguyen updated SPARK-34646:
---
Description: 
I received a Spark {{TreeNodeException}} while executing a union of two 
DataFrames. The error occurs when I assign the union result to a DataFrame that 
is returned by a function; however, I can assign the union result to a 
DataFrame that is not returned without hitting the error. I have examined the 
schemas of all the DataFrames involved: the PT_Id column is duplicated, and 
that duplication results in the failed attribute binding. The stack trace 
follows.

 
  
 {{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 (TID 
5557)
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: PT_ID#140575 at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190)
 at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
 at 
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
 at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.RuntimeException: 
Couldn't find PT_Id#140575 in 
[Name#34180,PT_Id#34181,PT_Id#127]
 at 
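As a hedged illustration (not the reporter's code, and with hypothetical table 
and column names), the sketch below shows how a join can leave two PT_Id 
attributes in a schema and one way to disambiguate them before a union:
{code:java}
import org.apache.spark.sql.SparkSession

object DuplicatePtIdSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").getOrCreate()
    import spark.implicits._

    val patients = Seq(("Alice", 1), ("Bob", 2)).toDF("Name", "PT_Id")
    val visits   = Seq((1, "2021-03-01"), (2, "2021-03-02")).toDF("PT_Id", "VisitDate")

    // Joining without dropping one side keeps both PT_Id attributes, so the
    // result schema contains PT_Id twice, each with its own expression ID.
    val joined = patients.join(visits, patients("PT_Id") === visits("PT_Id"))
    joined.printSchema() // Name, PT_Id, PT_Id, VisitDate

    // Keep exactly one PT_Id per frame before any union so later references
    // to PT_Id bind to a single attribute.
    val deduped = joined.select(patients("Name"), patients("PT_Id"), visits("VisitDate"))
    val more    = Seq(("Carol", 3, "2021-03-03")).toDF("Name", "PT_Id", "VisitDate")

    deduped.union(more).show()
    spark.stop()
  }
}
{code}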

[jira] [Updated] (SPARK-34646) TreeNode bind issue for duplicate column name.

2021-03-05 Thread loc nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loc nguyen updated SPARK-34646:
---
Description: 
I received a Spark {{TreeNodeException}} executing a union of two data frames. 
When I assign the union results to a DataFrame that will be returned by a 
function, this error occurs. However, I am able to assign the union results to 
a DataFrame that will not be returned. I have examined schema for all the data 
frames participating in the code. The PT_Id is being duplicated. The PT_Id is 
duplicated and results in the failed search.

 
  
 {{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 (TID 
5557)
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: PT_ID#140575 at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190)
 at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
 at 
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
 at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.RuntimeException: 
Couldn't find PT_Id#140575 in 
[Name#34180,PT_Id#34181,PT_Id#127]
 at 

[jira] [Updated] (SPARK-34646) TreeNode bind issue for duplicate column name.

2021-03-05 Thread loc nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loc nguyen updated SPARK-34646:
---
Description: 
I received a Spark {{TreeNodeException}} executing a union of two data frames. 
When I assign the union results to a DataFrame that will be returned from by a 
function, this error occurs. However, I am able to assign the union results to 
a DataFrame that will not be returned. I have examined schema for all the data 
frames participating in the code. The PT_Id is being duplicated. The PT_Id is 
duplicated and results in the failed search.

 
  
 {{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 (TID 
5557)
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: PT_ID#140575 at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190)
 at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
 at 
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
 at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.RuntimeException: 
Couldn't find PT_Id#140575 in 
[Name#34180,PT_Id#34181,PT_Id#127]
 at 

[jira] [Updated] (SPARK-34646) TreeNode bind issue for duplicate column name.

2021-03-05 Thread loc nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loc nguyen updated SPARK-34646:
---
Summary: TreeNode bind issue for duplicate column name.  (was: TreeNode 
bind issue)

> TreeNode bind issue for duplicate column name.
> --
>
> Key: SPARK-34646
> URL: https://issues.apache.org/jira/browse/SPARK-34646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3, Scala 2.11.8, Hadoop 3.2.1
>Reporter: loc nguyen
>Priority: Major
>  Labels: spark
>
> I received a Spark {{TreeNodeException}} while executing a union of two data 
> frames. The error occurs when I assign the union result to a DataFrame that 
> is returned by a function; however, I can assign the union result to a 
> DataFrame that is not returned without error. I have examined the schema of 
> every data frame involved: the PT_Id column is duplicated, and the 
> duplication results in the failed attribute lookup.
>  
>   
>  {{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 
> (TID 5557)
>  org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: DP_Acct_Identifier#140575 at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190)
>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>  at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>  at 

[jira] [Updated] (SPARK-34646) TreeNode bind issue

2021-03-05 Thread loc nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loc nguyen updated SPARK-34646:
---
Description: 
I received a Spark {{TreeNodeException}} executing a union of two data frames. 
When I assign the union results to a DataFrame that will be returned from by a 
function, this error occurs. However, I am able to assign the union results to 
a DataFrame that will not be returned. I have examined schema for all the data 
frames participating in the code. The PT_Id is being duplicated. The PT_Id is 
duplicated and results in the failed search.

 
  
 {{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 (TID 
5557)
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: DP_Acct_Identifier#140575 at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190)
 at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
 at 
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
 at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.RuntimeException: 
Couldn't find PT_Id#140575 in 

[jira] [Updated] (SPARK-34646) TreeNode bind issue

2021-03-05 Thread loc nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loc nguyen updated SPARK-34646:
---
Description: 
I received a Spark {{TreeNodeException}} while executing a union of two data 
frames. The error occurs when I assign the union result to a DataFrame that is 
returned by a function; however, I can assign the union result to a DataFrame 
that is not returned without error. I have examined the schema of every data 
frame involved: the PT_Id column is duplicated, and I am not sure why it is 
duplicated and causes the failed attribute lookup.

 
  
 {{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 (TID 
5557)
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: DP_Acct_Identifier#140575 at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245)
 at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190)
 at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
 at 
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
 at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191)
 at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
 at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.RuntimeException: 
Couldn't find PT_Id#140575 in 

[jira] [Created] (SPARK-34646) TreeNode bind issue

2021-03-05 Thread loc nguyen (Jira)
loc nguyen created SPARK-34646:
--

 Summary: TreeNode bind issue
 Key: SPARK-34646
 URL: https://issues.apache.org/jira/browse/SPARK-34646
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.3
 Environment: Spark 2.4.3, Scala 2.11.8, Hadoop 3.2.1
Reporter: loc nguyen


I received a Spark {{TreeNodeException}} while executing a union of two data 
frames. The error occurs when I assign the union result to a DataFrame that is 
returned by a function; however, I can assign the union result to a DataFrame 
that is not returned without error. I have examined the schema of every data 
frame involved: the PT_Id column is duplicated, and I am not sure why it is 
duplicated and causes the failed attribute lookup.

 
 
{{21/03/04 19:58:28 ERROR Executor: Exception in task 2.0 in stage 2281.0 (TID 
5557)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: DP_Acct_Identifier#140575at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:79)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:78)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:245)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:78)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1190)
at 
org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:403)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
at 
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 

[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296340#comment-17296340
 ] 

Thomas Graves commented on SPARK-34645:
---

[~dongjoon] [~holden], have you seen this at all?

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1; it does 
> not appear with 3.0.2, where there is no output at all after the job 
> completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system shutdown complete.
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-03-05 Thread Andy Grove (Jira)
Andy Grove created SPARK-34645:
--

 Summary: [K8S] Driver pod stuck in Running state after job 
completes
 Key: SPARK-34645
 URL: https://issues.apache.org/jira/browse/SPARK-34645
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.2
 Environment: Kubernetes:


{code:java}
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
Platform:"linux/amd64"}
 {code}
Reporter: Andy Grove


I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
so the driver runs in a pod.

When running with Spark 3.0.1 and 3.1.1 everything works as expected and I see 
the Spark context being shut down after the job completes.

However, when running with Spark 3.0.2 I do not see the context get shut down 
and the driver pod is stuck in the Running state indefinitely.

This is the output I see after job completion with 3.0.1 and 3.1.1; it does 
not appear with 3.0.2, where there is no output at all after the job completes.
{code:java}
2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from shutdown 
hook
2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
down all executors
2021-03-05 20:09:24,600 INFO 
k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
executor to shut down
2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
client has been closed (this is expected if the application is shutting down.)
2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
stopped
2021-03-05 20:09:24,752 INFO 
scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
SparkContext
2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
/var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
/tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
metrics system...
2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
system stopped.
2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
system shutdown complete.
 {code}
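As a hedged workaround sketch rather than a fix for the regression itself: 
stopping the session explicitly at the end of the application, instead of 
relying only on the shutdown hook, makes the teardown shown in the log above 
happen deterministically (the object and app names below are placeholders):
{code:java}
import org.apache.spark.sql.SparkSession

object BenchmarkRunner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("benchmark-runner") // placeholder name
      .getOrCreate()

    try {
      // ... run the benchmark queries here ...
    } finally {
      // Stop the SparkContext explicitly so the driver can exit cleanly
      // instead of depending on the JVM shutdown hook alone.
      spark.stop()
    }
  }
}
{code}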



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude

2021-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34295:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
> 
>
> Key: SPARK-34295
> URL: https://issues.apache.org/jira/browse/SPARK-34295
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> MapReduce jobs can instruct YARN to skip renewal of tokens obtained from 
> certain hosts by listing those hosts in the configuration 
> mapreduce.job.hdfs-servers.token-renewal.exclude=host1,host2,...
> Spark seems to lack a similar option, so job submission fails if YARN cannot 
> renew the DelegationToken for any of the remote HDFS clusters. DT renewal can 
> fail for many reasons, for example when the remote HDFS does not trust the 
> Kerberos identity of YARN.
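For reference, a minimal sketch of how the MapReduce-side exclusion is 
expressed, together with what an analogous Spark setting might look like; the 
Spark property name below is purely a placeholder, since per this ticket no 
such option exists yet:
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

object TokenRenewalExcludeSketch {
  def main(args: Array[String]): Unit = {
    // MapReduce: tell YARN not to renew tokens obtained from these namenodes
    // (hypothetical host names).
    val hadoopConf = new Configuration()
    hadoopConf.set(
      "mapreduce.job.hdfs-servers.token-renewal.exclude",
      "nn-remote-1.example.com,nn-remote-2.example.com")

    // Spark: a comparable knob would be set through SparkConf or --conf.
    // The property name here is a placeholder for whatever this ticket adds.
    val sparkConf = new SparkConf()
      .set("spark.kerberos.renewal.excludeHadoopFileSystems",
           "hdfs://nn-remote-1.example.com:8020")
    println(sparkConf.get("spark.kerberos.renewal.excludeHadoopFileSystems"))
  }
}
{code}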



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude

2021-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34295:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
> 
>
> Key: SPARK-34295
> URL: https://issues.apache.org/jira/browse/SPARK-34295
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> MapReduce jobs can instruct YARN to skip renewal of tokens obtained from 
> certain hosts by listing those hosts in the configuration 
> mapreduce.job.hdfs-servers.token-renewal.exclude=host1,host2,...
> Spark seems to lack a similar option, so job submission fails if YARN cannot 
> renew the DelegationToken for any of the remote HDFS clusters. DT renewal can 
> fail for many reasons, for example when the remote HDFS does not trust the 
> Kerberos identity of YARN.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude

2021-03-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296324#comment-17296324
 ] 

Apache Spark commented on SPARK-34295:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31761

> Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
> 
>
> Key: SPARK-34295
> URL: https://issues.apache.org/jira/browse/SPARK-34295
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> MapReduce jobs can instruct YARN to skip renewal of tokens obtained from 
> certain hosts by listing those hosts in the configuration 
> mapreduce.job.hdfs-servers.token-renewal.exclude=host1,host2,...
> Spark seems to lack a similar option, so job submission fails if YARN cannot 
> renew the DelegationToken for any of the remote HDFS clusters. DT renewal can 
> fail for many reasons, for example when the remote HDFS does not trust the 
> Kerberos identity of YARN.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude

2021-03-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296327#comment-17296327
 ] 

Apache Spark commented on SPARK-34295:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31761

> Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
> 
>
> Key: SPARK-34295
> URL: https://issues.apache.org/jira/browse/SPARK-34295
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> MapReduce jobs can instruct YARN to skip renewal of tokens obtained from 
> certain hosts by listing those hosts in the configuration 
> mapreduce.job.hdfs-servers.token-renewal.exclude=host1,host2,...
> Spark seems to lack a similar option, so job submission fails if YARN cannot 
> renew the DelegationToken for any of the remote HDFS clusters. DT renewal can 
> fail for many reasons, for example when the remote HDFS does not trust the 
> Kerberos identity of YARN.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34644) UDF returning array followed by explode returns wrong results

2021-03-05 Thread Gavrilescu Laurentiu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gavrilescu Laurentiu updated SPARK-34644:
-
Description: 
*Applying a UDF followed by explode appears to call the UDF twice.*

Imagine the following scenario:

1. you have a dataframe with some string columns
2. you have an expensive function that computes a score from a string input
3. you want to get all the distinct values from all the columns together with 
their scores; an executor-level cache holds already-computed scores to 
minimize calls to the expensive function

The following code reproduces the issue:
{code:java}
import scala.collection.concurrent.TrieMap
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, explode, struct, udf}

case class RowWithStrings(c1: String, c2: String, c3: String)
case class ValueScore(value: String, score: Double)

object Bug {

  val columns: List[String] = List("c1", "c2", "c3")

  def score(input: String): Double = {
// insert expensive function here
input.toDouble
  }

  def main(args: Array[String]) {
lazy val sparkSession: SparkSession = {
  val sparkSession = SparkSession.builder.master("local[4]")
.getOrCreate()

  sparkSession
}

// some cache over expensive operation
val cache: TrieMap[String, Double] = TrieMap[String, Double]()

// get scores for all columns in the row
val body = (row: Row) => {
  val arr = ArrayBuffer[ValueScore]()
  columns foreach {
column =>
  val value = row.getAs[String](column)
  if (!cache.contains(value)) {
val computedScore = score(value)
arr += ValueScore(value, computedScore)
cache(value) = computedScore
  }
  }
  arr
}

val basicUdf = udf(body)

val values = (1 to 5) map {
  idx =>
// repeated values
RowWithStrings(idx.toString, idx.toString, idx.toString)
}

import sparkSession.implicits._
val df = values.toDF("c1", "c2", "c3").persist()
val allCols = df.columns.map(col)

df.select(basicUdf(struct(allCols: _*)).as("valuesScore"))
  .select(explode(col("valuesScore")))
  .distinct()
  .show()
  }
}
{code}
This shows:
{code:java}
+---+
|col|
+---+
+---+
{code}
When adding persist before explode, the result is correct:
{code:java}
df.select(basicUdf(struct(allCols: _*)).as("valuesScore"))
  .persist()
  .select(explode(col("valuesScore")))
  .distinct()
  .show()
{code}
=>
{code:java}
++
| col|
++
|{2, 2.0}|
|{4, 4.0}|
|{3, 3.0}|
|{5, 5.0}|
|{1, 1.0}|
++
{code}
This is not reproducible with version 3.0.2.

  was:
*Applying a UDF followed by explode appears to call the UDF twice.*

Imagine the following scenario:

1. you have a dataframe with some string columns
2. you have an expensive function that computes a score from a string input
3. you want to get all the distinct values from all the columns together with 
their scores; an executor-level cache holds already-computed scores to 
minimize calls to the expensive function

The following code reproduces the issue:
{code:java}
case class RowWithStrings(c1: String, c2: String, c3: String)
case class ValueScore(value: String, score: Double)

object Bug {

  val columns: List[String] = List("c1", "c2", "c3")

  def score(input: String): Double = {
// insert expensive function here
input.toDouble
  }

  def main(args: Array[String]) {
lazy val sparkSession: SparkSession = {
  val sparkSession = SparkSession.builder.master("local[4]")
.getOrCreate()

  sparkSession
}

// some cache over expensive operation
val cache: TrieMap[String, Double] = TrieMap[String, Double]()

// get scores for all columns in the row
val body = (row: Row) => {
  val arr = ArrayBuffer[ValueScore]()
  columns foreach {
column =>
  val value = row.getAs[String](column)
  if (!cache.contains(value)) {
val computedScore = score(value)
arr += ValueScore(value, computedScore)
cache(value) = computedScore
  }
  }
  arr
}

val basicUdf = udf(body)

val values = (1 to 5) map {
  idx =>
// repeated values
RowWithStrings(idx.toString, idx.toString, idx.toString)
}

import sparkSession.implicits._
val df = values.toDF("c1", "c2", "c3").persist()
val allCols = df.columns.map(col)

df.select(basicUdf(struct(allCols: _*)).as("valuesScore"))
  .select(explode(col("valuesScore")))
  .distinct()
  .show()
  }
}
{code}
 this shows:
{code:java}
+---+
|col|
+---+
+---+
{code}
When adding persist before explode, the result is correct:
{code:java}
df.select(basicUdf(struct(allCols: _*)).as("valuesScore"))
  .persist()
  .select(explode(col("valuesScore")))
  .distinct()
  .show()
{code}
=>
{code:java}
++
| col|
++
|{2, 2.0}|
|{4, 4.0}|
|{3, 3.0}|
|{5, 5.0}|

[jira] [Created] (SPARK-34644) UDF returning array followed by explode returns wrong results

2021-03-05 Thread Gavrilescu Laurentiu (Jira)
Gavrilescu Laurentiu created SPARK-34644:


 Summary: UDF returning array followed by explode returns wrong 
results
 Key: SPARK-34644
 URL: https://issues.apache.org/jira/browse/SPARK-34644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Gavrilescu Laurentiu


*Applying a UDF followed by explode appears to call the UDF twice.*

Imagine the following scenario:

1. you have a dataframe with some string columns
2. you have an expensive function that computes a score from a string input
3. you want to get all the distinct values from all the columns together with 
their scores; an executor-level cache holds already-computed scores to 
minimize calls to the expensive function

The following code reproduces the issue:
{code:java}
import scala.collection.concurrent.TrieMap
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, explode, struct, udf}

case class RowWithStrings(c1: String, c2: String, c3: String)
case class ValueScore(value: String, score: Double)

object Bug {

  val columns: List[String] = List("c1", "c2", "c3")

  def score(input: String): Double = {
// insert expensive function here
input.toDouble
  }

  def main(args: Array[String]) {
lazy val sparkSession: SparkSession = {
  val sparkSession = SparkSession.builder.master("local[4]")
.getOrCreate()

  sparkSession
}

// some cache over expensive operation
val cache: TrieMap[String, Double] = TrieMap[String, Double]()

// get scores for all columns in the row
val body = (row: Row) => {
  val arr = ArrayBuffer[ValueScore]()
  columns foreach {
column =>
  val value = row.getAs[String](column)
  if (!cache.contains(value)) {
val computedScore = score(value)
arr += ValueScore(value, computedScore)
cache(value) = computedScore
  }
  }
  arr
}

val basicUdf = udf(body)

val values = (1 to 5) map {
  idx =>
// repeated values
RowWithStrings(idx.toString, idx.toString, idx.toString)
}

import sparkSession.implicits._
val df = values.toDF("c1", "c2", "c3").persist()
val allCols = df.columns.map(col)

df.select(basicUdf(struct(allCols: _*)).as("valuesScore"))
  .select(explode(col("valuesScore")))
  .distinct()
  .show()
  }
}
{code}
This shows:
{code:java}
+---+
|col|
+---+
+---+
{code}
When adding persist before explode, the result is correct:
{code:java}
df.select(basicUdf(struct(allCols: _*)).as("valuesScore"))
  .persist()
  .select(explode(col("valuesScore")))
  .distinct()
  .show()
{code}
=>
{code:java}
++
| col|
++
|{2, 2.0}|
|{4, 4.0}|
|{3, 3.0}|
|{5, 5.0}|
|{1, 1.0}|
++
{code}
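One hedged workaround, assuming the root cause is that the deterministic UDF 
gets re-evaluated around the explode: mark the UDF non-deterministic 
(UserDefinedFunction.asNondeterministic, available since Spark 2.3) so the 
optimizer is not allowed to duplicate it. The snippet below reuses body, df 
and allCols exactly as defined in the code above and only changes how the UDF 
is registered; note it also disables some other optimizations, and avoiding 
side effects such as the shared cache inside the UDF sidesteps the problem 
entirely.
{code:java}
import org.apache.spark.sql.functions.{col, explode, struct, udf}

// `body`, `df` and `allCols` are the same values defined in the reproducer above.
val nonDetUdf = udf(body).asNondeterministic()

df.select(nonDetUdf(struct(allCols: _*)).as("valuesScore"))
  .select(explode(col("valuesScore")))
  .distinct()
  .show()
{code}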
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34392) Invalid ID for offset-based ZoneId since Spark 3.0

2021-03-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34392:
-
Fix Version/s: 3.0.3

> Invalid ID for offset-based ZoneId since Spark 3.0
> --
>
> Key: SPARK-34392
> URL: https://issues.apache.org/jira/browse/SPARK-34392
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yuming Wang
>Assignee: karl wang
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> How to reproduce this issue:
> {code:sql}
> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> {code}
> Spark 2.4:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 2020-02-07 08:00:00
> Time taken: 0.089 seconds, Fetched 1 row(s)
> {noformat}
> Spark 3.x:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 21/02/07 01:24:32 ERROR SparkSQLDriver: Failed in [select 
> to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")]
> java.time.DateTimeException: Invalid ID for offset-based ZoneId: GMT+8:00
>   at java.time.ZoneId.ofWithPrefix(ZoneId.java:437)
>   at java.time.ZoneId.of(ZoneId.java:407)
>   at java.time.ZoneId.of(ZoneId.java:359)
>   at java.time.ZoneId.of(ZoneId.java:315)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:53)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.toUTCTime(DateTimeUtils.scala:814)
> {noformat}
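As a hedged sketch of the underlying java.time behaviour that Spark 3 
delegates to: a zero-padded offset such as GMT+08:00 parses, while the 
single-digit form from the report is rejected, so padding the offset is 
expected to be a usable workaround, e.g. 
to_utc_timestamp("2020-02-07 16:00:00", "GMT+08:00").
{code:java}
import java.time.{DateTimeException, ZoneId}

object ZoneIdSketch {
  def main(args: Array[String]): Unit = {
    // Zero-padded two-digit offset: accepted.
    println(ZoneId.of("GMT+08:00"))

    // Single-digit hour after the GMT prefix: rejected, matching the
    // "Invalid ID for offset-based ZoneId" error in the report.
    try ZoneId.of("GMT+8:00")
    catch { case e: DateTimeException => println(e.getMessage) }
  }
}
{code}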



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34563) Checkpointing a union with another checkpoint fails

2021-03-05 Thread Michael Kamprath (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Kamprath updated SPARK-34563:
-
Affects Version/s: 3.1.1

> Checkpointing a union with another checkpoint fails
> ---
>
> Key: SPARK-34563
> URL: https://issues.apache.org/jira/browse/SPARK-34563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.2, 3.1.1
> Environment: I am running Spark 3.0.2 in standalone cluster mode, 
> built for Hadoop 2.7, and Scala 2.12.12. I am using QFS 2.2.2 (Quantcast File 
> System) as the underlying DFS. The nodes run on Debian Stretch, and Java is 
> openjdk version "1.8.0_275". 
>Reporter: Michael Kamprath
>Priority: Major
>
> I have some PySpark code that periodically checkpoints a data frame that I 
> am building in pieces by union-ing those pieces together as they are 
> constructed. (Py)Spark fails on the second checkpoint, which would be a union 
> of a new piece of the desired data frame with a previously checkpointed 
> piece. Some simplified PySpark code that will trigger this problem is:
>  
> {code:java}
> RANGE_STEP = 1
> PARTITIONS = 5
> COUNT_UNIONS = 20
> df = spark.range(1, RANGE_STEP+1, numPartitions=PARTITIONS)
> for i in range(1, COUNT_UNIONS+1):
> print('Processing i = {0}'.format(i))
> new_df = spark.range(RANGE_STEP*i + 1, RANGE_STEP*(i+1) + 1, 
> numPartitions=PARTITIONS)
> df = df.union(new_df).checkpoint()
> df.count()
> {code}
> When this code gets to the checkpoint on the second loop iteration (i=2) the 
> job fails with an error:
>  
> {code:java}
> Py4JJavaError: An error occurred while calling o119.checkpoint.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 
> in stage 10.0 failed 4 times, most recent failure: Lost task 9.3 in stage 
> 10.0 (TID 264, 10.20.30.13, executor 0): 
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9062
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:296)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804)
>   at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227)
>   at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2154)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
>   at 

[jira] [Commented] (SPARK-34563) Checkpointing a union with another checkpoint fails

2021-03-05 Thread Michael Kamprath (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296250#comment-17296250
 ] 

Michael Kamprath commented on SPARK-34563:
--

I just tested this under Spark 3.1.1, keeping everything else in my setup the 
same, and it fails at the same point. However, the exception thrown looks 
slightly different:

 
{code:java}
Py4JJavaError Traceback (most recent call last)
 in 
  8 print('Processing i = {0}'.format(i))
  9 new_df = spark.range(RANGE_STEP*i + 1, RANGE_STEP*(i+1) + 1, 
numPartitions=PARTITIONS)
---> 10 df = df.union(new_df).checkpoint()
 11 
 12 df.count()

/usr/spark-3.1.1/python/pyspark/sql/dataframe.py in checkpoint(self, eager)
544 This API is experimental.
545 """
--> 546 jdf = self._jdf.checkpoint(eager)
547 return DataFrame(jdf, self.sql_ctx)
548 

/usr/spark-3.1.1/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
   1303 answer = self.gateway_client.send_command(command)
   1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307 for temp_arg in temp_args:

/usr/spark-3.1.1/python/pyspark/sql/utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)

/usr/spark-3.1.1/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling o65.checkpoint.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in 
stage 2.0 failed 4 times, most recent failure: Lost task 8.3 in stage 2.0 (TID 
50) (10.20.30.17 executor 3): java.lang.IndexOutOfBoundsException: Index: 61, 
Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:659)
at java.util.ArrayList.get(ArrayList.java:435)
at 
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:811)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:296)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866)
at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253)
at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253)
at 
org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
at 

[jira] [Assigned] (SPARK-34643) Use CRAN URL in canonical form

2021-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34643:
-

Assignee: Hyukjin Kwon

> Use CRAN URL in canonical form
> --
>
> Key: SPARK-34643
> URL: https://issues.apache.org/jira/browse/SPARK-34643
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34643) Use CRAN URL in canonical form

2021-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34643.
---
Fix Version/s: 3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 31759
[https://github.com/apache/spark/pull/31759]

> Use CRAN URL in canonical form
> --
>
> Key: SPARK-34643
> URL: https://issues.apache.org/jira/browse/SPARK-34643
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34624) Filter non-jar dependencies from ivy/maven coordinates

2021-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34624.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31741
[https://github.com/apache/spark/pull/31741]

> Filter non-jar dependencies from ivy/maven coordinates
> --
>
> Key: SPARK-34624
> URL: https://issues.apache.org/jira/browse/SPARK-34624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Shardul Mahadik
>Assignee: Shardul Mahadik
>Priority: Major
> Fix For: 3.2.0
>
>
> Some maven artifacts define non-jar dependencies. One such example is 
> {{hive-exec}}'s dependency on the {{pom}} of {{apache-curator}} 
> https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.8/hive-exec-2.3.8.pom
> Today trying to depend on such an artifact using {{--packages}} will print an 
> error but continue without including the non-jar dependency.
> {code}
> 1/03/04 09:46:49 ERROR SparkContext: Failed to add 
> file:/Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar 
> to Spark environment
> java.io.FileNotFoundException: Jar 
> /Users/shardul/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar not 
> found
>   at 
> org.apache.spark.SparkContext.addLocalJarFile$1(SparkContext.scala:1935)
>   at org.apache.spark.SparkContext.addJar(SparkContext.scala:1990)
>   at org.apache.spark.SparkContext.$anonfun$new$12(SparkContext.scala:501)
>   at 
> org.apache.spark.SparkContext.$anonfun$new$12$adapted(SparkContext.scala:501)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> {code}
> Doing the same using {{spark.sql("ADD JAR 
> ivy://org.apache.hive:hive-exec:2.3.8?exclude=org.pentaho:pentaho-aggdesigner-algorithm")}}
>  will cause a failure
> {code}
> ADD JAR /Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar
> /Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar does 
> not exist
> ==
> END HIVE FAILURE OUTPUT
> ==
> org.apache.spark.sql.execution.QueryExecutionException: 
> /Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar does 
> not exist
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:841)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:800)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:787)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:947)
>   at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.$anonfun$addJar$1(HiveSessionStateBuilder.scala:130)
>   at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.$anonfun$addJar$1$adapted(HiveSessionStateBuilder.scala:129)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
>   at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
>   at 
> org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
> 

[jira] [Assigned] (SPARK-34624) Filter non-jar dependencies from ivy/maven coordinates

2021-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34624:
-

Assignee: Shardul Mahadik

> Filter non-jar dependencies from ivy/maven coordinates
> --
>
> Key: SPARK-34624
> URL: https://issues.apache.org/jira/browse/SPARK-34624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Shardul Mahadik
>Assignee: Shardul Mahadik
>Priority: Major
>
> Some maven artifacts define non-jar dependencies. One such example is 
> {{hive-exec}}'s dependency on the {{pom}} of {{apache-curator}} 
> https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.8/hive-exec-2.3.8.pom
> Today trying to depend on such an artifact using {{--packages}} will print an 
> error but continue without including the non-jar dependency.
> {code}
> 1/03/04 09:46:49 ERROR SparkContext: Failed to add 
> file:/Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar 
> to Spark environment
> java.io.FileNotFoundException: Jar 
> /Users/shardul/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar not 
> found
>   at 
> org.apache.spark.SparkContext.addLocalJarFile$1(SparkContext.scala:1935)
>   at org.apache.spark.SparkContext.addJar(SparkContext.scala:1990)
>   at org.apache.spark.SparkContext.$anonfun$new$12(SparkContext.scala:501)
>   at 
> org.apache.spark.SparkContext.$anonfun$new$12$adapted(SparkContext.scala:501)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> {code}
> Doing the same using {{spark.sql("ADD JAR 
> ivy://org.apache.hive:hive-exec:2.3.8?exclude=org.pentaho:pentaho-aggdesigner-algorithm")}}
>  will cause a failure
> {code}
> ADD JAR /Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar
> /Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar does 
> not exist
> ==
> END HIVE FAILURE OUTPUT
> ==
> org.apache.spark.sql.execution.QueryExecutionException: 
> /Users/smahadik/.ivy2/jars/org.apache.curator_apache-curator-2.7.1.jar does 
> not exist
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:841)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:800)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:787)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:947)
>   at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.$anonfun$addJar$1(HiveSessionStateBuilder.scala:130)
>   at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.$anonfun$addJar$1$adapted(HiveSessionStateBuilder.scala:129)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
>   at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
>   at 
> org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
>   at 

[jira] [Commented] (SPARK-34641) ARM CI failed due to download hive-exec failed

2021-03-05 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296239#comment-17296239
 ] 

Shane Knapp commented on SPARK-34641:
-

looks like hive-exec downloaded fine in my recent build:

[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/565/console]
{code:java}
Downloading from gcs-maven-central-mirror: 
https://maven-central.storage-download.googleapis.com/maven2/org/apache/hive/hive-exec/2.3.8/hive-exec-2.3.8.pom
Progress (1): 3.1/31 kB
Progress (1): 5.8/31 kB
Progress (1): 8.6/31 kB
Progress (1): 11/31 kB 
Progress (1): 14/31 kB
Progress (1): 16/31 kB
Progress (1): 19/31 kB
Progress (1): 22/31 kB
Progress (1): 25/31 kB
Progress (1): 28/31 kB
Progress (1): 30/31 kB
Progress (1): 31 kB   
   
Downloaded from gcs-maven-central-mirror: 
https://maven-central.storage-download.googleapis.com/maven2/org/apache/hive/hive-exec/2.3.8/hive-exec-2.3.8.pom
 (31 kB at 22 kB/s)
{code}
we'll wait and see how this build turns out.

> ARM CI failed due to download hive-exec failed
> --
>
> Key: SPARK-34641
> URL: https://issues.apache.org/jira/browse/SPARK-34641
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Priority: Major
>
> The download failure (org.apache.hive#hive-exec;3.0.0!hive-exec.jar) happened 
> in recent spark-master-test-maven-arm build tests.
> But it is not reproducible, and all Hive tests passed in my ARM and x86 local 
> environments. The Jenkins x86 job like [2] is also unstable for other reasons, 
> so it looks like we can't find a valid reference.
> [1] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/]
> [2] 
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34641) ARM CI failed due to download hive-exec failed

2021-03-05 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296235#comment-17296235
 ] 

Shane Knapp commented on SPARK-34641:
-

no clue, tbh.  i'll try and take a closer look at the build logs later, but for 
now i wiped the ivy and maven caches on the VM and triggered a fresh build.

> ARM CI failed due to download hive-exec failed
> --
>
> Key: SPARK-34641
> URL: https://issues.apache.org/jira/browse/SPARK-34641
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Priority: Major
>
> The download failure (org.apache.hive#hive-exec;3.0.0!hive-exec.jar) happened 
> in recent spark-master-test-maven-arm build tests.
> But it is not reproducible, and all Hive tests passed in my ARM and x86 local 
> environments. The Jenkins x86 job like [2] is also unstable for other reasons, 
> so it looks like we can't find a valid reference.
> [1] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/]
> [2] 
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34643) Use CRAN URL in canonical form

2021-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34643:


Assignee: Apache Spark

> Use CRAN URL in canonical form
> --
>
> Key: SPARK-34643
> URL: https://issues.apache.org/jira/browse/SPARK-34643
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34643) Use CRAN URL in canonical form

2021-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34643:


Assignee: (was: Apache Spark)

> Use CRAN URL in canonical form
> --
>
> Key: SPARK-34643
> URL: https://issues.apache.org/jira/browse/SPARK-34643
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34643) Use CRAN URL in canonical form

2021-03-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296227#comment-17296227
 ] 

Apache Spark commented on SPARK-34643:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/31759

> Use CRAN URL in canonical form
> --
>
> Key: SPARK-34643
> URL: https://issues.apache.org/jira/browse/SPARK-34643
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34642) TypeError in Pyspark Linear Regression docs

2021-03-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296226#comment-17296226
 ] 

Apache Spark commented on SPARK-34642:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/31760

> TypeError in Pyspark Linear Regression docs
> ---
>
> Key: SPARK-34642
> URL: https://issues.apache.org/jira/browse/SPARK-34642
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, ML, PySpark
>Affects Versions: 3.1.1
>Reporter: Sean R. Owen
>Priority: Minor
>
> The documentation for Linear Regression in Pyspark includes an example, which 
> contains a TypeError:
> https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html?highlight=linearregression#pyspark.ml.regression.LinearRegression
> {code}
> >>> abs(model.transform(test1).head().newPrediction - 1.0) < 0.001
> True
> >>> lr.setParams("vector")
> Traceback (most recent call last):
> ...
> TypeError: Method setParams forces keyword arguments.
> {code}
> I'm pretty sure we don't intend this, and it can be resolved by just changing 
> the function call to use keyword args.
> (HT Brooke Wenig for flagging this)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34642) TypeError in Pyspark Linear Regression docs

2021-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34642:


Assignee: Apache Spark

> TypeError in Pyspark Linear Regression docs
> ---
>
> Key: SPARK-34642
> URL: https://issues.apache.org/jira/browse/SPARK-34642
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, ML, PySpark
>Affects Versions: 3.1.1
>Reporter: Sean R. Owen
>Assignee: Apache Spark
>Priority: Minor
>
> The documentation for Linear Regression in Pyspark includes an example, which 
> contains a TypeError:
> https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html?highlight=linearregression#pyspark.ml.regression.LinearRegression
> {code}
> >>> abs(model.transform(test1).head().newPrediction - 1.0) < 0.001
> True
> >>> lr.setParams("vector")
> Traceback (most recent call last):
> ...
> TypeError: Method setParams forces keyword arguments.
> {code}
> I'm pretty sure we don't intend this, and it can be resolved by just changing 
> the function call to use keyword args.
> (HT Brooke Wenig for flagging this)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34642) TypeError in Pyspark Linear Regression docs

2021-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34642:


Assignee: (was: Apache Spark)

> TypeError in Pyspark Linear Regression docs
> ---
>
> Key: SPARK-34642
> URL: https://issues.apache.org/jira/browse/SPARK-34642
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, ML, PySpark
>Affects Versions: 3.1.1
>Reporter: Sean R. Owen
>Priority: Minor
>
> The documentation for Linear Regression in Pyspark includes an example, which 
> contains a TypeError:
> https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html?highlight=linearregression#pyspark.ml.regression.LinearRegression
> {code}
> >>> abs(model.transform(test1).head().newPrediction - 1.0) < 0.001
> True
> >>> lr.setParams("vector")
> Traceback (most recent call last):
> ...
> TypeError: Method setParams forces keyword arguments.
> {code}
> I'm pretty sure we don't intend this, and it can be resolved by just changing 
> the function call to use keyword args.
> (HT Brooke Wenig for flagging this)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34592) Mark indeterminate RDD in Web UI

2021-03-05 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-34592.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31707
[https://github.com/apache/spark/pull/31707]

> Mark indeterminate RDD in Web UI
> 
>
> Key: SPARK-34592
> URL: https://issues.apache.org/jira/browse/SPARK-34592
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> It is somewhat hard to track which part of a graph of RDDs is indeterminate. 
> In some cases we may need to track indeterminate RDDs, for example when an 
> indeterminate map stage fails and Spark is unable to fall back to some parent 
> stages. If the Web UI could mark indeterminate RDDs the way it marks cached 
> RDDs, that would make them easier to track.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34643) Use CRAN URL in canonical form

2021-03-05 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-34643:
-

 Summary: Use CRAN URL in canonical form
 Key: SPARK-34643
 URL: https://issues.apache.org/jira/browse/SPARK-34643
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34642) TypeError in Pyspark Linear Regression docs

2021-03-05 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-34642:


 Summary: TypeError in Pyspark Linear Regression docs
 Key: SPARK-34642
 URL: https://issues.apache.org/jira/browse/SPARK-34642
 Project: Spark
  Issue Type: Bug
  Components: Documentation, ML, PySpark
Affects Versions: 3.1.1
Reporter: Sean R. Owen


The documentation for Linear Regression in Pyspark includes an example, which 
contains a TypeError:

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html?highlight=linearregression#pyspark.ml.regression.LinearRegression

{code}
>>> abs(model.transform(test1).head().newPrediction - 1.0) < 0.001
True
>>> lr.setParams("vector")
Traceback (most recent call last):
...
TypeError: Method setParams forces keyword arguments.
{code}

I'm pretty sure we don't intend this, and it can be resolved by just changing 
the function call to use keyword args.

(HT Brooke Wenig for flagging this)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34636) sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly.

2021-03-05 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-34636:
---
Description: 
UnresolvedAttribute, AttributeReference and Alias which take a qualifier don't 
quote qualified names properly.

One instance is reported in SPARK-34626.
Other instances are as follows.

{code}
UnresolvedAttribute("a`b"::"c.d"::Nil).sql
a`b.`c.d` // expected: `a``b`.`c.d`

AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql
c.d.`a.b` // expected: `c.d`.`a.b`

Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = "d.e"::Nil).sql
`a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
{code}

  was:
UnresolvedAttribute, AttributeReference and Alias which take a qualifier don't 
quote qualified names properly.

One instance is reported in SPARK-34626.
Other instances are as follows.

{code}
UnresolvedAttribute("a`b"::"c.d"::Nil).sql
a`b.`c.d` // expected: `a``b`.`c.d`

AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil)
c.d.`a.b` // expected: `c.d`.`a.b`

Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = "d.e"::Nil).sql
`a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
{code}


> sql method in UnresolvedAttribute, AttributeReference and Alias don't quote 
> qualified names properly.
> -
>
> Key: SPARK-34636
> URL: https://issues.apache.org/jira/browse/SPARK-34636
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> UnresolvedAttribute, AttributeReference and Alias which take a qualifier 
> don't quote qualified names properly.
> One instance is reported in SPARK-34626.
> Other instances are as follows.
> {code}
> UnresolvedAttribute("a`b"::"c.d"::Nil).sql
> a`b.`c.d` // expected: `a``b`.`c.d`
> AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql
> c.d.`a.b` // expected: `c.d`.`a.b`
> Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = 
> "d.e"::Nil).sql
> `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
> {code}
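The expected strings in the examples above all follow the same quoting rule: wrap each name part in backticks and double any embedded backtick. A minimal standalone sketch of that rule (an illustration of the expected output, not Spark's actual helper):
{code:scala}
// Quote a single identifier part: wrap in backticks, doubling embedded backticks.
def quotePart(part: String): String = "`" + part.replace("`", "``") + "`"

// Quote a qualified name given as its parts.
def quoteQualified(parts: Seq[String]): String = parts.map(quotePart).mkString(".")

quoteQualified(Seq("a`b", "c.d"))  // `a``b`.`c.d`
quoteQualified(Seq("c.d", "a.b"))  // `c.d`.`a.b`
quoteQualified(Seq("d.e", "b.c"))  // `d.e`.`b.c` (the alias target in the last example)
{code}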



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2021-03-05 Thread Itay Bittan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296125#comment-17296125
 ] 

Itay Bittan commented on SPARK-33288:
-

[~tgraves] you are right! I used my compiled version (3.0.1) instead of the 
official 3.1.1! thanks!

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config, so we can add 
> support for stage level scheduling when that is turned on.
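For readers who have not used this combination before, the sketch below shows how dynamic allocation with shuffle tracking is typically switched on for a Kubernetes master; the master URL and executor bounds are illustrative placeholders, not values taken from this ticket:
{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative configuration only: dynamic allocation with shuffle tracking,
// which is the precondition this ticket mentions for stage-level scheduling
// on Kubernetes. Placeholder master URL and executor bounds.
val spark = SparkSession.builder()
  .master("k8s://https://kubernetes.example.com:6443")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .getOrCreate()
{code}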



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34641) ARM CI failed due to download hive-exec failed

2021-03-05 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296072#comment-17296072
 ] 

Yikun Jiang commented on SPARK-34641:
-

[~shaneknapp] Do you have any idea on it?

> ARM CI failed due to download hive-exec failed
> --
>
> Key: SPARK-34641
> URL: https://issues.apache.org/jira/browse/SPARK-34641
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Priority: Major
>
> The download failure (org.apache.hive#hive-exec;3.0.0!hive-exec.jar) happened 
> in recent spark-master-test-maven-arm build tests.
> But it is not reproducible, and all Hive tests passed in my ARM and x86 local 
> environments. The Jenkins x86 job like [2] is also unstable for other reasons, 
> so it looks like we can't find a valid reference.
> [1] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/]
> [2] 
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34641) ARM CI failed due to download hive-exec failed

2021-03-05 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-34641:
---

 Summary: ARM CI failed due to download hive-exec failed
 Key: SPARK-34641
 URL: https://issues.apache.org/jira/browse/SPARK-34641
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.1.2
Reporter: Yikun Jiang


The download failure (org.apache.hive#hive-exec;3.0.0!hive-exec.jar) happened 
in recent spark-master-test-maven-arm build tests.

But it is not reproducible, and all Hive tests passed in my ARM and x86 local 
environments.

The Jenkins x86 job like [2] is also unstable for other reasons, so it looks 
like we can't find a valid reference.

[1] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/]

[2] 
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2021-03-05 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296060#comment-17296060
 ] 

Thomas Graves commented on SPARK-33288:
---

It looks like the version you are trying to launch against is incompatible, 
i.e. your backend is using Spark 3.0 and your client is using Spark 3.1.1.

Use the same version for both.

 

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config, so we can add 
> support for stage level scheduling when that is turned on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2021-03-05 Thread Itay Bittan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296027#comment-17296027
 ] 

Itay Bittan commented on SPARK-33288:
-

Hi!

I just upgraded from spark 3.0.1 to 3.1.1 and I'm having an issue with the 
resourceProfileId.

I saw that this [PR|https://github.com/apache/spark/pull/30204/files] added 
this argument, but I didn't find documentation about it.

I'm running a simple spark application in client-mode via Jupyter (as the 
driver pod) + pyspark.

I also asked in 
[SO|https://stackoverflow.com/questions/66482218/spark-executors-fails-on-kubernetes-resourceprofileid-is-missing].

I'd appreciate any clues.

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config, so we can add 
> support for stage level scheduling when that is turned on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34640) unable to access grouping column after groupBy

2021-03-05 Thread Jiri Humpolicek (Jira)
Jiri Humpolicek created SPARK-34640:
---

 Summary: unable to access grouping column after groupBy
 Key: SPARK-34640
 URL: https://issues.apache.org/jira/browse/SPARK-34640
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Jiri Humpolicek


When I group by a nested column, I am unable to reference it after the groupBy 
operation.

Example:
1) Preparing a dataframe with a nested column:
{code:scala}
case class Sub(a2: String)
case class Top(a1: String, s: Sub)

val s = Seq(
Top("r1", Sub("s1")),
Top("r2", Sub("s3"))
)
val df = s.toDF

df.printSchema
// root
//  |-- a1: string (nullable = true)
//  |-- s: struct (nullable = true)
//  ||-- a2: string (nullable = true)
{code}
2) Try to access the grouping column after groupBy:
{code:scala}
df.groupBy($"s.a2").count.select('a2)
// org.apache.spark.sql.AnalysisException: cannot resolve '`a2`' given input 
columns: [count, s.a2];
{code}
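As the error message shows, the grouping column in the output is literally named "s.a2", so two untested workaround sketches, reusing `df` from the example above (not a statement about the intended fix), would be to quote that generated name with backticks, or to alias the nested field before grouping:
{code:scala}
// (a) refer to the generated column name verbatim, using backticks so the dot
//     is not parsed as a struct field access
df.groupBy($"s.a2").count.select($"`s.a2`")

// (b) alias the nested field before grouping so the output column is named a2
df.groupBy($"s.a2".as("a2")).count.select($"a2")
{code}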
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly

2021-03-05 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-34634:
---
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3
   2.4.4
   2.4.5
   2.4.6
   2.4.7

> Self-join with script transformation failed to resolve attribute correctly
> --
>
> Key: SPARK-34634
> URL: https://issues.apache.org/jira/browse/SPARK-34634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 
> 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: EdisonWang
>Assignee: Apache Spark
>Priority: Minor
>
> To reproduce,
> {code:java}
> create temporary view t as select * from values 0, 1, 2 as t(a);
> WITH temp AS (
> SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
> )
> SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
> {code}
>  
> Spark will throw AnalysisException
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34639) always remove unnecessary Alias in Analyzer.resolveExpression

2021-03-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295901#comment-17295901
 ] 

Apache Spark commented on SPARK-34639:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31758

> always remove unnecessary Alias in Analyzer.resolveExpression
> -
>
> Key: SPARK-34639
> URL: https://issues.apache.org/jira/browse/SPARK-34639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34639) always remove unnecessary Alias in Analyzer.resolveExpression

2021-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34639:


Assignee: (was: Apache Spark)

> always remove unnecessary Alias in Analyzer.resolveExpression
> -
>
> Key: SPARK-34639
> URL: https://issues.apache.org/jira/browse/SPARK-34639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34639) always remove unnecessary Alias in Analyzer.resolveExpression

2021-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34639:


Assignee: Apache Spark

> always remove unnecessary Alias in Analyzer.resolveExpression
> -
>
> Key: SPARK-34639
> URL: https://issues.apache.org/jira/browse/SPARK-34639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33602) Group exception messages in execution/datasources

2021-03-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295900#comment-17295900
 ] 

Apache Spark commented on SPARK-33602:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/31757

> Group exception messages in execution/datasources
> -
>
> Key: SPARK-33602
> URL: https://issues.apache.org/jira/browse/SPARK-33602
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources'
> || Filename||   Count ||
> | DataSource.scala|   9 |
> | DataSourceStrategy.scala|   1 |
> | DataSourceUtils.scala   |   2 |
> | FileFormat.scala|   1 |
> | FileFormatWriter.scala  |   3 |
> | FileScanRDD.scala   |   2 |
> | InsertIntoHadoopFsRelationCommand.scala |   2 |
> | PartitioningAwareFileIndex.scala|   1 |
> | PartitioningUtils.scala |   3 |
> | RecordReaderIterator.scala  |   1 |
> | rules.scala |   4 |
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile'
> || Filename   ||   Count ||
> | BinaryFileFormat.scala |   2 |
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc'
> || Filename  ||   Count ||
> | JDBCOptions.scala |   2 |
> | JdbcUtils.scala   |   6 |
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc'
> || Filename  ||   Count ||
> | OrcDeserializer.scala |   1 |
> | OrcFilters.scala  |   1 |
> | OrcSerializer.scala   |   1 |
> | OrcUtils.scala|   2 |
> '/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet'
> || Filename ||   Count ||
> | ParquetFileFormat.scala  |   2 |
> | ParquetReadSupport.scala |   1 |
> | ParquetSchemaConverter.scala |   6 |
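For context, "grouping" in this umbrella presumably means moving the scattered message-building sites listed above behind a per-layer errors object so the wording stays consistent. A minimal sketch of that shape, with illustrative names rather than Spark's actual API:
{code:scala}
// Illustrative pattern only; object and method names are made up for the sketch.
object DataSourceErrors {
  def pathNotFoundError(path: String): Throwable =
    new java.io.FileNotFoundException(s"Path does not exist: $path")

  def unsupportedDataTypeError(dataType: String): Throwable =
    new UnsupportedOperationException(s"Unsupported data type: $dataType")
}

// A call site in, say, DataSource.scala would then become:
// throw DataSourceErrors.pathNotFoundError(path)
{code}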



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34639) always remove unnecessary Alias in Analyzer.resolveExpression

2021-03-05 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-34639:
---

 Summary: always remove unnecessary Alias in 
Analyzer.resolveExpression
 Key: SPARK-34639
 URL: https://issues.apache.org/jira/browse/SPARK-34639
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29721) Spark SQL reads unnecessary nested fields after using explode

2021-03-05 Thread Jiri Humpolicek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295859#comment-17295859
 ] 

Jiri Humpolicek commented on SPARK-29721:
-

here it is [SPARK-34638|https://issues.apache.org/jira/browse/SPARK-34638]

> Spark SQL reads unnecessary nested fields after using explode
> -
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kai Kang
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column 
> pruning for nested structures. However, when explode() is called on a nested 
> field, all columns for that nested structure are still fetched from the data 
> source.
> We are working on a project to create a parquet store for a big pre-joined 
> table between two tables that has one-to-many relationship, and this is a 
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>    {"itemId": 1, "itemData": "a"},
>    {"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34638) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-03-05 Thread Jiri Humpolicek (Jira)
Jiri Humpolicek created SPARK-34638:
---

 Summary: Spark SQL reads unnecessary nested fields (another type 
of pruning case)
 Key: SPARK-34638
 URL: https://issues.apache.org/jira/browse/SPARK-34638
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Jiri Humpolicek


Based on this [SPARK-29721|https://issues.apache.org/jira/browse/SPARK-29721] I 
found another nested fields pruning case.

Example:
1) Loading data

{code:scala}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}

2) read query with explain

{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

read.select(explode($"items").as('item)).select($"item.itemId").explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}
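Since the SPARK-29721 fix (Fix For 3.1.0) covers the case where the nested field is selected before the explode, a possible rewrite of the query above that should still be pruned, untested against this report, is to push the field access inside the explode:
{code:scala}
// Untested comparison sketch, reusing `read` from the example above.

// Reported case: explode the whole struct, then pick one field.
// ReadSchema ends up containing both itemId and itemData.
read.select(explode($"items").as("item")).select($"item.itemId").explain(true)

// Rewrite along the lines of SPARK-29721: select the nested field first,
// so only items.itemId should need to be read.
read.select(explode($"items.itemId")).explain(true)
{code}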




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org