[jira] [Reopened] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-34406:
--

> When we submit spark core tasks frequently, the submitted nodes will have a 
> lot of resource pressure
> 
>
> Key: SPARK-34406
> URL: https://issues.apache.org/jira/browse/SPARK-34406
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When we submit spark core tasks frequently, the submitted node will have a 
> lot of resource pressure, because spark will create a process instead of a 
> thread for each submitted task. In fact, there is a lot of resource 
> consumption. When the QPS of the submitted task is very high, the submission 
> will fail due to insufficient resources. I would like to ask how to optimize 
> the amount of resources consumed by spark core submission






[jira] [Commented] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure

2021-02-08 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281560#comment-17281560
 ] 

hao commented on SPARK-34406:
-

Yes, sir. I'm using YARN cluster mode. What I mean is that when the Spark client 
submits a Spark Core job to the remote YARN cluster, the submission itself is done by 
launching a separate process, and that consumes a lot of resources on the submitting node.
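For what it's worth, one way to cut that per-submission cost is Spark's launcher library, 
which can run the submission inside the caller's JVM instead of forking a spark-submit 
process for every job. A minimal sketch (the jar path, main class and memory setting 
below are placeholders, not taken from this report):

{code:java}
import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

// Sketch: submit to YARN in cluster mode without spawning a separate
// spark-submit JVM for each submission.
val handle: SparkAppHandle = new InProcessLauncher()
  .setMaster("yarn")
  .setDeployMode("cluster")            // the driver still runs inside YARN
  .setAppResource("/path/to/app.jar")  // placeholder
  .setMainClass("com.example.Main")    // placeholder
  .setConf("spark.executor.memory", "2g")
  .startApplication()
{code}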

> When we submit spark core tasks frequently, the submitted nodes will have a 
> lot of resource pressure
> 
>
> Key: SPARK-34406
> URL: https://issues.apache.org/jira/browse/SPARK-34406
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When we submit spark core tasks frequently, the submitted node will have a 
> lot of resource pressure, because spark will create a process instead of a 
> thread for each submitted task. In fact, there is a lot of resource 
> consumption. When the QPS of the submitted task is very high, the submission 
> will fail due to insufficient resources. I would like to ask how to optimize 
> the amount of resources consumed by spark core submission






[jira] [Resolved] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34389.
--
Resolution: Not A Problem

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> In case the cluster does not have sufficient resources (CPU/memory) for the minimum 
> number of executors, the executors stay in the Pending state for an indefinite time 
> until the resources get freed.
> Suppose the cluster configuration is:
> total memory = 204Gi
> used memory = 200Gi
> free memory = 4Gi
> spark.executor.memory = 10G
> spark.dynamicAllocation.minExecutors = 4
> spark.dynamicAllocation.maxExecutors = 8
> With only 4Gi free, not even one 10G executor can be scheduled. Rather than waiting, 
> the job should be cancelled if the requested minimum number of executors is not 
> available at that point in time because of resource unavailability.
> Currently it does partial scheduling or no scheduling and waits indefinitely, and 
> the job gets stuck.






[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281544#comment-17281544
 ] 

Hyukjin Kwon commented on SPARK-34389:
--

I think this is more of a question, so I will tentatively resolve this ticket. 
[~ranju] it would be great if we could discuss it on the mailing list first 
before filing it as an issue.

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> In case the cluster does not have sufficient resources (CPU/memory) for the minimum 
> number of executors, the executors stay in the Pending state for an indefinite time 
> until the resources get freed.
> Suppose the cluster configuration is:
> total memory = 204Gi
> used memory = 200Gi
> free memory = 4Gi
> spark.executor.memory = 10G
> spark.dynamicAllocation.minExecutors = 4
> spark.dynamicAllocation.maxExecutors = 8
> With only 4Gi free, not even one 10G executor can be scheduled. Rather than waiting, 
> the job should be cancelled if the requested minimum number of executors is not 
> available at that point in time because of resource unavailability.
> Currently it does partial scheduling or no scheduling and waits indefinitely, and 
> the job gets stuck.






[jira] [Commented] (SPARK-34392) Invalid ID for offset-based ZoneId since Spark 3.0

2021-02-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281543#comment-17281543
 ] 

Hyukjin Kwon commented on SPARK-34392:
--

cc [~maxgekk] FYI

> Invalid ID for offset-based ZoneId since Spark 3.0
> --
>
> Key: SPARK-34392
> URL: https://issues.apache.org/jira/browse/SPARK-34392
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> {code}
> Spark 2.4:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 2020-02-07 08:00:00
> Time taken: 0.089 seconds, Fetched 1 row(s)
> {noformat}
> Spark 3.x:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 21/02/07 01:24:32 ERROR SparkSQLDriver: Failed in [select 
> to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")]
> java.time.DateTimeException: Invalid ID for offset-based ZoneId: GMT+8:00
>   at java.time.ZoneId.ofWithPrefix(ZoneId.java:437)
>   at java.time.ZoneId.of(ZoneId.java:407)
>   at java.time.ZoneId.of(ZoneId.java:359)
>   at java.time.ZoneId.of(ZoneId.java:315)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:53)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.toUTCTime(DateTimeUtils.scala:814)
> {noformat}
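As a sketch of what changed underneath (this is standard java.time vs. java.util.TimeZone 
behaviour, not something verified against Spark internals): the new API requires the 
offset digits to be zero-padded, while the legacy API normalises them, so the same ID 
that Spark 2.4 accepted is rejected by ZoneId.of in Spark 3.x.

{code:java}
// Scala sketch of the underlying difference
java.time.ZoneId.of("GMT+08:00")                  // ok: zero-padded offset
// java.time.ZoneId.of("GMT+8:00")                // throws DateTimeException: Invalid ID
java.util.TimeZone.getTimeZone("GMT+8:00").getID  // "GMT+08:00" under the legacy API
{code}

So writing the zone as "GMT+08:00" (or "+08:00") is presumably a workaround on Spark 3.x.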






[jira] [Resolved] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34406.
--
Resolution: Won't Fix

> When we submit spark core tasks frequently, the submitted nodes will have a 
> lot of resource pressure
> 
>
> Key: SPARK-34406
> URL: https://issues.apache.org/jira/browse/SPARK-34406
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When we submit spark core tasks frequently, the submitted node will have a 
> lot of resource pressure, because spark will create a process instead of a 
> thread for each submitted task. In fact, there is a lot of resource 
> consumption. When the QPS of the submitted task is very high, the submission 
> will fail due to insufficient resources. I would like to ask how to optimize 
> the amount of resources consumed by spark core submission






[jira] [Commented] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure

2021-02-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281541#comment-17281541
 ] 

Hyukjin Kwon commented on SPARK-34406:
--

You should use YARN cluster mode, which will distribute the drivers evenly across the cluster.

> When we submit spark core tasks frequently, the submitted nodes will have a 
> lot of resource pressure
> 
>
> Key: SPARK-34406
> URL: https://issues.apache.org/jira/browse/SPARK-34406
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When we submit spark core tasks frequently, the submitted node will have a 
> lot of resource pressure, because spark will create a process instead of a 
> thread for each submitted task. In fact, there is a lot of resource 
> consumption. When the QPS of the submitted task is very high, the submission 
> will fail due to insufficient resources. I would like to ask how to optimize 
> the amount of resources consumed by spark core submission






[jira] [Updated] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34406:
-
Priority: Major  (was: Critical)

> When we submit spark core tasks frequently, the submitted nodes will have a 
> lot of resource pressure
> 
>
> Key: SPARK-34406
> URL: https://issues.apache.org/jira/browse/SPARK-34406
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When we submit spark core tasks frequently, the submitted node will have a 
> lot of resource pressure, because spark will create a process instead of a 
> thread for each submitted task. In fact, there is a lot of resource 
> consumption. When the QPS of the submitted task is very high, the submission 
> will fail due to insufficient resources. I would like to ask how to optimize 
> the amount of resources consumed by spark core submission






[jira] [Updated] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources

2021-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34407:
--
Affects Version/s: 2.4.7

> KubernetesClusterSchedulerBackend.stop should clean up K8s resources
> 
>
> Key: SPARK-34407
> URL: https://issues.apache.org/jira/browse/SPARK-34407
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.1, 3.1.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources

2021-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34407:
-

Assignee: Dongjoon Hyun

> KubernetesClusterSchedulerBackend.stop should clean up K8s resources
> 
>
> Key: SPARK-34407
> URL: https://issues.apache.org/jira/browse/SPARK-34407
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources

2021-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34407.
---
Fix Version/s: 3.1.2
   2.4.8
   3.0.2
   Resolution: Fixed

Issue resolved by pull request 31533
[https://github.com/apache/spark/pull/31533]

> KubernetesClusterSchedulerBackend.stop should clean up K8s resources
> 
>
> Key: SPARK-34407
> URL: https://issues.apache.org/jira/browse/SPARK-34407
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.2, 2.4.8, 3.1.2
>
>







[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-08 Thread Ranju (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281534#comment-17281534
 ] 

Ranju commented on SPARK-34389:
---

Yes, I understand now why there is no retry logic. Thanks for the explanation; the 
issue can be closed.

Could you give some guidance on mitigating the indefinite wait for executors: is it 
possible to query the cluster's available resources, match them against the required 
executor resources, and submit the job only if they are sufficient?
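Not an authoritative answer, just a rough sketch of that kind of pre-check using the 
fabric8 Kubernetes client (which Spark itself bundles). Note that node allocatable 
capacity is not the same as currently free capacity, so a real check would also 
subtract the resource requests of pods that are already running:

{code:java}
import io.fabric8.kubernetes.api.model.Quantity
import io.fabric8.kubernetes.client.DefaultKubernetesClient
import scala.collection.JavaConverters._

val client = new DefaultKubernetesClient()
// Sum allocatable memory (bytes) across all nodes in the cluster.
val allocatableBytes = client.nodes().list().getItems.asScala
  .map(n => Quantity.getAmountInBytes(n.getStatus.getAllocatable.get("memory")).longValue())
  .sum

val executorBytes = 10L * 1024 * 1024 * 1024  // spark.executor.memory = 10G
val minExecutors  = 4                         // spark.dynamicAllocation.minExecutors

if (allocatableBytes >= executorBytes * minExecutors) {
  // submit the job (e.g. via spark-submit or the launcher API)
} else {
  // fail fast instead of letting executor pods sit in Pending indefinitely
}
client.close()
{code}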

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> In case the cluster does not have sufficient resources (CPU/memory) for the minimum 
> number of executors, the executors stay in the Pending state for an indefinite time 
> until the resources get freed.
> Suppose the cluster configuration is:
> total memory = 204Gi
> used memory = 200Gi
> free memory = 4Gi
> spark.executor.memory = 10G
> spark.dynamicAllocation.minExecutors = 4
> spark.dynamicAllocation.maxExecutors = 8
> With only 4Gi free, not even one 10G executor can be scheduled. Rather than waiting, 
> the job should be cancelled if the requested minimum number of executors is not 
> available at that point in time because of resource unavailability.
> Currently it does partial scheduling or no scheduling and waits indefinitely, and 
> the job gets stuck.






[jira] [Assigned] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value

2021-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34405:
-

Assignee: iteblog

> The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
> --
>
> Key: SPARK-34405
> URL: https://issues.apache.org/jira/browse/SPARK-34405
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0, 3.2.0, 3.1.1
>Reporter: iteblog
>Assignee: iteblog
>Priority: Minor
>
> The mean value of timersLabels in the PrometheusServlet class is wrong, You 
> can look at line 105 of this class: 
> [L105.|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105]
> {code:java}
> // code placeholder
> sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code}






[jira] [Resolved] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value

2021-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34405.
---
Fix Version/s: 3.0.2
   3.1.1
   Resolution: Fixed

Issue resolved by pull request 31532
[https://github.com/apache/spark/pull/31532]

> The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
> --
>
> Key: SPARK-34405
> URL: https://issues.apache.org/jira/browse/SPARK-34405
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0, 3.2.0, 3.1.1
>Reporter: iteblog
>Assignee: iteblog
>Priority: Minor
> Fix For: 3.1.1, 3.0.2
>
>
> The mean value of timersLabels in the PrometheusServlet class is wrong, You 
> can look at line 105 of this class: 
> [L105.|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105]
> {code:java}
> // code placeholder
> sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code}






[jira] [Updated] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value

2021-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34405:
--
Fix Version/s: (was: 3.1.1)
   3.1.2

> The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
> --
>
> Key: SPARK-34405
> URL: https://issues.apache.org/jira/browse/SPARK-34405
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0, 3.2.0, 3.1.1
>Reporter: iteblog
>Assignee: iteblog
>Priority: Minor
> Fix For: 3.0.2, 3.1.2
>
>
> The mean value of timersLabels in the PrometheusServlet class is wrong, You 
> can look at line 105 of this class: 
> [L105.|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105]
> {code:java}
> // code placeholder
> sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code}






[jira] [Updated] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value

2021-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34405:
--
Affects Version/s: 3.1.1
   3.2.0
   3.1.0

> The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
> --
>
> Key: SPARK-34405
> URL: https://issues.apache.org/jira/browse/SPARK-34405
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0, 3.2.0, 3.1.1
>Reporter: iteblog
>Priority: Minor
>
> The mean value of timersLabels in the PrometheusServlet class is wrong, You 
> can look at line 105 of this class: 
> [L105.|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105]
> {code:java}
> // code placeholder
> sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code}






[jira] [Updated] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources

2021-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34407:
--
Affects Version/s: 2.3.4

> KubernetesClusterSchedulerBackend.stop should clean up K8s resources
> 
>
> Key: SPARK-34407
> URL: https://issues.apache.org/jira/browse/SPARK-34407
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources

2021-02-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-34407:
-

 Summary: KubernetesClusterSchedulerBackend.stop should clean up 
K8s resources
 Key: SPARK-34407
 URL: https://issues.apache.org/jira/browse/SPARK-34407
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.1, 3.1.0, 3.1.1
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources

2021-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34407:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Bug)

> KubernetesClusterSchedulerBackend.stop should clean up K8s resources
> 
>
> Key: SPARK-34407
> URL: https://issues.apache.org/jira/browse/SPARK-34407
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources

2021-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281523#comment-17281523
 ] 

Apache Spark commented on SPARK-34407:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31533

> KubernetesClusterSchedulerBackend.stop should clean up K8s resources
> 
>
> Key: SPARK-34407
> URL: https://issues.apache.org/jira/browse/SPARK-34407
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources

2021-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34407:


Assignee: (was: Apache Spark)

> KubernetesClusterSchedulerBackend.stop should clean up K8s resources
> 
>
> Key: SPARK-34407
> URL: https://issues.apache.org/jira/browse/SPARK-34407
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources

2021-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34407:


Assignee: Apache Spark

> KubernetesClusterSchedulerBackend.stop should clean up K8s resources
> 
>
> Key: SPARK-34407
> URL: https://issues.apache.org/jira/browse/SPARK-34407
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources

2021-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281522#comment-17281522
 ] 

Apache Spark commented on SPARK-34407:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31533

> KubernetesClusterSchedulerBackend.stop should clean up K8s resources
> 
>
> Key: SPARK-34407
> URL: https://issues.apache.org/jira/browse/SPARK-34407
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value

2021-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281506#comment-17281506
 ] 

Apache Spark commented on SPARK-34405:
--

User '397090770' has created a pull request for this issue:
https://github.com/apache/spark/pull/31532

> The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
> --
>
> Key: SPARK-34405
> URL: https://issues.apache.org/jira/browse/SPARK-34405
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: iteblog
>Priority: Minor
>
> The mean value of timersLabels in the PrometheusServlet class is wrong, You 
> can look at line 105 of this class: 
> [L105.|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105]
> {code:java}
> // code placeholder
> sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code}






[jira] [Commented] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value

2021-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281505#comment-17281505
 ] 

Apache Spark commented on SPARK-34405:
--

User '397090770' has created a pull request for this issue:
https://github.com/apache/spark/pull/31532

> The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
> --
>
> Key: SPARK-34405
> URL: https://issues.apache.org/jira/browse/SPARK-34405
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: iteblog
>Priority: Minor
>
> The mean value of timersLabels in the PrometheusServlet class is wrong, You 
> can look at line 105 of this class: 
> [L105.|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105]
> {code:java}
> // code placeholder
> sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code}






[jira] [Assigned] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value

2021-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34405:


Assignee: Apache Spark

> The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
> --
>
> Key: SPARK-34405
> URL: https://issues.apache.org/jira/browse/SPARK-34405
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: iteblog
>Assignee: Apache Spark
>Priority: Minor
>
> The mean value of timersLabels in the PrometheusServlet class is wrong, You 
> can look at line 105 of this class: 
> [L105.|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105]
> {code:java}
> // code placeholder
> sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code}






[jira] [Assigned] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value

2021-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34405:


Assignee: (was: Apache Spark)

> The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
> --
>
> Key: SPARK-34405
> URL: https://issues.apache.org/jira/browse/SPARK-34405
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: iteblog
>Priority: Minor
>
> The mean value of timersLabels in the PrometheusServlet class is wrong, You 
> can look at line 105 of this class: 
> [L105.|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105]
> {code:java}
> // code placeholder
> sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code}






[jira] [Created] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure

2021-02-08 Thread hao (Jira)
hao created SPARK-34406:
---

 Summary: When we submit spark core tasks frequently, the submitted 
nodes will have a lot of resource pressure
 Key: SPARK-34406
 URL: https://issues.apache.org/jira/browse/SPARK-34406
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: hao


When we submit Spark Core jobs frequently, the submitting node comes under a lot of 
resource pressure, because Spark launches a separate process rather than a thread for 
each submission, and each of those processes consumes a significant amount of resources. 
When the submission QPS is very high, submissions fail due to insufficient resources. 
I would like to ask how to reduce the amount of resources consumed by Spark Core job 
submission.






[jira] [Created] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value

2021-02-08 Thread iteblog (Jira)
iteblog created SPARK-34405:
---

 Summary: The getMetricsSnapshot method of the PrometheusServlet 
class has a wrong value
 Key: SPARK-34405
 URL: https://issues.apache.org/jira/browse/SPARK-34405
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: iteblog


The Mean value emitted for timers by the PrometheusServlet class is wrong; see line 105 
of the class: 
[L105|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105]
{code:java}
// code placeholder
sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code}






[jira] [Updated] (SPARK-34352) Improve SQLQueryTestSuite so as could run on windows system

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34352:
-
Fix Version/s: (was: 3.1.2)
   3.2.0

> Improve SQLQueryTestSuite so as could run on windows system
> ---
>
> Key: SPARK-34352
> URL: https://issues.apache.org/jira/browse/SPARK-34352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> The current implementation of SQLQueryTestSuite cannot run on a Windows system, 
> because the code below fails on Windows:
> assume(TestUtils.testCommandAvailable("/bin/bash"))






[jira] [Assigned] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.

2021-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29465:


Assignee: Vishwas Nalka

> Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. 
> -
>
> Key: SPARK-29465
> URL: https://issues.apache.org/jira/browse/SPARK-29465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, YARN
>Affects Versions: 3.1.0
>Reporter: Vishwas Nalka
>Assignee: Vishwas Nalka
>Priority: Major
> Fix For: 3.1.0
>
>
>  I'm trying to restrict the ports used by spark app which is launched in yarn 
> cluster mode. All ports (viz. driver, executor, blockmanager) could be 
> specified using the respective properties except the ui port. The spark app 
> is launched using JAVA code and setting the property spark.ui.port in 
> sparkConf doesn't seem to help. Even setting a JVM option 
> -Dspark.ui.port="some_port" does not spawn the UI on the required port. 
> From the logs of the spark app, *_the property spark.ui.port is overridden 
> and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set 
> to 0. 
> _(Run in Spark 1.6.2) From the logs ->_
> _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
>  {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m 
> -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' 
> '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' 
> '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' 
> '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_
> _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port 
> 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ 
> [_http://10.65.170.98:35167_|http://10.65.170.98:35167/]
> Even a *spark-submit command with --conf spark.ui.port* does not 
> spawn the UI on the required port:
> {color:#172b4d}_(Run in Spark 2.4.4)_{color}
>  {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g 
> --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 
> --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color}
> _From the logs::_
>  _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at 
> [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_
> _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m 
> -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0'  'Dspark.driver.port=12340' 
> -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 
> --executor-id  --hostname  --cores 1 --app-id 
> application_1570992022035_0089 --user-class-path 
> [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_
>  
> It looks like the application master overrides this and sets a JVM property before 
> launch, resulting in a random UI port even though spark.ui.port is set by the 
> user.
> In these links
>  # 
> [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala]
>  (line 214)
>  # 
> [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala]
>  (line 75)
> I can see that the method _*run() in above files sets a system property 
> UI_PORT*_ and _*spark.ui.port respectively.*_
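Not a fix for pinning the port, but a small mitigation sketch (only {{uiWebUrl}} is 
standard SparkContext API here; the rest is illustrative): since the AM forces 
spark.ui.port=0 in cluster mode, the driver can at least report where the UI actually 
ended up binding.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// The UI port is ephemeral in YARN cluster mode, so read back the bound URL
// instead of assuming the configured port.
spark.sparkContext.uiWebUrl.foreach(url => println(s"Spark UI is at $url"))
{code}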






[jira] [Updated] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.

2021-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29465:
-
Priority: Minor  (was: Major)

> Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. 
> -
>
> Key: SPARK-29465
> URL: https://issues.apache.org/jira/browse/SPARK-29465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, YARN
>Affects Versions: 3.1.0
>Reporter: Vishwas Nalka
>Assignee: Vishwas Nalka
>Priority: Minor
> Fix For: 3.1.0
>
>
>  I'm trying to restrict the ports used by spark app which is launched in yarn 
> cluster mode. All ports (viz. driver, executor, blockmanager) could be 
> specified using the respective properties except the ui port. The spark app 
> is launched using JAVA code and setting the property spark.ui.port in 
> sparkConf doesn't seem to help. Even setting a JVM option 
> -Dspark.ui.port="some_port" does not spawn the UI on the required port. 
> From the logs of the spark app, *_the property spark.ui.port is overridden 
> and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set 
> to 0. 
> _(Run in Spark 1.6.2) From the logs ->_
> _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH"
>  {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m 
> -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' 
> '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' 
> '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' 
> '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_
> _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port 
> 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ 
> [_http://10.65.170.98:35167_|http://10.65.170.98:35167/]
> Even a *spark-submit command with --conf spark.ui.port* does not 
> spawn the UI on the required port:
> {color:#172b4d}_(Run in Spark 2.4.4)_{color}
>  {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g 
> --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 
> --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color}
> _From the logs::_
>  _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at 
> [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_
> _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m 
> -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0'  'Dspark.driver.port=12340' 
> -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 
> --executor-id  --hostname  --cores 1 --app-id 
> application_1570992022035_0089 --user-class-path 
> [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_
>  
> It looks like the application master overrides this and sets a JVM property before 
> launch, resulting in a random UI port even though spark.ui.port is set by the 
> user.
> In these links
>  # 
> [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala]
>  (line 214)
>  # 
> [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala]
>  (line 75)
> I can see that the method _*run() in above files sets a system property 
> UI_PORT*_ and _*spark.ui.port respectively.*_






[jira] [Assigned] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2021-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-33571:


Assignee: Maxim Gekk

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
> * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should 
> show the same values in Spark 3.0.1 (with, for example, `df.show()`) as they did 
> in Spark 2.4.5
> * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps 
> should show values in Spark 3.0.1 (with, for example, `df.show()`) that differ 
> from what they showed in Spark 2.4.5
> * When writing parquet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time, a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)]
>  the blog post linked there and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` which contain timestamps before 
> the above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` or `DateType` which contain 
> dates or timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.
> I've made some scripts to help with testing/show the behavior, it uses 
> pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here 
> [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the 
> outputs in a comment below as well.
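For reference, a minimal sketch of the comparison being described (assuming the Spark 3.0 
config names spark.sql.legacy.parquet.datetimeRebaseModeInRead / ...InWrite; the path is 
a placeholder, not taken from the report):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val path = "/data/written-by-spark-2.4.5"  // placeholder

// Expected: LEGACY rebases old dates/timestamps, CORRECTED does not, so the two
// reads should differ for values before 1582-10-15 / 1900-01-01T00:00:00Z.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
spark.read.parquet(path).show(truncate = false)

spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.read.parquet(path).show(truncate = false)

// The report: for TimestampType both settings produce identical output.
{code}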






[jira] [Assigned] (SPARK-34395) Clean up unused code for code simplifications

2021-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-34395:


Assignee: yikf

> Clean up unused code for code simplifications
> -
>
> Key: SPARK-34395
> URL: https://issues.apache.org/jira/browse/SPARK-34395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.2.0
>Reporter: yikf
>Assignee: yikf
>Priority: Trivial
>
> Currently, we pass the default value `EmptyRow` to method `checkEvaluation` 
> in the StringExpressionsSuite, but the default value of the 'checkEvaluation' 
> method parameter is the `emptyRow`.
> We can clean the parameter for Code Simplifications.
>  
> example:
> *before:*
> {code:java}
> def testConcat(inputs: String*): Unit = {
>   val expected = if (inputs.contains(null)) null else inputs.mkString
>   checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), 
> expected, EmptyRow)
> }{code}
> *after:*
> {code:java}
> def testConcat(inputs: String*): Unit = {
>   val expected = if (inputs.contains(null)) null else inputs.mkString
>   checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), expected)
> }{code}






[jira] [Resolved] (SPARK-34395) Clean up unused code for code simplifications

2021-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34395.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31510
[https://github.com/apache/spark/pull/31510]

> Clean up unused code for code simplifications
> -
>
> Key: SPARK-34395
> URL: https://issues.apache.org/jira/browse/SPARK-34395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.2.0
>Reporter: yikf
>Assignee: yikf
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Currently, we pass the default value `EmptyRow` to method `checkEvaluation` 
> in the StringExpressionsSuite, but the default value of the 'checkEvaluation' 
> method parameter is the `emptyRow`.
> We can clean the parameter for Code Simplifications.
>  
> example:
> *before:*
> {code:java}
> def testConcat(inputs: String*): Unit = {
>   val expected = if (inputs.contains(null)) null else inputs.mkString
>   checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), 
> expected, EmptyRow)
> }{code}
> *after:*
> {code:java}
> def testConcat(inputs: String*): Unit = {
>   val expected = if (inputs.contains(null)) null else inputs.mkString
>   checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), expected)
> }{code}






[jira] [Resolved] (SPARK-34352) Improve SQLQueryTestSuite so as could run on windows system

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34352.
--
Fix Version/s: 3.1.2
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/31466

> Improve SQLQueryTestSuite so as could run on windows system
> ---
>
> Key: SPARK-34352
> URL: https://issues.apache.org/jira/browse/SPARK-34352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
> Fix For: 3.1.2
>
>
> The current implementation of SQLQueryTestSuite cannot run on a Windows system, 
> because the code below fails on Windows:
> assume(TestUtils.testCommandAvailable("/bin/bash"))
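Not necessarily what https://github.com/apache/spark/pull/31466 does, just a sketch of 
the kind of OS-aware guard that avoids hard-failing on a missing /bin/bash:

{code:java}
import java.util.Locale
import org.apache.spark.TestUtils

// Sketch: only require /bin/bash on non-Windows hosts, so the rest of the
// suite can still run (or be skipped explicitly) on Windows.
val onWindows = System.getProperty("os.name").toLowerCase(Locale.ROOT).contains("windows")
if (!onWindows) {
  assume(TestUtils.testCommandAvailable("/bin/bash"))
}
{code}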






[jira] [Assigned] (SPARK-33566) Incorrectly Parsing CSV file

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33566:


Assignee: Yang Jie

> Incorrectly Parsing CSV file
> 
>
> Key: SPARK-33566
> URL: https://issues.apache.org/jira/browse/SPARK-33566
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Stephen More
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.1.0
>
>
> Here is a test case: 
> [https://github.com/mores/maven-examples/blob/master/comma/src/test/java/org/test/CommaTest.java]
> It shows how I believe apache commons csv and opencsv correctly parses the 
> sample csv file.
> spark is not correctly parsing the sample csv file.






[jira] [Assigned] (SPARK-33565) python/run-tests.py calling python3.8

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33565:


Assignee: Shane Knapp  (was: Apache Spark)

> python/run-tests.py calling python3.8
> -
>
> Key: SPARK-33565
> URL: https://issues.apache.org/jira/browse/SPARK-33565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
> Fix For: 3.1.0
>
>
> this line in run-tests.py on master:
> |python_execs = [x for x in ["python3.6", "python3.8", "pypy3"] if which(x)]|
>  
> and this line in branch-3.0:
> python_execs = [x for x in ["python3.8", "python2.7", "pypy3", "pypy"] if 
> which(x)]
> ...are currently breaking builds on the new ubuntu 20.04LTS workers.
> the default  system python is /usr/bin/python3.8 and we do NOT have a working 
> python3.8 anaconda deployment yet.  this is causing python test breakages.
> PRs incoming
>  






[jira] [Assigned] (SPARK-33376) Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33376:


Assignee: Chao Sun  (was: Apache Spark)

> Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader
> ---
>
> Key: SPARK-33376
> URL: https://issues.apache.org/jira/browse/SPARK-33376
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, when initializing {{IsolatedClientLoader}}, people can specify 
> whether to share Hadoop classes from Spark or not. In the latter case it is 
> supposed to load the Hadoop classes only from the Hive jars themselves.
> However this feature is currently used in two cases: 1) unit tests, 2) when 
> the Hadoop version defined in Maven can not be found when 
> {{spark.sql.hive.metastore.jars == "maven"}}. Also, when 
> {{sharesHadoopClasses}} is false, it isn't really using only Hadoop classes 
> from the Hive jars: Spark also downloads the {{hadoop-client}} jar and puts it 
> together with the Hive jars, and the Hadoop version used by {{hadoop-client}} is the 
> same version used by Spark itself. This could potentially cause issues 
> because we are mixing two versions of Hadoop jars in the classpath.






[jira] [Assigned] (SPARK-33303) Deduplicate deterministic PythonUDF calls

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33303:


Assignee: Peter Toth  (was: Apache Spark)

> Deduplicate deterministic PythonUDF calls
> -
>
> Key: SPARK-33303
> URL: https://issues.apache.org/jira/browse/SPARK-33303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.1.0
>
>
> We ran into an issue where a customer created a column with an expensive 
> PythonUDF call and built very complex logic on top of that column 
> as new derived columns. Due to `CollapseProject` and `ExtractPythonUDFs` 
> rules the UDF is called ~1000 times for each row which degraded the 
> performance of the query significantly.
> The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs so as to 
> avoid performance degradation.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32215:


Assignee: Devesh Agrawal

> Expose end point on Master so that it can be informed about decommissioned 
> workers out of band
> --
>
> Key: SPARK-32215
> URL: https://issues.apache.org/jira/browse/SPARK-32215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
> Environment: Standalone Scheduler 
>Reporter: Devesh Agrawal
>Assignee: Devesh Agrawal
>Priority: Major
> Fix For: 3.1.0
>
>
> The use case here is to allow some external entity that has made a 
> decommissioning decision to inform the Master (in the case of the Standalone 
> scheduling mode).
> The current decommissioning is triggered by the Worker getting a SIGPWR
>  (out of band possibly by some cleanup hook), which then informs the Master
>  about it. This approach may not be feasible in some environments that cannot
>  trigger a clean up hook on the Worker.
> Add a new post endpoint {{/workers/kill}} on the MasterWebUI that allows an
>  external agent to inform the master about all the nodes being decommissioned 
> in
>  bulk. The workers are identified by either their {{host:port}} or just the 
> host
>  – in which case all workers on the host would be decommissioned.
> This API is merely a new entry point into the existing decommissioning
>  logic. It does not change how the decommissioning request is handled in
>  its core.
> The path /workers/kill is so chosen to be consistent with the other endpoint 
> names on the MasterWebUI. 
> Since this is a sensitive operation, this API will be disabled by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33297) Intermittent Compilation failure In GitHub Actions after SBT upgrade

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33297:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Intermittent Compilation failure In GitHub Actions after SBT upgrade
> 
>
> Key: SPARK-33297
> URL: https://issues.apache.org/jira/browse/SPARK-33297
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> https://github.com/apache/spark/runs/1314691686
> {code}
> Error:  java.util.MissingResourceException: Can't find bundle for base name 
> org.scalactic.ScalacticBundle, locale en
> Error:at 
> java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1581)
> Error:at 
> java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1396)
> Error:at java.util.ResourceBundle.getBundle(ResourceBundle.java:782)
> Error:at 
> org.scalactic.Resources$.resourceBundle$lzycompute(Resources.scala:8)
> Error:at org.scalactic.Resources$.resourceBundle(Resources.scala:8)
> Error:at 
> org.scalactic.Resources$.pleaseDefineScalacticFillFilePathnameEnvVar(Resources.scala:256)
> Error:at 
> org.scalactic.source.PositionMacro$PositionMacroImpl.apply(PositionMacro.scala:65)
> Error:at 
> org.scalactic.source.PositionMacro$.genPosition(PositionMacro.scala:85)
> Error:at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
> Error:at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Error:at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32450) Upgrade pycodestyle to 2.6.0

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32450:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Upgrade pycodestyle to 2.6.0
> 
>
> Key: SPARK-32450
> URL: https://issues.apache.org/jira/browse/SPARK-32450
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Trivial
> Fix For: 3.1.0
>
>
> Upgrade pycodestyle to 2.6.0 to include bug fixes and newer Python version 
> support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32613:


Assignee: Devesh Agrawal

> DecommissionWorkerSuite has started failing sporadically again
> --
>
> Key: SPARK-32613
> URL: https://issues.apache.org/jira/browse/SPARK-32613
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Devesh Agrawal
>Assignee: Devesh Agrawal
>Priority: Major
> Fix For: 3.1.0
>
>
> Test "decommission workers ensure that fetch failures lead to rerun" is 
> failing: 
>  
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/]
> https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33190) Set upperbound of PyArrow version in GitHub Actions

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33190:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Set upperbound of PyArrow version in GitHub Actions
> ---
>
> Key: SPARK-33190
> URL: https://issues.apache.org/jira/browse/SPARK-33190
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> See SPARK-33189. Some tests appear to fail with PyArrow 2.0.0+. We should 
> make those tests pass.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31830) Consistent error handling for datetime formatting functions

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31830:


Assignee: Kent Yao  (was: Apache Spark)

> Consistent error handling for datetime formatting functions
> ---
>
> Key: SPARK-31830
> URL: https://issues.apache.org/jira/browse/SPARK-31830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.0
>
>
> date_format and from_unixtime have different error handling behavior for 
> formatting datetime values.
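> As a rough spark-shell illustration (assuming an active session named {{spark}}; the exact error behavior for a bad pattern depends on the Spark version and the time parser policy, so none is asserted here), these are the two functions being compared:
> {code:scala}
> // Both functions format datetime-ish values with a user-supplied pattern;
> // the ticket is about them reporting invalid patterns differently.
> spark.sql("SELECT date_format(timestamp'2020-01-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')").show()
> spark.sql("SELECT from_unixtime(1577836800, 'yyyy-MM-dd HH:mm:ss')").show()
> {code}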



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32074) Update AppVeyor R to 4.0.2

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32074:


Assignee: Hyukjin Kwon

> Update AppVeyor R to 4.0.2
> --
>
> Key: SPARK-32074
> URL: https://issues.apache.org/jira/browse/SPARK-32074
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> We should test R 4.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32282) Improve EnsureRquirement.reorderJoinKeys to handle more scenarios such as PartitioningCollection

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32282:


Assignee: Terry Kim  (was: Apache Spark)

> Improve EnsureRquirement.reorderJoinKeys to handle more scenarios such as 
> PartitioningCollection
> 
>
> Key: SPARK-32282
> URL: https://issues.apache.org/jira/browse/SPARK-32282
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.1.0
>
>
> The EnsureRquirement.reorderJoinKeys can be improved to handle the following 
> scenarios:
> # If the keys cannot be reordered to match the left-side HashPartitioning, 
> consider the right-side HashPartitioning.
> # Handle PartitioningCollection, which may contain HashPartitioning



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30880) Delete Sphinx Makefile cruft

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30880:


Assignee: Nicholas Chammas

> Delete Sphinx Makefile cruft
> 
>
> Key: SPARK-30880
> URL: https://issues.apache.org/jira/browse/SPARK-30880
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31598) LegacySimpleTimestampFormatter incorrectly interprets pre-Gregorian timestamps

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31598:


Assignee: Bruce Robbins

> LegacySimpleTimestampFormatter incorrectly interprets pre-Gregorian timestamps
> --
>
> Key: SPARK-31598
> URL: https://issues.apache.org/jira/browse/SPARK-31598
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.0.0, 3.1.0
>
>
> As per discussion with [~maxgekk]:
> {{LegacySimpleTimestampFormatter#parse}} misinterprets pre-Gregorian 
> timestamps:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = Seq("0002-01-01 00:00:00", "1000-01-01 00:00:00", 
> "1800-01-01 00:00:00").toDF("expected")
> df1: org.apache.spark.sql.DataFrame = [expected: string]
> scala> val df2 = df1.select('expected, to_timestamp('expected, "-MM-dd 
> HH:mm:ss").as("actual"))
> df2: org.apache.spark.sql.DataFrame = [expected: string, actual: timestamp]
> scala> df2.show(truncate=false)
> +---+---+
> |expected   |actual |
> +---+---+
> |0002-01-01 00:00:00|0001-12-30 00:00:00|
> |1000-01-01 00:00:00|1000-01-06 00:00:00|
> |1800-01-01 00:00:00|1800-01-01 00:00:00|
> +---+---+
> scala> 
> {noformat}
> Legacy timestamp parsing with JSON and CSV files is correct, so apparently 
> {{LegacyFastTimestampFormatter}} does not have this issue (need to double 
> check).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31000) Add ability to set table description in the catalog

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31000:


Assignee: Nicholas Chammas

> Add ability to set table description in the catalog
> ---
>
> Key: SPARK-31000
> URL: https://issues.apache.org/jira/browse/SPARK-31000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.1.0
>
>
> It seems that the catalog supports a {{description}} attribute on tables.
> https://github.com/apache/spark/blob/86cc907448f0102ad0c185e87fcc897d0a32707f/sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala#L68
> However, the {{createTable()}} interface doesn't provide any way to set that 
> attribute.
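> As a hedged sketch against the public Catalog API (the table name and schema below are made up for illustration): the description can be read back from the catalog listing, but {{createTable()}} offers no parameter to set it.
> {code:scala}
> import org.apache.spark.sql.types._
>
> // createTable() takes a name, source, schema and options -- but no description.
> spark.catalog.createTable(
>   "t_demo", "parquet",
>   new StructType().add("id", LongType),
>   Map.empty[String, String])
>
> // The description column exists in the listing, it just cannot be set above.
> spark.catalog.listTables().select("name", "description").show(truncate = false)
> {code}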



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31517:


Assignee: Michael Chirico

> SparkR::orderBy with multiple columns descending produces error
> ---
>
> Key: SPARK-31517
> URL: https://issues.apache.org/jira/browse/SPARK-31517
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.5
> Environment: Databricks Runtime 6.5
>Reporter: Ross Bowen
>Assignee: Michael Chirico
>Priority: Major
> Fix For: 3.1.0, 3.2.0
>
>
> When specifying two columns in descending order within an `orderBy()` call, an 
> error is returned.
> {code:java}
> library(magrittr) 
> library(SparkR) 
> cars <- cbind(model = rownames(mtcars), mtcars) 
> carsDF <- createDataFrame(cars) 
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> desc(column("mpg")), desc(column("disp"))))) %>% 
>   head() {code}
> This returns an error:
> {code:java}
>  Error in ns[[i]] : subscript out of bounds{code}
> This seems to be related to the more general issue that the following code, 
> which does not use the `desc()` function, also fails:
> {code:java}
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> column("mpg"), column("disp")))) %>% 
>   head(){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30889) Add version information to the configuration of Worker

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30889:


Assignee: jiaan.geng

> Add version information to the configuration of Worker
> --
>
> Key: SPARK-30889
> URL: https://issues.apache.org/jira/browse/SPARK-30889
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0, 3.1.0
>
>
> core/src/main/scala/org/apache/spark/internal/config/Worker.scala
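> As an illustrative sketch of the intended outcome (ConfigBuilder is Spark-internal; the entry and release number below are made up for illustration, not taken from this ticket):
> {code:scala}
> package org.apache.spark.internal.config  // ConfigBuilder is private[spark]
>
> object WorkerConfigSketch {
>   // .version() records the release in which the configuration was added.
>   val WORKER_TIMEOUT = ConfigBuilder("spark.worker.timeout")
>     .version("0.6.2")  // illustrative release number
>     .longConf
>     .createWithDefault(60)
> }
> {code}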



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30856) SQLContext retains reference to unusable instance after SparkContext restarted

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30856:


Assignee: Alex Favaro

> SQLContext retains reference to unusable instance after SparkContext restarted
> --
>
> Key: SPARK-30856
> URL: https://issues.apache.org/jira/browse/SPARK-30856
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.5
>Reporter: Alex Favaro
>Assignee: Alex Favaro
>Priority: Major
> Fix For: 3.1.0
>
>
> When the underlying SQLContext is instantiated for a SparkSession, the 
> instance is saved as a class attribute and returned from subsequent calls to 
> SQLContext.getOrCreate(). If the SparkContext is stopped and a new one 
> started, the SQLContext class attribute is never cleared so any code which 
> calls SQLContext.getOrCreate() will get a SQLContext with a reference to the 
> old, unusable SparkContext.
> A similar issue was identified and fixed for SparkSession in SPARK-19055, but 
> the fix did not cover SQLContext as well. I ran into this because MLlib 
> still 
> [uses|https://github.com/apache/spark/blob/master/python/pyspark/mllib/common.py#L105]
>  SQLContext.getOrCreate() under the hood.
> I've already written a fix for this, which I'll be sharing in a PR, that 
> clears the class attribute on SQLContext when the SparkSession is stopped. 
> Another option would be to deprecate SQLContext.getOrCreate() entirely since 
> the corresponding Scala 
> [method|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SQLContext.html#getOrCreate-org.apache.spark.SparkContext-]
>  is itself deprecated. That seems like a larger change for a relatively minor 
> issue, however.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30988) Add more edge-case exercising values to stats tests

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30988:


Assignee: Maxim Gekk

> Add more edge-case exercising values to stats tests
> ---
>
> Key: SPARK-30988
> URL: https://issues.apache.org/jira/browse/SPARK-30988
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Add more edge-cases to StatisticsCollectionTestBase



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30839) Add version information for Spark configuration

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30839:


Assignee: jiaan.geng

> Add version information for Spark configuration
> ---
>
> Key: SPARK-30839
> URL: https://issues.apache.org/jira/browse/SPARK-30839
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, DStreams, Kubernetes, Mesos, Spark Core, 
> SQL, Structured Streaming, YARN
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0, 3.1.0
>
>
> Spark ConfigEntry and ConfigBuilder are missing the Spark version in which each 
> configuration was added. This is not good for Spark users when they visit the 
> Spark configuration page:
> http://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30821) Executor pods with multiple containers will not be rescheduled unless all containers fail

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30821:


Assignee: Shiqi Sun  (was: Apache Spark)

> Executor pods with multiple containers will not be rescheduled unless all 
> containers fail
> -
>
> Key: SPARK-30821
> URL: https://issues.apache.org/jira/browse/SPARK-30821
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Kevin Hogeland
>Assignee: Shiqi Sun
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> Since the restart policy of launched pods is Never, additional handling is 
> required for pods that may have sidecar containers. The executor should be 
> considered failed if any containers have terminated and have a non-zero exit 
> code, but Spark currently only checks the pod phase. The pod phase will 
> remain "running" as long as _any_ containers are still running. Kubernetes sidecar 
> support in 1.18/1.19 does not address this situation, as sidecar containers 
> are excluded from the pod phase calculation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30795) Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30795:


Assignee: Kris Mok

> Spark SQL codegen's code() interpolator should treat escapes like Scala's 
> StringContext.s()
> ---
>
> Key: SPARK-30795
> URL: https://issues.apache.org/jira/browse/SPARK-30795
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 3.1.0
>
>
> The {{code()}} string interpolator in Spark SQL's code generator should treat 
> escapes like Scala's builtin {{StringContext.s()}} interpolator, i.e. it 
> should treat escapes in the code parts, and should not treat escapes in the 
> input arguments.
> For example,
> {code}
> val arg = "This is an argument."
> val str = s"This is string part 1. $arg This is string part 2."
> val code = code"This is string part 1. $arg This is string part 2."
> assert(code.toString == str)
> {code}
> We should expect the {{code()}} interpolator produce the same thing as the 
> {{StringContext.s()}} interpolator, where only escapes in the string parts 
> should be treated, while the args should be kept verbatim.
> But in the current implementation, due to the eager folding of code parts and 
> literal input args, the escape treatment is incorrectly done on both code 
> parts and literal args.
> That causes a problem when an arg contains escape sequences and wants to 
> preserve that in the final produced code string. For example, in {{Like}} 
> expression's codegen, there's an ugly workaround for this bug:
> {code}
>   // We need double escape to avoid 
> org.codehaus.commons.compiler.CompileException.
>   // '\\' will cause exception 'Single quote must be backslash-escaped in 
> character literal'.
>   // '\"' will cause exception 'Line break in literal not allowed'.
>   val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') {
> s"""\\$escapeChar"""
>   } else {
> escapeChar
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30733) Fix SparkR tests per testthat and R version upgrade

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30733:


Assignee: Hyukjin Kwon

> Fix SparkR tests per testthat and R version upgrade
> ---
>
> Key: SPARK-30733
> URL: https://issues.apache.org/jira/browse/SPARK-30733
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, SQL
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 2.4.6, 3.0.0, 3.1.0
>
>
> 5 SparkR tests seem being failed after upgrading testthat 2.0.0 and R 3.5.x
> {code}
> test_context.R:49: failure: Check masked functions
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 6 - 4 == 2
> test_context.R:53: failure: Check masked functions
> sort(maskedCompletely, na.last = TRUE) not equal to 
> sort(namesOfMaskedCompletely, na.last = TRUE).
> 5/6 mismatches
> x[2]: "endsWith"
> y[2]: "filter"
> x[3]: "filter"
> y[3]: "not"
> x[4]: "not"
> y[4]: "sample"
> x[5]: "sample"
> y[5]: NA
> x[6]: "startsWith"
> y[6]: NA
> {code}
> {code}
> test_includePackage.R:31: error: include inside function
> package or namespace load failed for 'plyr':
>  package 'plyr' was installed by an R version with different internals; 
> it needs to be reinstalled for use with this R version
> Seems it's a package installation issue. Looks like plyr has to be 
> re-installed.
> {code}
> {code}
> test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> Your system is mis-configured: '/etc/localtime' is not a symlink
> test_sparkSQL.R:504: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> Your system is mis-configured: '/etc/localtime' is not a symlink
> {code}
> {code}
> test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> It is strongly recommended to set envionment variable TZ to 
> 'America/Los_Angeles' (or equivalent)
> test_sparkSQL.R:504: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> It is strongly recommended to set envionment variable TZ to 
> 'America/Los_Angeles' (or equivalent
> {code}
> {code}
> test_sparkSQL.R:1814: error: string operators
> unable to find an inherited method for function 'startsWith' for 
> signature '"character"'
> 1: expect_true(startsWith("Hello World", "Hello")) at 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/R/pkg/tests/fulltests/test_sparkSQL.R:1814
> 2: quasi_label(enquo(object), label)
> 3: eval_bare(get_expr(quo), get_env(quo))
> 4: startsWith("Hello World", "Hello")
> 5: (function (classes, fdef, mtable) 
>{
>methods <- .findInheritedMethods(classes, fdef, mtable)
>if (length(methods) == 1L) 
>return(methods[[1L]])
>else if (length(methods) == 0L) {
>cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", 
> collapse = ", ")
>stop(gettextf("unable to find an inherited method for function %s 
> for signature %s", 
>sQuote(fdef@generic), sQuote(cnames)), domain = NA)
>}
>else stop("Internal error in finding inherited methods; didn't return 
> a unique method", 
>domain = NA)
>})(list("character"), new("nonstandardGenericFunction", .Data = function 
> (x, prefix) 
>{
>standardGeneric("startsWith")
>}, generic = structure("startsWith", package = "SparkR"), package = 
> "SparkR", group = list(), 
>valueClass = character(0), signature = c("x", "prefix"), default = 
> NULL, skeleton = (function (x, 
>prefix) 
>stop("invalid call in method dispatch to 'startsWith' (no default 
> method)", domain = NA))(x, 
>prefix)), )
> 6: stop(gettextf("unable to find an inherited method for function %s for 
> signature %s", 
>sQuote(fdef@generic), sQuote(cnames)), domain = NA)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28646) Allow usage of `count` only for parameterless aggregate function

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28646:


Assignee: jiaan.geng  (was: Apache Spark)

> Allow usage of `count` only for parameterless aggregate function
> 
>
> Key: SPARK-28646
> URL: https://issues.apache.org/jira/browse/SPARK-28646
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dylan Guedes
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, Spark allows `count` to be called with no arguments even though it is 
> not a parameterless aggregate function. For example, the following query actually works:
> {code:sql}SELECT count() OVER () FROM tenk1;{code}
> In PgSQL, on the other hand, the following error is thrown:
> {code:sql}ERROR:  count(*) must be used to call a parameterless aggregate 
> function{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27497:


Assignee: Bruce Robbins

> Spark wipes out bucket spec in metastore when updating table stats
> --
>
> Key: SPARK-27497
> URL: https://issues.apache.org/jira/browse/SPARK-27497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 2.4.6, 3.1.0
>
>
> The bucket spec gets wiped out after Spark writes to a Hive-bucketed table 
> that has the following characteristics:
>  - table is created by Hive (or even Spark, if you use HQL DDL)
>  - table is stored in Parquet format
>  - table has at least one Hive-created data file already
> Also, spark.sql.hive.convertMetastoreParquet has to be set to true (the 
> default).
> For example, do the following in Hive:
> {noformat}
> hive> create table sourcetable as select 1 a, 3 b, 7 c;
> hive> drop table hivebucket1;
> hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) 
> sorted by (a, b asc) into 10 buckets stored as parquet;
> hive> insert into hivebucket1 select * from sourcetable;
> hive> show create table hivebucket1;
> OK
> CREATE TABLE `hivebucket1`(
>   `a` int, 
>   `b` int, 
>   `c` int)
> CLUSTERED BY ( 
>   a, 
>   b) 
> SORTED BY ( 
>   a ASC, 
>   b ASC) 
> INTO 10 BUCKETS
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
> TBLPROPERTIES (
>   'COLUMN_STATS_ACCURATE'='true', 
>   'numFiles'='1', 
>   'numRows'='1', 
>   'rawDataSize'='3', 
>   'totalSize'='352', 
>   'transient_lastDdlTime'='142971')
> Time taken: 0.056 seconds, Fetched: 26 row(s)
> hive> 
> {noformat}
> Then in spark-shell, do the following:
> {noformat}
> scala> sql("insert into hivebucket1 select 1, 3, 7")
> 19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> res0: org.apache.spark.sql.DataFrame = []
> {noformat}
> Note: At this point, I would have expected Spark to throw an 
> {{AnalysisException}} with the message "Output Hive table 
> `default`.`hivebucket1` is bucketed...". However, I am ignoring that for now 
> and may open a separate Jira (SPARK-27498).
> Return to some Hive CLI and note that the bucket specification is gone from 
> the table definition:
> {noformat}
> hive> show create table hivebucket1;
> OK
> CREATE TABLE `hivebucket1`(
>   `a` int, 
>   `b` int, 
>   `c` int)
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   ''
> TBLPROPERTIES (
>   'COLUMN_STATS_ACCURATE'='false', 
>   'SORTBUCKETCOLSPREFIX'='TRUE', 
>   'numFiles'='2', 
>   'numRows'='-1', 
>   'rawDataSize'='-1', 
>   'totalSize'='1144', 
>   'transient_lastDdlTime'='123374')
> Time taken: 1.619 seconds, Fetched: 20 row(s)
> hive> 
> {noformat}
> This information is lost when Spark attempts to update table stats. 
> HiveClientImpl.toHiveTable drops the bucket specification. toHiveTable drops 
> the bucket information because {{table.provider}} is None instead of "hive". 
> {{table.provider}} is not "hive" because Spark bypassed the serdes and used 
> the built-in parquet code path (by default, 
> spark.sql.hive.convertMetastoreParquet is true).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28367:


Assignee: Gabor Somogyi

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Critical
> Fix For: 3.1.0
>
>
> Spark uses an old and deprecated API, poll(long), which never returns and stays 
> in a live lock if metadata is not updated (for instance, when the broker 
> disappears at consumer creation).
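> A minimal standalone consumer sketch (not Spark's connector code; the broker address and topic below are placeholders) contrasting the deprecated overload with the Duration one:
> {code:scala}
> import java.time.Duration
> import java.util.{Collections, Properties}
> import org.apache.kafka.clients.consumer.KafkaConsumer
>
> val props = new Properties()
> props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
> props.put("group.id", "demo")
> props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
> props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
>
> val consumer = new KafkaConsumer[String, String](props)
> consumer.subscribe(Collections.singletonList("some-topic"))  // placeholder topic
>
> // Deprecated: poll(long) can block indefinitely while waiting for metadata.
> // val records = consumer.poll(1000L)
>
> // poll(Duration) bounds the total wait, including metadata updates.
> val records = consumer.poll(Duration.ofSeconds(1))
> println(records.count())
> consumer.close()
> {code}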



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33670) Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33670:


Assignee: Maxim Gekk  (was: Apache Spark)

> Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED
> ---
>
> Key: SPARK-33670
> URL: https://issues.apache.org/jira/browse/SPARK-33670
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Invoke the check verifyPartitionProviderIsHive() from v1 implementation of 
> SHOW TABLE EXTENDED.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33697) UnionExec should require column ordering in RemoveRedundantProjects

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33697:


Assignee: Allison Wang  (was: Apache Spark)

> UnionExec should require column ordering in RemoveRedundantProjects
> ---
>
> Key: SPARK-33697
> URL: https://issues.apache.org/jira/browse/SPARK-33697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> UnionExec requires its children's columns to have the same order in order to 
> merge the columns. Currently, the physical rule `RemoveRedundantProjects` can 
> pass through the ordering requirements from its parent and incorrectly remove 
> the necessary project nodes below a union operation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33803) Sort table properties by key in DESCRIBE TABLE command

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33803:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Sort table properties by key in DESCRIBE TABLE command
> --
>
> Key: SPARK-33803
> URL: https://issues.apache.org/jira/browse/SPARK-33803
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently:
> {code}
> -- !query
> DESC FORMATTED v
> -- !query schema
> struct
> -- !query output
> a string  
> b int 
> c string  
> d string  
>   
> # Detailed Table Information  
> Database  default 
> Table v   
> Created Time [not included in comparison]
> Last Access [not included in comparison]
> Created By [not included in comparison]
> Type  VIEW
> View Text SELECT * FROM t 
> View Original TextSELECT * FROM t 
> View Catalog and Namespacespark_catalog.default   
> View Query Output Columns [a, b, c, d]
> Table Properties  [view.catalogAndNamespace.numParts=2, 
> view.catalogAndNamespace.part.0=spark_catalog, 
> view.catalogAndNamespace.part.1=default, view.query.out.col.0=a, 
> view.query.out.col.1=b, view.query.out.col.2=c, view.query.out.col.3=d, 
> view.query.out.numCols=4, view.referredTempFunctionsNames=[], 
> view.referredTempViewNames=[]]
> {code}
> The order of "Table Properties" is indeterministic which makes the test above 
> fails in other environments. It should be best to sort it by key. This is 
> consistent with DSv2 command as well.
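> A small sketch of the idea (the property map below is made up; this is not the actual DESCRIBE TABLE code): sorting entries by key makes the rendered string deterministic.
> {code:scala}
> // Hypothetical property map, like the one rendered above.
> val props = Map(
>   "view.query.out.numCols" -> "4",
>   "view.catalogAndNamespace.numParts" -> "2",
>   "view.query.out.col.0" -> "a")
>
> // Sort entries by key before rendering "[k1=v1, k2=v2, ...]".
> val rendered = props.toSeq.sortBy(_._1)
>   .map { case (k, v) => s"$k=$v" }
>   .mkString("[", ", ", "]")
> println(rendered)
> {code}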



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33901) Char and Varchar display error after DDLs

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33901:


Assignee: Kent Yao  (was: Apache Spark)

> Char and Varchar display error after DDLs
> -
>
> Key: SPARK-33901
> URL: https://issues.apache.org/jira/browse/SPARK-33901
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.0
>
>
> CTAS / CREATE TABLE LIKE / CVAS / ALTER TABLE ADD COLUMNS



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33907) Only prune columns of from_json if parsing options is empty

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33907:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Only prune columns of from_json if parsing options is empty
> ---
>
> Key: SPARK-33907
> URL: https://issues.apache.org/jira/browse/SPARK-33907
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> For safety, we should only prune columns from the from_json expression if the 
> parsing options are empty.
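> A hedged spark-shell sketch of the kind of query involved (the DataFrame and the option below are made up; the point is only the pruning guard described above):
> {code:scala}
> import org.apache.spark.sql.functions.from_json
> import org.apache.spark.sql.types._
> import spark.implicits._  // assumes an existing SparkSession named `spark`
>
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")
>
> // With a non-empty options map, the parsed struct keeps all fields; only with
> // empty options is it considered safe to prune the unused ones.
> df.select(from_json($"json", schema, Map("timestampFormat" -> "yyyy-MM-dd")).as("s"))
>   .select($"s.a")
>   .explain(true)
> {code}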



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34181) Update build doc help document

2021-02-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34181:


Assignee: angerszhu

> Update build doc help document
> --
>
> Key: SPARK-34181
> URL: https://issues.apache.org/jira/browse/SPARK-34181
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.2, 3.1.1
>
>
> According to https://github.com/jekyll/jekyll/issues/8523



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25872) Add an optimizer tracker for TPC-DS queries

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-25872:
---

Assignee: wuyi  (was: Apache Spark)

> Add an optimizer tracker for TPC-DS queries
> ---
>
> Key: SPARK-25872
> URL: https://issues.apache.org/jira/browse/SPARK-25872
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Used to track the optimized plans of all TPC-DS queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281478#comment-17281478
 ] 

Apache Spark commented on SPARK-34080:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31531

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Assignee: Huaxin Gao
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share a 
> common set of params. They are named after the underlying statistical test, which 
> requires users to understand the test to find the matching scenarios. It would 
> be nice if we introduced a single class called UnivariateFeatureSelector that 
> accepts a selection criterion and a score method (string names). Then we can 
> deprecate all other univariate selectors.
> For the params, instead of asking users to provide the score function to use, 
> it is more friendly to ask users to specify the feature and label types 
> (continuous or categorical) and set a default score function for each 
> combo. We can also detect the types from feature metadata if given. Advanced 
> users can override it (if there are multiple score functions that are 
> compatible with the feature type and label type combo). Example (param names 
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32736) Avoid caching the removed decommissioned executors in TaskSchedulerImpl

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-32736:
---

Assignee: wuyi  (was: Apache Spark)

> Avoid caching the removed decommissioned executors in TaskSchedulerImpl
> ---
>
> Key: SPARK-32736
> URL: https://issues.apache.org/jira/browse/SPARK-32736
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> We can save the host directly in the ExecutorDecommissionState. Therefore, when 
> the executor is lost, we can unregister the shuffle map status on that host. 
> Thus, we don't need to hold the cache and wait for a FetchFailedException to do 
> the unregistration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32689) HiveSerDeReadWriteSuite and ScriptTransformationSuite are currently failed under hive1.2 profile in branch-3.0 and master

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-32689:
---

Assignee: L. C. Hsieh

> HiveSerDeReadWriteSuite and ScriptTransformationSuite are currently failed 
> under hive1.2 profile in branch-3.0 and master
> -
>
> Key: SPARK-32689
> URL: https://issues.apache.org/jira/browse/SPARK-32689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.1, 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> There are three tests that currently fail under the hive-1.2 profile in the 
> branch-3.0 and master branches:
> org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.Read/Write Hive 
> PARQUET serde table
> org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.Read/Write Hive 
> TEXTFILE serde table
> org.apache.spark.sql.hive.execution.ScriptTransformationSuite.SPARK-32608: 
> Script Transform ROW FORMAT DELIMIT value should format value
> Please see [https://github.com/apache/spark/pull/29517].
> This test fails under the hive-1.2 profile in the master branch:
> org.apache.spark.sql.hive.orc.HiveOrcHadoopFsRelationSuite.save()/load() - 
> partitioned table - simple queries - partition columns in data 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33139) protect setActiveSession and clearActiveSession

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-33139:
---

Assignee: jiahong.li

> protect setActiveSession and clearActiveSession
> ---
>
> Key: SPARK-33139
> URL: https://issues.apache.org/jira/browse/SPARK-33139
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: jiahong.li
>Priority: Major
> Fix For: 3.1.0
>
>
> This PR is a sub-task of 
> [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to 
> make SQLConf.get reliable and stable, we need to make sure users can't pollute 
> the SQLConf and SparkSession context by calling setActiveSession and 
> clearActiveSession.
> Changes in the PR:
> * add a legacy config spark.sql.legacy.allowModifyActiveSession to fall back to 
> the old behavior if users do need to call these two APIs.
> * by default, calling these two APIs throws an exception
> * add two extra internal, private APIs setActiveSessionInternal and 
> clearActiveSessionInternal for current internal usage
> * change all internal references to the new internal APIs, except for 
> SQLContext.setActive and SQLContext.clearActive
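> A hedged sketch of the two public entry points being guarded (the local session below is a throwaway; under the described default these calls would be rejected unless the legacy config is enabled):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
>
> // The two public entry points this ticket restricts:
> SparkSession.setActiveSession(spark)
> SparkSession.clearActiveSession()
> {code}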



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33105) Broken installation of source packages on AppVeyor

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-33105:
---

Assignee: Maciej Szymkiewicz

> Broken installation of source packages on AppVeyor
> --
>
> Key: SPARK-33105
> URL: https://issues.apache.org/jira/browse/SPARK-33105
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, R
>Affects Versions: 3.1.0
> Environment: *strong text*
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> It looks like the AppVeyor configuration is broken, which leads to failures when 
> installing source packages (this became a problem when {{rlang}} was updated from 
> 0.4.7 to 0.4.8, with the latter available only as a source package).
> {code}
> [00:01:48] trying URL
> 'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz'
> [00:01:48] Content type 'application/x-gzip' length 847517 bytes (827 KB)
> [00:01:48] ==
> [00:01:48] downloaded 827 KB
> [00:01:48] 
> [00:01:48] Warning in strptime(xx, f, tz = tz) :
> [00:01:48]   unable to identify current timezone 'C':
> [00:01:48] please set environment variable 'TZ'
> [00:01:49] * installing *source* package 'rlang' ...
> [00:01:49] ** package 'rlang' successfully unpacked and MD5 sums checked
> [00:01:49] ** using staged installation
> [00:01:49] ** libs
> [00:01:49] 
> [00:01:49] *** arch - i386
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c capture.c -o capture.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c export.c -o export.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c internal.c -o internal.o
> [00:01:50] In file included from ./lib/rlang.h:74,
> [00:01:50]  from internal/arg.c:1,
> [00:01:50]  from internal.c:1:
> [00:01:50] internal/eval-tidy.c: In function 'rlang_tilde_eval':
> [00:01:50] ./lib/env.h:33:10: warning: 'top' may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> [00:01:50]return ENCLOS(env);
> [00:01:50]   ^~~
> [00:01:50] In file included from internal.c:8:
> [00:01:50] internal/eval-tidy.c:406:9: note: 'top' was declared here
> [00:01:50]sexp* top;
> [00:01:50]  ^~~
> [00:01:50] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c lib.c -o lib.o
> [00:01:51] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c version.c -o version.o
> [00:01:52] C:/Rtools40/mingw64/bin/gcc -shared -s -static-libgcc -o
> rlang.dll tmp.def capture.o export.o internal.o lib.o version.o
> -LC:/R/bin/i386 -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> cannot find -lR
> [00:01:52] collect2.exe: error: ld returned 1 exit status
> [00:01:52] no DLL was created
> [00:01:52] ERROR: compilation failed for package 'rlang'
> [00:01:52] * removing 'C:/RLibrary/rlang'
> [00:01:52] 
> [00:01:52] The downloaded source packages are in
> [00:01:52]
> 'C:\Users\appveyor\AppData\Local\Temp\1\Rtmp8qrryA\downloaded_packages'
> [00:01:52] Warning message:
> [00:01:52] In install.packages(c("knitr", "rmarkdown", "testthat",
> "e1071",  :
> [00:01:52]   installation of package 'rlang' had non-zero exit status 
> {code}
> This leads to failures to install {{devtools}} and generate Rd files and, as 
> a result, CRAN check failure.
> There are some discrepancies in the 
> {{dev/appveyor-install-dependencies.ps1}}, but the direct source of this 
> issue seems to be {{$env:BINPREF}}, which forces usage of 64 bit mingw, even 
> if packages are compiled for 32 bit. 
> Modifying the variable to include current architecture:
> {code}
> $env:BINPREF=$RtoolsDrive + '/Rtools40/mingw$(WIN)/bin/'
> {code}
> (as proposed [here|https://stackoverflow.com/a/44035904] by R Yoda) looks 
> like a valid fix, though we might want to clean remaining issues as well.



--
This message was sent by Atlassian Jira

[jira] [Assigned] (SPARK-33169) Check propagation of datasource options to underlying file system for built-in file-based datasources

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-33169:
---

Assignee: Maxim Gekk

> Check propagation of datasource options to underlying file system for 
> built-in file-based datasources
> -
>
> Key: SPARK-33169
> URL: https://issues.apache.org/jira/browse/SPARK-33169
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Add a common trait with a test to check that datasource options are 
> propagated to underlying file systems. Individual tests were already added by 
> SPARK-33094 and SPARK-33089. The ticket aims to de-duplicate the tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33165) Remove dependencies(scalatest,scalactic) from Benchmark

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-33165:
---

Assignee: Takeshi Yamamuro

> Remove dependencies(scalatest,scalactic) from Benchmark
> ---
>
> Key: SPARK-33165
> URL: https://issues.apache.org/jira/browse/SPARK-33165
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> This ticket aims at removing `assert` from `Benchmark` for making it easier 
> to run benchmark codes via `spark-submit`.
> Since the current `Benchmark` (`master` and `branch-3.0`) has `assert`, we 
> need to pass the proper jars of `scalatest` and `scalactic`;
>  - scalatest-core_2.12-3.2.0.jar
>  - scalatest-compatible-3.2.0.jar
>  - scalactic_2.12-3.0.jar
> {code}
> ./bin/spark-submit --jars 
> scalatest-core_2.12-3.2.0.jar,scalatest-compatible-3.2.0.jar,scalactic_2.12-3.0.jar,./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar
>  --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark 
> ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location 
> /tmp/tpcds-sf1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-33265:
---

Assignee: Kousuke Saruta  (was: Apache Spark)

> Rename classOf[Seq] to classOf[scala.collection.Seq] in 
> PostgresIntegrationSuite for Scala 2.13
> ---
>
> Key: SPARK-33265
> URL: https://issues.apache.org/jira/browse/SPARK-33265
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> In PostgresIntegrationSuite, evaluating classOf[Seq].isAssignableFrom fails 
> due to a ClassCastException. 
> The cause is the same as the one resolved in SPARK-29292, but this happens at 
> test time, not compile time.
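> For illustration, a hedged standalone sketch of why the plain Seq name is too 
> narrow on Scala 2.13 (the actual assertion in the suite may look different):
> {code:scala}
> // stand-in for the collection produced when reading the array column back
> val v: AnyRef = scala.collection.mutable.ArraySeq(1, 2, 3)
>
> // Scala 2.13: plain Seq means scala.collection.immutable.Seq, so this is too narrow
> val narrow = classOf[Seq[_]].isAssignableFrom(v.getClass)                   // false on 2.13
>
> // naming the broader type explicitly works on both 2.12 and 2.13
> val broad = classOf[scala.collection.Seq[_]].isAssignableFrom(v.getClass)   // true
> {code}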



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33860) Make CatalystTypeConverters.convertToCatalyst match special Array value

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-33860:
---

Assignee: ulysses you

> Make CatalystTypeConverters.convertToCatalyst match special Array value
> ---
>
> Key: SPARK-33860
> URL: https://issues.apache.org/jira/browse/SPARK-33860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.0
>
>
> Array[Any] doesn't match Array[Int].
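> For illustration, a small standalone example of the mismatch (not the 
> converter code itself): a JVM primitive array such as int[] is not an 
> Object[], so a pattern typed as Array[Any] never fires for it.
> {code:scala}
> val value: Any = Array(1, 2, 3)   // runtime class is int[]
>
> value match {
>   case _: Array[Any] => println("Array[Any]")   // not hit: int[] is not Object[]
>   case _: Array[_]   => println("Array[_]")     // hit: matches any array type
>   case _             => println("no match")
> }
> {code}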



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34116) Separate state store numKeys metric test

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-34116:
---

Assignee: L. C. Hsieh  (was: Apache Spark)

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0, 3.1.1
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.1
>
>
> Right now in StateStoreSuite, the tests for get/put/remove/commit are mixed 
> with the numKeys metric test. I found this flaky when I was testing another 
> StateStore implementation. Specifically, these metrics can also be checked 
> after the state store is updated (committed). So I think we can refactor the 
> test a little bit to make it easier to plug in an external StateStore 
> implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34027) ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache

2021-02-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-34027:
---

Assignee: Maxim Gekk  (was: Apache Spark)

> ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
> ---
>
> Key: SPARK-34027
> URL: https://issues.apache.org/jira/browse/SPARK-34027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> Here is the example to reproduce the issue:
> {code:sql}
> spark-sql> create table tbl (col int, part int) using parquet partitioned by 
> (part);
> spark-sql> insert into tbl partition (part=0) select 0;
> spark-sql> cache table tbl;
> spark-sql> select * from tbl;
> 0 0
> spark-sql> show table extended like 'tbl' partition(part=0);
> default   tbl false   Partition Values: [part=0]
> Location: 
> file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0
> ...
> {code}
> Add new partition by copying the existing one:
> {code}
> cp -r 
> /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0
>  
> /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1
> {code}
>  Recover and select the table:
> {code}
> spark-sql> alter table tbl recover partitions;
> spark-sql> select * from tbl;
> 0 0
> {code}
> We see only old data.
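> Until this is fixed, a hedged workaround sketch is to invalidate the cached 
> entries manually after recovering partitions (same table as above):
> {code:scala}
> spark.sql("ALTER TABLE tbl RECOVER PARTITIONS")
> spark.sql("REFRESH TABLE tbl")         // drop the stale cached data/metadata
> spark.sql("SELECT * FROM tbl").show()  // now also returns the new partition
> {code}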



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34374) Use standard methods to extract keys or values from a Map.

2021-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34374.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31484
[https://github.com/apache/spark/pull/31484]

> Use standard methods to extract keys or values from a Map.
> --
>
> Key: SPARK-34374
> URL: https://issues.apache.org/jira/browse/SPARK-34374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 3.2.0
>
>
> For keys:
> *before* 
> {code:scala}
> map.map(_._1)
> {code}
> *after*
> {code:java}
> map.keys
> {code}
> For values:
> *before* 
> {code:scala}
> map.map(_._2)
> {code}
> *after*
> {code:java}
> map.values
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34374) Use standard methods to extract keys or values from a Map.

2021-02-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-34374:


Assignee: Yang Jie

> Use standard methods to extract keys or values from a Map.
> --
>
> Key: SPARK-34374
> URL: https://issues.apache.org/jira/browse/SPARK-34374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
>
> For keys:
> *before* 
> {code:scala}
> map.map(_._1)
> {code}
> *after*
> {code:java}
> map.keys
> {code}
> For values:
> *before* 
> {code:scala}
> map.map(_._2)
> {code}
> *after*
> {code:java}
> map.values
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-08 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-34194.
--
Resolution: Won't Fix

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas listing the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness, 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-08 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281363#comment-17281363
 ] 

Cheng Su commented on SPARK-34194:
--

[~nchammas] - I think a metadata-only query on partition columns cannot be made 
reliably correct with the current design, and it's hard to fix. Here's an example: a 
Parquet/ORC table may have only 0-row files under a partition directory (this can 
happen). There's no easy way to return a result from metadata alone, because we don't 
know whether a file has 0 rows until we open it and inspect its file metadata. And if 
we end up opening files anyway, there's not much performance improvement left compared 
to actually running the query.

For your specific case, if you are sure that your data does not have the 0-row-file 
problem, I suggest setting the config `spark.sql.optimizer.metadataOnly` to true to 
unblock yourself.
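For example (a sketch only, assuming a catalog-backed partitioned table; the table 
and database names are placeholders):

{code:scala}
import org.apache.spark.sql.functions.desc

// opt in to the metadata-only optimization (only safe if no 0-row files exist)
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

// or bypass the data files entirely and ask the catalog for partition values
val latestPartition = spark.sql("SHOW PARTITIONS some_db.some_table")
  .orderBy(desc("partition"))
  .limit(1)
latestPartition.show(truncate = false)
{code}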

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas listing the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness, 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read

2021-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281348#comment-17281348
 ] 

Apache Spark commented on SPARK-34404:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31529

> Support Avro datasource options to control datetime rebasing in read
> 
>
> Key: SPARK-34404
> URL: https://issues.apache.org/jira/browse/SPARK-34404
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Add new Avro option similar to the SQL configs 
> {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}{{.}}
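> For illustration only (the option name "datetimeRebaseMode" mirrors the 
> Parquet config and is an assumption, not necessarily the final name):
> {code:scala}
> // read old Avro data, controlling the rebase of ancient dates/timestamps explicitly
> val df = spark.read
>   .format("avro")
>   .option("datetimeRebaseMode", "CORRECTED")   // LEGACY / CORRECTED / EXCEPTION
>   .load("/path/to/old-avro-data")
> {code}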



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read

2021-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34404:


Assignee: Apache Spark  (was: Maxim Gekk)

> Support Avro datasource options to control datetime rebasing in read
> 
>
> Key: SPARK-34404
> URL: https://issues.apache.org/jira/browse/SPARK-34404
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> Add new Avro option similar to the SQL configs 
> {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}{{.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read

2021-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34404:


Assignee: Maxim Gekk  (was: Apache Spark)

> Support Avro datasource options to control datetime rebasing in read
> 
>
> Key: SPARK-34404
> URL: https://issues.apache.org/jira/browse/SPARK-34404
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Add new Avro option similar to the SQL configs 
> {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}{{.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read

2021-02-08 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-34404:
---
Description: Add new Avro option similar to the SQL configs 
{{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}{{.}}  (was: Add new 
parquet options similar to the SQL configs 
{{spark.sql.legacy.parquet.datetimeRebaseModeInRead}} and 
{{spark.sql.legacy.parquet.int96RebaseModeInRead.}})

> Support Avro datasource options to control datetime rebasing in read
> 
>
> Key: SPARK-34404
> URL: https://issues.apache.org/jira/browse/SPARK-34404
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Add new Avro option similar to the SQL configs 
> {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}{{.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read

2021-02-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-34404:
--

 Summary: Support Avro datasource options to control datetime 
rebasing in read
 Key: SPARK-34404
 URL: https://issues.apache.org/jira/browse/SPARK-34404
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk
Assignee: Maxim Gekk
 Fix For: 3.2.0


Add new parquet options similar to the SQL configs 
{{spark.sql.legacy.parquet.datetimeRebaseModeInRead}} and 
{{spark.sql.legacy.parquet.int96RebaseModeInRead.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules

2021-02-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34168:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Improvement)

> Support DPP in AQE When the join is Broadcast hash join before applying the 
> AQE rules
> -
>
> Key: SPARK-34168
> URL: https://issues.apache.org/jira/browse/SPARK-34168
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, AQE and DPP cannot be applied at the same time. This PR enables 
> both AQE and DPP when the join is already a broadcast hash join before the 
> AQE rules are applied.
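> For context, the two existing switches involved are shown below; this issue is 
> about letting them take effect together, not about adding new options:
> {code:scala}
> spark.conf.set("spark.sql.adaptive.enabled", "true")                           // AQE
> spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")  // DPP
> {code}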



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-08 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281269#comment-17281269
 ] 

Nicholas Chammas commented on SPARK-34194:
--

It's not clear to me whether SPARK-26709 is describing an inherent design issue 
that has no fix, or whether SPARK-26709 simply captures a bug in the past 
implementation of {{OptimizeMetadataOnlyQuery}} which could conceivably be 
fixed in the future.

If it's something that could be fixed and reintroduced, this issue should stay 
open. If we know for design reasons that metadata-only queries cannot be made 
reliably correct, then this issue should be closed with a clear explanation to 
that effect.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas listing the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness, 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34344) Have functionality to trace back Spark SQL queries from the application ID that got submitted on YARN

2021-02-08 Thread Arpan Bhandari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpan Bhandari updated SPARK-34344:
---
Component/s: SQL

> Have functionality to trace back Spark SQL queries from the application ID 
> that got submitted on YARN
> -
>
> Key: SPARK-34344
> URL: https://issues.apache.org/jira/browse/SPARK-34344
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.3, 2.3.0, 2.4.5
>Reporter: Arpan Bhandari
>Priority: Major
>
> We need the Application Id from the resource manager mapped to the specific 
> Spark SQL query that was executed under that application Id, so that back 
> tracing is possible.
> For example, if I run a query using spark-shell: 
> spark.sql("select dt.d_year,item.i_brand_id brand_id,item.i_brand 
> brand,sum(ss_ext_sales_price) sum_agg from date_dim dt,store_sales,item where 
> dt.d_date_sk = store_sales.ss_sold_date_sk and store_sales.ss_item_sk = 
> item.i_item_sk and item.i_manufact_id = 436 and dt.d_moy=12 group by 
> dt.d_year,item.i_brand,item.i_brand_id order by dt.d_year,sum_agg 
> desc,brand_id limit 100").show();
> When I look at the event logs or the History Server, I don't see the query 
> text anywhere, only the query plan, so it becomes difficult to trace back 
> which query was actually submitted (if I have to map it to a specific 
> application Id on YARN).
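> A hedged workaround sketch (not a fix for this request): tag each query with a 
> job description, so the SQL text shows up in the event logs and History Server 
> next to the YARN application Id.
> {code:scala}
> def runTracked(label: String, sqlText: String) = {
>   // the description is attached to every job triggered by this query
>   spark.sparkContext.setJobDescription(s"$label: $sqlText")
>   spark.sql(sqlText)
> }
>
> runTracked("tpcds-like-q3", "select ... limit 100").show()   // query text abbreviated
> {code}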



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34344) Have functionality to trace back Spark SQL queries from the application ID that got submitted on YARN

2021-02-08 Thread Arpan Bhandari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpan Bhandari updated SPARK-34344:
---
Component/s: (was: Spark Submit)

> Have functionality to trace back Spark SQL queries from the application ID 
> that got submitted on YARN
> -
>
> Key: SPARK-34344
> URL: https://issues.apache.org/jira/browse/SPARK-34344
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Shell
>Affects Versions: 1.6.3, 2.3.0, 2.4.5
>Reporter: Arpan Bhandari
>Priority: Major
>
> We need the Application Id from the resource manager mapped to the specific 
> Spark SQL query that was executed under that application Id, so that back 
> tracing is possible.
> For example, if I run a query using spark-shell: 
> spark.sql("select dt.d_year,item.i_brand_id brand_id,item.i_brand 
> brand,sum(ss_ext_sales_price) sum_agg from date_dim dt,store_sales,item where 
> dt.d_date_sk = store_sales.ss_sold_date_sk and store_sales.ss_item_sk = 
> item.i_item_sk and item.i_manufact_id = 436 and dt.d_moy=12 group by 
> dt.d_year,item.i_brand,item.i_brand_id order by dt.d_year,sum_agg 
> desc,brand_id limit 100").show();
> When I look at the event logs or the History Server, I don't see the query 
> text anywhere, only the query plan, so it becomes difficult to trace back 
> which query was actually submitted (if I have to map it to a specific 
> application Id on YARN).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules

2021-02-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34168:
---

Assignee: Ke Jia

> Support DPP in AQE When the join is Broadcast hash join before applying the 
> AQE rules
> -
>
> Key: SPARK-34168
> URL: https://issues.apache.org/jira/browse/SPARK-34168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
>
> Currently, AQE and DPP cannot be applied at the same time. This PR enables 
> both AQE and DPP when the join is already a broadcast hash join before the 
> AQE rules are applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules

2021-02-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34168.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31258
[https://github.com/apache/spark/pull/31258]

> Support DPP in AQE When the join is Broadcast hash join before applying the 
> AQE rules
> -
>
> Key: SPARK-34168
> URL: https://issues.apache.org/jira/browse/SPARK-34168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, AQE and DPP cannot be applied at the same time. This PR enables 
> both AQE and DPP when the join is already a broadcast hash join before the 
> AQE rules are applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34403) Remove dependency to commons-httpclient, is not used and has vulnerabilities.

2021-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34403:


Assignee: (was: Apache Spark)

> Remove dependency to commons-httpclient, is not used and has vulnerabilities.
> -
>
> Key: SPARK-34403
> URL: https://issues.apache.org/jira/browse/SPARK-34403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Sergio Sainz
>Priority: Major
>
> <dependency>
>  <groupId>commons-httpclient</groupId>
>  <artifactId>commons-httpclient</artifactId>
> </dependency>
>  
> Has vulnerabilities as below:
>  
> CVE-2012-6153
> CVE-2012-5783
>  
> Also, after removing it and running `spark/sql/hive$mvn compile test` the 
> result is SUCCESS
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34403) Remove dependency to commons-httpclient, is not used and has vulnerabilities.

2021-02-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34403:


Assignee: Apache Spark

> Remove dependency to commons-httpclient, is not used and has vulnerabilities.
> -
>
> Key: SPARK-34403
> URL: https://issues.apache.org/jira/browse/SPARK-34403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Sergio Sainz
>Assignee: Apache Spark
>Priority: Major
>
> <dependency>
>  <groupId>commons-httpclient</groupId>
>  <artifactId>commons-httpclient</artifactId>
> </dependency>
>  
> Has vulnerabilities as below:
>  
> CVE-2012-6153
> CVE-2012-5783
>  
> Also, after removing it and running `spark/sql/hive$mvn compile test` the 
> result is SUCCESS
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34403) Remove dependency to commons-httpclient, is not used and has vulnerabilities.

2021-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281180#comment-17281180
 ] 

Apache Spark commented on SPARK-34403:
--

User 'ssainz' has created a pull request for this issue:
https://github.com/apache/spark/pull/31528

> Remove dependency to commons-httpclient, is not used and has vulnerabilities.
> -
>
> Key: SPARK-34403
> URL: https://issues.apache.org/jira/browse/SPARK-34403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Sergio Sainz
>Priority: Major
>
> <dependency>
>  <groupId>commons-httpclient</groupId>
>  <artifactId>commons-httpclient</artifactId>
> </dependency>
>  
> Has vulnerabilities as below:
>  
> CVE-2012-6153
> CVE-2012-5783
>  
> Also, after removing it and running `spark/sql/hive$mvn compile test` the 
> result is SUCCESS
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34403) Remove dependency to commons-httpclient, is not used and has vulnerabilities.

2021-02-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281181#comment-17281181
 ] 

Apache Spark commented on SPARK-34403:
--

User 'ssainz' has created a pull request for this issue:
https://github.com/apache/spark/pull/31528

> Remove dependency to commons-httpclient, is not used and has vulnerabilities.
> -
>
> Key: SPARK-34403
> URL: https://issues.apache.org/jira/browse/SPARK-34403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Sergio Sainz
>Priority: Major
>
> <dependency>
>  <groupId>commons-httpclient</groupId>
>  <artifactId>commons-httpclient</artifactId>
> </dependency>
>  
> Has vulnerabilities as below:
>  
> CVE-2012-6153
> CVE-2012-5783
>  
> Also, after removing it and running `spark/sql/hive$mvn compile test` the 
> result is SUCCESS
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34403) Remove dependency to commons-httpclient, is not used and has vulnerabilities.

2021-02-08 Thread Sergio Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281179#comment-17281179
 ] 

Sergio Sainz commented on SPARK-34403:
--

PR: https://github.com/apache/spark/pull/31528

> Remove dependency to commons-httpclient, is not used and has vulnerabilities.
> -
>
> Key: SPARK-34403
> URL: https://issues.apache.org/jira/browse/SPARK-34403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Sergio Sainz
>Priority: Major
>
> <dependency>
>  <groupId>commons-httpclient</groupId>
>  <artifactId>commons-httpclient</artifactId>
> </dependency>
>  
> Has vulnerabilities as below:
>  
> CVE-2012-6153
> CVE-2012-5783
>  
> Also, after removing it and running `spark/sql/hive$mvn compile test` the 
> result is SUCCESS
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


