[jira] [Reopened] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure
[ https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reopened SPARK-34406:
----------------------------------

> When we submit spark core tasks frequently, the submitted nodes will have a
> lot of resource pressure
>
> Key: SPARK-34406
> URL: https://issues.apache.org/jira/browse/SPARK-34406
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.1
> Reporter: hao
> Priority: Major
>
> When we submit Spark applications frequently, the submitting node comes under
> heavy resource pressure, because Spark creates a process rather than a thread
> for each submission. When the submission QPS is high, submissions fail due to
> insufficient resources. How can the resources consumed by submission be
> reduced?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure
[ https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281560#comment-17281560 ]

hao commented on SPARK-34406:
-----------------------------

Yes, I'm using YARN cluster mode. What I mean is that when the Spark client submits an application to the remote YARN cluster, the submission itself runs as a separate process, and that consumes a lot of resources.
[jira] [Resolved] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation
[ https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-34389.
----------------------------------
Resolution: Not A Problem

> Spark job on Kubernetes scheduled For Zero or less than minimum number of
> executors and Wait indefinitely under resource starvation
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.0.1
> Reporter: Ranju
> Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt,
> Steps to reproduce.docx
>
> If the cluster does not have sufficient resources (CPU/memory) for the
> minimum number of executors, the executors stay in the Pending state
> indefinitely, until resources are freed.
> Suppose the cluster configuration is:
> total memory = 204Gi
> used memory = 200Gi
> free memory = 4Gi
> spark.executor.memory=10g
> spark.dynamicAllocation.minExecutors=4
> spark.dynamicAllocation.maxExecutors=8
> Instead, the job should be cancelled if the requested minimum number of
> executors is not available at that point because of resource unavailability.
> Currently Spark does partial scheduling or no scheduling, waits indefinitely,
> and the job gets stuck.
[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation
[ https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281544#comment-17281544 ]

Hyukjin Kwon commented on SPARK-34389:
--------------------------------------

I think this is more of a question. I will tentatively resolve this ticket. [~ranju] it would be great if we could discuss this on the mailing list first before filing it as an issue.
[jira] [Commented] (SPARK-34392) Invalid ID for offset-based ZoneId since Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-34392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281543#comment-17281543 ]

Hyukjin Kwon commented on SPARK-34392:
--------------------------------------

cc [~maxgekk] FYI

> Invalid ID for offset-based ZoneId since Spark 3.0
>
> Key: SPARK-34392
> URL: https://issues.apache.org/jira/browse/SPARK-34392
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Reporter: Yuming Wang
> Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> {code}
> Spark 2.4:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 2020-02-07 08:00:00
> Time taken: 0.089 seconds, Fetched 1 row(s)
> {noformat}
> Spark 3.x:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 21/02/07 01:24:32 ERROR SparkSQLDriver: Failed in [select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")]
> java.time.DateTimeException: Invalid ID for offset-based ZoneId: GMT+8:00
> at java.time.ZoneId.ofWithPrefix(ZoneId.java:437)
> at java.time.ZoneId.of(ZoneId.java:407)
> at java.time.ZoneId.of(ZoneId.java:359)
> at java.time.ZoneId.of(ZoneId.java:315)
> at org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:53)
> at org.apache.spark.sql.catalyst.util.DateTimeUtils$.toUTCTime(DateTimeUtils.scala:814)
> {noformat}
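The behavior change above comes from the JDK, not from Spark-specific parsing: Spark 3.x resolves time zones through java.time.ZoneId, and ZoneId.of only accepts a GMT-prefixed ID when the suffix is a valid ZoneOffset pattern ("+h", "+hh:mm", ...), which "+8:00" is not. A minimal sketch reproducing the JDK-level difference (the class and helper names here are illustrative, not Spark code):

```java
import java.time.DateTimeException;
import java.time.ZoneId;

public class ZoneIdCheck {
    // Returns true when java.time accepts the zone ID, which is what
    // Spark 3.x's DateTimeUtils.getZoneId ultimately requires.
    static boolean isValidZoneId(String id) {
        try {
            ZoneId.of(id);
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // java.time requires a two-digit hour after the GMT prefix,
        // so the single-digit form from the report is rejected.
        System.out.println(isValidZoneId("GMT+8:00"));   // false
        System.out.println(isValidZoneId("GMT+08:00"));  // true
    }
}
```

Writing the offset with a two-digit hour ("GMT+08:00") works in both Spark 2.4 and 3.x.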
[jira] [Resolved] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure
[ https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-34406.
----------------------------------
Resolution: Won't Fix
[jira] [Commented] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure
[ https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281541#comment-17281541 ]

Hyukjin Kwon commented on SPARK-34406:
--------------------------------------

You should use YARN cluster mode, which will distribute the drivers evenly across the cluster.
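The suggestion above, as a spark-submit fragment (the class name and jar path are placeholders): in YARN cluster mode the driver runs inside a container on the cluster rather than on the submitting host, so the submitting node no longer carries one driver JVM per application.

```shell
# Placeholder --class and jar; only --master/--deploy-mode are the point here.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
```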
[jira] [Updated] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure
[ https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-34406:
---------------------------------
Priority: Major (was: Critical)
[jira] [Updated] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources
[ https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-34407:
----------------------------------
Affects Version/s: 2.4.7

> KubernetesClusterSchedulerBackend.stop should clean up K8s resources
>
> Key: SPARK-34407
> URL: https://issues.apache.org/jira/browse/SPARK-34407
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: 2.4.7, 3.0.1, 3.1.0, 3.1.1
> Reporter: Dongjoon Hyun
> Priority: Major
[jira] [Assigned] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources
[ https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-34407:
-------------------------------------
Assignee: Dongjoon Hyun
[jira] [Resolved] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources
[ https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-34407.
-----------------------------------
Fix Version/s: 2.4.8, 3.0.2, 3.1.2
Resolution: Fixed

Issue resolved by pull request 31533
[https://github.com/apache/spark/pull/31533]
[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation
[ https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281534#comment-17281534 ]

Ranju commented on SPARK-34389:
-------------------------------

Yes, I understand now why there is no retry logic; thanks for the explanation. You can close the issue. Could you advise how to mitigate the indefinite wait for executors: is it possible to get the cluster's available resources, match them against the required executor resources, and submit the job only if they are sufficient?
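Spark itself does not expose such a pre-submission capacity check; a caller would have to query the cluster (e.g. the Kubernetes API) for free capacity first. The sketch below only shows the arithmetic such a check would do, using the figures from the issue description (4Gi free, 10Gi per executor, minExecutors=4); the class and method names are hypothetical.

```java
public class PreSubmitCheck {
    // Purely illustrative: decide whether the minimum executor set fits
    // into the currently free memory. Obtaining the real free capacity
    // would require querying the cluster manager, which is out of scope.
    static boolean canSchedule(long freeMemGi, long executorMemGi, int minExecutors) {
        return freeMemGi >= executorMemGi * (long) minExecutors;
    }

    public static void main(String[] args) {
        // From the description: 4Gi free, 10Gi per executor,
        // minExecutors=4 means 40Gi is needed.
        System.out.println(canSchedule(4, 10, 4));  // false: do not submit
    }
}
```

Note this check is inherently racy: free capacity can change between the check and the actual pod scheduling, which is one reason Spark leaves pods Pending instead of failing fast.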
[jira] [Assigned] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
[ https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-34405:
-------------------------------------
Assignee: iteblog

> The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
>
> Key: SPARK-34405
> URL: https://issues.apache.org/jira/browse/SPARK-34405
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.1, 3.1.0, 3.2.0, 3.1.1
> Reporter: iteblog
> Assignee: iteblog
> Priority: Minor
>
> The mean value reported for timers by the PrometheusServlet class is wrong;
> see line 105 of the class:
> [L105|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105]
> {code:java}
> sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code}
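The defect is visible in the quoted line: the metric line labelled "Mean" is built from snapshot.getMax. A stand-in sketch of the wrong and corrected lines (the Snapshot type here is a minimal stub with hard-coded values, not the real metrics-library class, and all names are illustrative):

```java
public class PrometheusLineSketch {
    // Minimal stand-in for the metrics Snapshot used by PrometheusServlet.
    static final class Snapshot {
        double getMax() { return 900.0; }
        double getMean() { return 12.5; }
    }

    // Mirrors the shape of the buggy line: the "Mean" label is fed from getMax.
    static String meanLineBuggy(String prefix, String labels, Snapshot s) {
        return prefix + "Mean" + labels + " " + s.getMax() + "\n";
    }

    // The corrected line uses getMean, matching its label.
    static String meanLineFixed(String prefix, String labels, Snapshot s) {
        return prefix + "Mean" + labels + " " + s.getMean() + "\n";
    }

    public static void main(String[] args) {
        Snapshot s = new Snapshot();
        System.out.print(meanLineBuggy("metrics_timer_", "{app=\"x\"}", s)); // reports the max, mislabelled as mean
        System.out.print(meanLineFixed("metrics_timer_", "{app=\"x\"}", s)); // reports the actual mean
    }
}
```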
[jira] [Resolved] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
[ https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-34405.
-----------------------------------
Fix Version/s: 3.0.2, 3.1.1
Resolution: Fixed

Issue resolved by pull request 31532
[https://github.com/apache/spark/pull/31532]
[jira] [Updated] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
[ https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-34405:
----------------------------------
Fix Version/s: 3.1.2 (was: 3.1.1)
[jira] [Updated] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
[ https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-34405:
----------------------------------
Affects Version/s: 3.1.0, 3.1.1, 3.2.0
[jira] [Updated] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources
[ https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-34407:
----------------------------------
Affects Version/s: 2.3.4
[jira] [Created] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources
Dongjoon Hyun created SPARK-34407:
---------------------------------
Summary: KubernetesClusterSchedulerBackend.stop should clean up K8s resources
Key: SPARK-34407
URL: https://issues.apache.org/jira/browse/SPARK-34407
Project: Spark
Issue Type: Bug
Components: Kubernetes
Affects Versions: 3.0.1, 3.1.0, 3.1.1
Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources
[ https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-34407:
----------------------------------
Parent: SPARK-33005
Issue Type: Sub-task (was: Bug)
[jira] [Commented] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources
[ https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281523#comment-17281523 ]

Apache Spark commented on SPARK-34407:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31533
[jira] [Assigned] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources
[ https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34407:
------------------------------------
Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources
[ https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34407:
------------------------------------
Assignee: Apache Spark
[jira] [Commented] (SPARK-34407) KubernetesClusterSchedulerBackend.stop should clean up K8s resources
[ https://issues.apache.org/jira/browse/SPARK-34407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281522#comment-17281522 ]

Apache Spark commented on SPARK-34407:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31533
[jira] [Commented] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
[ https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281506#comment-17281506 ]

Apache Spark commented on SPARK-34405:
--------------------------------------

User '397090770' has created a pull request for this issue:
https://github.com/apache/spark/pull/31532
[jira] [Commented] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
[ https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281505#comment-17281505 ]

Apache Spark commented on SPARK-34405:
--------------------------------------

User '397090770' has created a pull request for this issue:
https://github.com/apache/spark/pull/31532
[jira] [Assigned] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
[ https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34405:
------------------------------------
Assignee: Apache Spark
[jira] [Assigned] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
[ https://issues.apache.org/jira/browse/SPARK-34405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34405:
------------------------------------
Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure
hao created SPARK-34406:
-----------------------
Summary: When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure
Key: SPARK-34406
URL: https://issues.apache.org/jira/browse/SPARK-34406
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.0.1
Reporter: hao

When we submit Spark applications frequently, the submitting node comes under heavy resource pressure, because Spark creates a process rather than a thread for each submission. When the submission QPS is high, submissions fail due to insufficient resources. How can the resources consumed by submission be reduced?
[jira] [Created] (SPARK-34405) The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
iteblog created SPARK-34405: --- Summary: The getMetricsSnapshot method of the PrometheusServlet class has a wrong value Key: SPARK-34405 URL: https://issues.apache.org/jira/browse/SPARK-34405 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1 Reporter: iteblog The mean value of timersLabels in the PrometheusServlet class is wrong. You can look at line 105 of this class: [L105.|https://github.com/apache/spark/blob/37fe8c6d3cd1c5aae3a61fb9b32ca4595267f1bb/core/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala#L105] {code:java} // code placeholder sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n"){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34352) Improve SQLQueryTestSuite so as could run on windows system
[ https://issues.apache.org/jira/browse/SPARK-34352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34352: - Fix Version/s: (was: 3.1.2) 3.2.0 > Improve SQLQueryTestSuite so as could run on windows system > --- > > Key: SPARK-34352 > URL: https://issues.apache.org/jira/browse/SPARK-34352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Priority: Major > Fix For: 3.2.0 > > > The current implementation of SQLQueryTestSuite cannot run on Windows, > because the code below fails on Windows: > assume(TestUtils.testCommandAvailable("/bin/bash")) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29465: Assignee: Vishwas Nalka > Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. > - > > Key: SPARK-29465 > URL: https://issues.apache.org/jira/browse/SPARK-29465 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, YARN >Affects Versions: 3.1.0 >Reporter: Vishwas Nalka >Assignee: Vishwas Nalka >Priority: Major > Fix For: 3.1.0 > > > I'm trying to restrict the ports used by a Spark app launched in YARN > cluster mode. All ports (viz. driver, executor, blockmanager) can be > specified using the respective properties except the UI port. The Spark app > is launched using Java code, and setting the property spark.ui.port in > sparkConf doesn't seem to help. Even setting a JVM option > -Dspark.ui.port="some_port" does not spawn the UI on the required port. > From the logs of the spark app, *_the property spark.ui.port is overridden > and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set > to 0. 
> _(Run in Spark 1.6.2) From the logs ->_ > _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH" > {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m > -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' > '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' > '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' > '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_ > _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port > 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ > [_http://10.65.170.98:35167_|http://10.65.170.98:35167/] > Even a *spark-submit command with --conf spark.ui.port* does not spawn the UI on the required port > {color:#172b4d}_(Run in Spark 2.4.4)_{color} > {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi > --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g > --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 > --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color} > _From the logs:_ > _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at > [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_ > _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m > -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' '-Dspark.driver.port=12340' > -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 > --executor-id --hostname --cores 1 --app-id > application_1570992022035_0089 --user-class-path > [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_ > > Looks like the application master overrides this and sets a JVM property before > launch, resulting in a random UI port even though spark.ui.port is set by the > user. 
> In these links > # > [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 214) > # > [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 75) > I can see that the method _*run() in the above files sets a system property > UI_PORT*_ and _*spark.ui.port, respectively.*_ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29465: - Priority: Minor (was: Major) > Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. > - > > Key: SPARK-29465 > URL: https://issues.apache.org/jira/browse/SPARK-29465 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, YARN >Affects Versions: 3.1.0 >Reporter: Vishwas Nalka >Assignee: Vishwas Nalka >Priority: Minor > Fix For: 3.1.0 > > > I'm trying to restrict the ports used by a Spark app launched in YARN > cluster mode. All ports (viz. driver, executor, blockmanager) can be > specified using the respective properties except the UI port. The Spark app > is launched using Java code, and setting the property spark.ui.port in > sparkConf doesn't seem to help. Even setting a JVM option > -Dspark.ui.port="some_port" does not spawn the UI on the required port. > From the logs of the spark app, *_the property spark.ui.port is overridden > and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set > to 0. 
> _(Run in Spark 1.6.2) From the logs ->_ > _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH" > {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m > -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' > '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' > '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' > '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_ > _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port > 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ > [_http://10.65.170.98:35167_|http://10.65.170.98:35167/] > Even a *spark-submit command with --conf spark.ui.port* does not spawn the UI on the required port > {color:#172b4d}_(Run in Spark 2.4.4)_{color} > {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi > --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g > --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 > --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color} > _From the logs:_ > _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at > [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_ > _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m > -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' '-Dspark.driver.port=12340' > -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 > --executor-id --hostname --cores 1 --app-id > application_1570992022035_0089 --user-class-path > [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_ > > Looks like the application master overrides this and sets a JVM property before > launch, resulting in a random UI port even though spark.ui.port is set by the > user. 
> In these links > # > [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 214) > # > [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 75) > I can see that the method _*run() in the above files sets a system property > UI_PORT*_ and _*spark.ui.port, respectively.*_ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
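The `-Dspark.ui.port=0` the reporter sees is why the UI lands on a random port: port 0 asks the operating system to assign any free ephemeral port. A minimal, self-contained demonstration of that OS behavior (plain sockets, nothing Spark-specific):

```python
# Binding to port 0 delegates port selection to the OS, so the actual port
# is only known after bind() -- which is exactly how a "random" UI port
# like 35167 or 46622 appears when spark.ui.port is forced to 0.
import socket

def bind_ephemeral():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))      # port 0 = "pick any free port for me"
    port = s.getsockname()[1]     # the port the OS actually assigned
    s.close()
    return port
```

Each call typically returns a different high-numbered port, matching the behavior described in the logs above.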
[jira] [Assigned] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-33571: Assignee: Maxim Gekk > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. 
with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parquet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compared to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here > [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the > outputs in a comment below as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
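The LEGACY/CORRECTED distinction exists because Spark 3 switched from the hybrid Julian+Gregorian calendar to the proleptic Gregorian calendar, so the same stored day number names a different date before 1582-10-15. A rough illustration of the gap (not Spark's rebasing code) using the classic Fliegel–Van Flandern Julian-day-number formulas, written with Fortran-style truncating division:

```python
# Day-number formulas for the Julian and proleptic Gregorian calendars.
# The per-date difference between them is what LEGACY-mode rebasing corrects.

def _t(a, b):
    # integer division truncating toward zero, as in the original formulas
    return int(a / b)

def gregorian_jdn(y, m, d):
    """Julian day number of a proleptic-Gregorian calendar date."""
    return (_t(1461 * (y + 4800 + _t(m - 14, 12)), 4)
            + _t(367 * (m - 2 - 12 * _t(m - 14, 12)), 12)
            - _t(3 * _t(y + 4900 + _t(m - 14, 12), 100), 4)
            + d - 32075)

def julian_jdn(y, m, d):
    """Julian day number of a Julian calendar date."""
    return (367 * y - _t(7 * (y + 5001 + _t(m - 9, 7)), 4)
            + _t(275 * m, 9) + d + 1729777)
```

For example, Julian-calendar 1000-01-01 falls 5 days after proleptic-Gregorian 1000-01-01, and Julian 0002-01-01 falls 2 days before proleptic-Gregorian 0002-01-01 — matching the 1000-01-01 → 1000-01-06 and 0002-01-01 → 0001-12-30 shifts shown in the SPARK-31598 report later in this digest.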
[jira] [Assigned] (SPARK-34395) Clean up unused code for code simplifications
[ https://issues.apache.org/jira/browse/SPARK-34395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-34395: Assignee: yikf > Clean up unused code for code simplifications > - > > Key: SPARK-34395 > URL: https://issues.apache.org/jira/browse/SPARK-34395 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.2.0 >Reporter: yikf >Assignee: yikf >Priority: Trivial > > Currently, we pass the value `EmptyRow` to the method `checkEvaluation` > in StringExpressionsSuite, but `EmptyRow` is already the default value of that > parameter. > We can drop the explicit argument to simplify the code. > > example: > *before:* > {code:java} > def testConcat(inputs: String*): Unit = { > val expected = if (inputs.contains(null)) null else inputs.mkString > checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), > expected, EmptyRow) > }{code} > *after:* > {code:java} > def testConcat(inputs: String*): Unit = { > val expected = if (inputs.contains(null)) null else inputs.mkString > checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), expected) > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34395) Clean up unused code for code simplifications
[ https://issues.apache.org/jira/browse/SPARK-34395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34395. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31510 [https://github.com/apache/spark/pull/31510] > Clean up unused code for code simplifications > - > > Key: SPARK-34395 > URL: https://issues.apache.org/jira/browse/SPARK-34395 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.2.0 >Reporter: yikf >Assignee: yikf >Priority: Trivial > Fix For: 3.2.0 > > > Currently, we pass the value `EmptyRow` to the method `checkEvaluation` > in StringExpressionsSuite, but `EmptyRow` is already the default value of that > parameter. > We can drop the explicit argument to simplify the code. > > example: > *before:* > {code:java} > def testConcat(inputs: String*): Unit = { > val expected = if (inputs.contains(null)) null else inputs.mkString > checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), > expected, EmptyRow) > }{code} > *after:* > {code:java} > def testConcat(inputs: String*): Unit = { > val expected = if (inputs.contains(null)) null else inputs.mkString > checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), expected) > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
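The cleanup in this ticket is the general "don't pass a parameter its own default" pattern. A minimal Python analogue (the names `check_evaluation` and `EMPTY_ROW` are hypothetical stand-ins for the Scala helpers):

```python
# Passing a default value explicitly at the call site adds noise without
# changing behavior; dropping it is a pure simplification.
EMPTY_ROW = ()

def check_evaluation(expression, expected, input_row=EMPTY_ROW):
    # Stub: returns its inputs so the two call styles can be compared.
    return (expression, expected, input_row)

# Before: the default is spelled out at the call site.
verbose = check_evaluation("concat", "ab", EMPTY_ROW)
# After: identical behavior, one argument shorter.
concise = check_evaluation("concat", "ab")
```

Both calls produce identical results, which is why the extra argument can be removed mechanically across the suite.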
[jira] [Resolved] (SPARK-34352) Improve SQLQueryTestSuite so as could run on windows system
[ https://issues.apache.org/jira/browse/SPARK-34352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34352. -- Fix Version/s: 3.1.2 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/31466 > Improve SQLQueryTestSuite so as could run on windows system > --- > > Key: SPARK-34352 > URL: https://issues.apache.org/jira/browse/SPARK-34352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Priority: Major > Fix For: 3.1.2 > > > The current implementation of SQLQueryTestSuite cannot run on Windows, > because the code below fails on Windows: > assume(TestUtils.testCommandAvailable("/bin/bash")) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
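The failing `assume()` hardcodes the POSIX path `/bin/bash`, which does not exist on Windows. A sketch of a more portable availability check via a PATH lookup (an illustration of the idea, not the actual fix merged in the PR above):

```python
# shutil.which searches PATH (and PATHEXT on Windows) instead of assuming
# a fixed absolute path like /bin/bash, so the same probe works on both
# POSIX systems and Windows.
import shutil

def command_available(name):
    return shutil.which(name) is not None
```

A test suite can then skip bash-dependent cases with `command_available("bash")` rather than failing outright on Windows.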
[jira] [Assigned] (SPARK-33566) Incorrectly Parsing CSV file
[ https://issues.apache.org/jira/browse/SPARK-33566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33566: Assignee: Yang Jie > Incorrectly Parsing CSV file > > > Key: SPARK-33566 > URL: https://issues.apache.org/jira/browse/SPARK-33566 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Stephen More >Assignee: Yang Jie >Priority: Minor > Fix For: 3.1.0 > > > Here is a test case: > [https://github.com/mores/maven-examples/blob/master/comma/src/test/java/org/test/CommaTest.java] > It shows how I believe apache commons csv and opencsv correctly parses the > sample csv file. > spark is not correctly parsing the sample csv file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33565) python/run-tests.py calling python3.8
[ https://issues.apache.org/jira/browse/SPARK-33565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33565: Assignee: Shane Knapp (was: Apache Spark) > python/run-tests.py calling python3.8 > - > > Key: SPARK-33565 > URL: https://issues.apache.org/jira/browse/SPARK-33565 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.1 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > Fix For: 3.1.0 > > > this line in run-tests.py on master: > |python_execs = [x for x in ["python3.6", "python3.8", "pypy3"] if which(x)]| > > and this line in branch-3.0: > python_execs = [x for x in ["python3.8", "python2.7", "pypy3", "pypy"] if > which(x)] > ...are currently breaking builds on the new ubuntu 20.04LTS workers. > the default system python is /usr/bin/python3.8 and we do NOT have a working > python3.8 anaconda deployment yet. this is causing python test breakages. > PRs incoming > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
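The quoted `run-tests.py` line builds the interpreter list by probing PATH, so whichever listed name resolves first is used — which is how the bare system `python3.8` on the new Ubuntu 20.04 workers got picked up without a working Anaconda deployment. A minimal reproduction of that selection logic:

```python
# Keep only the interpreter names that actually resolve on PATH,
# mirroring: python_execs = [x for x in [...] if which(x)]
import shutil

def available_pythons(candidates):
    return [x for x in candidates if shutil.which(x)]
```

Entries that don't resolve are silently dropped, so the fix was to change which names appear in the candidate list, not the filtering itself.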
[jira] [Assigned] (SPARK-33376) Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-33376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33376: Assignee: Chao Sun (was: Apache Spark) > Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader > --- > > Key: SPARK-33376 > URL: https://issues.apache.org/jira/browse/SPARK-33376 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.1.0 > > > Currently, when initializing {{IsolatedClientLoader}}, people can specify to > either share Hadoop classes from Spark or not. In the latter case it's > supposed to load only the Hadoop classes from the Hive jars themselves. > However this feature is currently used in two cases: 1) unit tests, 2) when > the Hadoop version defined in Maven cannot be found when > {{spark.sql.hive.metastore.jars == "maven"}}. Also when > {{sharesHadoopClasses}} is false, it isn't really only using Hadoop classes > from Hive jars: Spark also downloads the {{hadoop-client}} jar and puts it together > with the Hive jars, and the Hadoop version used by {{hadoop-client}} is the > same version used by Spark itself. This could potentially cause issues > because we are mixing two versions of Hadoop jars in the classpath. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33303) Deduplicate deterministic PythonUDF calls
[ https://issues.apache.org/jira/browse/SPARK-33303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33303: Assignee: Peter Toth (was: Apache Spark) > Deduplicate deterministic PythonUDF calls > - > > Key: SPARK-33303 > URL: https://issues.apache.org/jira/browse/SPARK-33303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.1.0 > > > We ran into an issue where a customer created a column with an expensive > PythonUDF call and built very complex logic on top of that column as > new derived columns. Due to the `CollapseProject` and `ExtractPythonUDFs` > rules, the UDF is called ~1000 times for each row, which degraded the > performance of the query significantly. > The `ExtractPythonUDFs` rule could deduplicate deterministic UDFs to > avoid this performance degradation. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
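The proposed deduplication relies on determinism: a deterministic UDF applied to the same input can be computed once and reused everywhere it is referenced. A small pure-Python sketch of that idea with a call counter (not Catalyst's rule, just the caching principle behind it):

```python
# Count how often the expensive function actually runs.
calls = {"n": 0}

def expensive_udf(x):
    calls["n"] += 1
    return x * x

def dedup(f):
    # Safe only because f is deterministic: same input -> same output.
    cache = {}
    def wrapper(x):
        if x not in cache:
            cache[x] = f(x)
        return cache[x]
    return wrapper

udf = dedup(expensive_udf)
# 5 references but only 2 distinct inputs -> only 2 real invocations.
results = [udf(v) for v in [2, 2, 3, 2, 3]]
```

With ~1000 references collapsed onto one computed column, the same principle turns ~1000 UDF invocations per row into one.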
[jira] [Assigned] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band
[ https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32215: Assignee: Devesh Agrawal > Expose end point on Master so that it can be informed about decommissioned > workers out of band > -- > > Key: SPARK-32215 > URL: https://issues.apache.org/jira/browse/SPARK-32215 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 > Environment: Standalone Scheduler >Reporter: Devesh Agrawal >Assignee: Devesh Agrawal >Priority: Major > Fix For: 3.1.0 > > > The use case here is to allow some external entity that has made a > decommissioning decision to inform the Master (in case of Standalone > scheduling mode) > The current decommissioning is triggered by the Worker getting a SIGPWR > (out of band possibly by some cleanup hook), which then informs the Master > about it. This approach may not be feasible in some environments that cannot > trigger a clean up hook on the Worker. > Add a new post endpoint {{/workers/kill}} on the MasterWebUI that allows an > external agent to inform the master about all the nodes being decommissioned > in > bulk. The workers are identified by either their {{host:port}} or just the > host > – in which case all workers on the host would be decommissioned. > This API is merely a new entry point into the existing decommissioning > logic. It does not change how the decommissioning request is handled in > its core. > The path /workers/kill is so chosen to be consistent with the other endpoint > names on the MasterWebUI. > Since this is a sensitive operation, this API will be disabled by default. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
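The worker-selection rule described above — each request entry is either `host:port` (one worker) or a bare host (every worker on that host) — can be sketched as a small matching helper. The function name and data shapes are hypothetical; only the matching semantics come from the ticket:

```python
def select_workers(entries, workers):
    """entries: strings like 'host' or 'host:port';
    workers: iterable of (host, port) tuples.
    Returns the subset of workers to decommission."""
    hosts = {e for e in entries if ":" not in e}
    host_ports = {e for e in entries if ":" in e}
    return [(h, p) for (h, p) in workers
            if h in hosts or f"{h}:{p}" in host_ports]
```

A `/workers/kill` handler would parse the POST body into `entries` and hand the selected workers to the existing decommissioning logic.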
[jira] [Assigned] (SPARK-33297) Intermittent Compilation failure In GitHub Actions after SBT upgrade
[ https://issues.apache.org/jira/browse/SPARK-33297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33297: Assignee: Hyukjin Kwon (was: Apache Spark) > Intermittent Compilation failure In GitHub Actions after SBT upgrade > > > Key: SPARK-33297 > URL: https://issues.apache.org/jira/browse/SPARK-33297 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > https://github.com/apache/spark/runs/1314691686 > {code} > Error: java.util.MissingResourceException: Can't find bundle for base name > org.scalactic.ScalacticBundle, locale en > Error:at > java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1581) > Error:at > java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1396) > Error:at java.util.ResourceBundle.getBundle(ResourceBundle.java:782) > Error:at > org.scalactic.Resources$.resourceBundle$lzycompute(Resources.scala:8) > Error:at org.scalactic.Resources$.resourceBundle(Resources.scala:8) > Error:at > org.scalactic.Resources$.pleaseDefineScalacticFillFilePathnameEnvVar(Resources.scala:256) > Error:at > org.scalactic.source.PositionMacro$PositionMacroImpl.apply(PositionMacro.scala:65) > Error:at > org.scalactic.source.PositionMacro$.genPosition(PositionMacro.scala:85) > Error:at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source) > Error:at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > Error:at java.lang.reflect.Method.invoke(Method.java:498) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32450) Upgrade pycodestyle to 2.6.0
[ https://issues.apache.org/jira/browse/SPARK-32450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32450: Assignee: L. C. Hsieh (was: Apache Spark) > Upgrade pycodestyle to 2.6.0 > > > Key: SPARK-32450 > URL: https://issues.apache.org/jira/browse/SPARK-32450 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Trivial > Fix For: 3.1.0 > > > Upgrade pycodestyle to 2.6.0 to include bug fixes and newer Python version > support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
[ https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32613: Assignee: Devesh Agrawal > DecommissionWorkerSuite has started failing sporadically again > -- > > Key: SPARK-32613 > URL: https://issues.apache.org/jira/browse/SPARK-32613 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Devesh Agrawal >Assignee: Devesh Agrawal >Priority: Major > Fix For: 3.1.0 > > > Test "decommission workers ensure that fetch failures lead to rerun" is > failing: > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/] > https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33190) Set upperbound of PyArrow version in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-33190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33190: Assignee: Hyukjin Kwon (was: Apache Spark) > Set upperbound of PyArrow version in GitHub Actions > --- > > Key: SPARK-33190 > URL: https://issues.apache.org/jira/browse/SPARK-33190 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > See SPARK-33189. Some tests look being failed with PyArrow 2.0.0+. We should > make the tests pass. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31830) Consistent error handling for datetime formatting functions
[ https://issues.apache.org/jira/browse/SPARK-31830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31830: Assignee: Kent Yao (was: Apache Spark) > Consistent error handling for datetime formatting functions > --- > > Key: SPARK-31830 > URL: https://issues.apache.org/jira/browse/SPARK-31830 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.1.0 > > > date_format and from_unixtime have different error handling behavior for > formatting datetime values. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32074) Update AppVeyor R to 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32074: Assignee: Hyukjin Kwon > Update AppVeyor R to 4.0.2 > -- > > Key: SPARK-32074 > URL: https://issues.apache.org/jira/browse/SPARK-32074 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > We should test R 4.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32282) Improve EnsureRquirement.reorderJoinKeys to handle more scenarios such as PartitioningCollection
[ https://issues.apache.org/jira/browse/SPARK-32282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32282: Assignee: Terry Kim (was: Apache Spark) > Improve EnsureRquirement.reorderJoinKeys to handle more scenarios such as > PartitioningCollection > > > Key: SPARK-32282 > URL: https://issues.apache.org/jira/browse/SPARK-32282 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.1.0 > > > The EnsureRquirement.reorderJoinKeys can be improved to handle the following > scenarios: > # If the keys cannot be reordered to match the left-side HashPartitioning, > consider the right-side HashPartitioning. > # Handle PartitioningCollection, which may contain HashPartitioning -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
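The first improvement listed — reorder the join keys when they are a permutation of an existing partitioning's keys, so an extra shuffle can be avoided — can be modeled with plain string keys. Catalyst works on expression trees, so this is a deliberately simplified sketch:

```python
def reorder_join_keys(left_keys, right_keys, partition_keys):
    """If left_keys is a duplicate-free permutation of partition_keys,
    reorder the (left, right) key pairs to match the partitioning order;
    otherwise keep the original order."""
    if (sorted(left_keys) == sorted(partition_keys)
            and len(set(left_keys)) == len(left_keys)):
        pairing = dict(zip(left_keys, right_keys))
        return partition_keys, [pairing[k] for k in partition_keys]
    # no safe reordering found; leave the join keys alone
    return left_keys, right_keys
```

The ticket's second point extends this idea: try the right side's partitioning too, and look inside a PartitioningCollection for any member HashPartitioning that matches.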
[jira] [Assigned] (SPARK-30880) Delete Sphinx Makefile cruft
[ https://issues.apache.org/jira/browse/SPARK-30880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30880: Assignee: Nicholas Chammas > Delete Sphinx Makefile cruft > > > Key: SPARK-30880 > URL: https://issues.apache.org/jira/browse/SPARK-30880 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31598) LegacySimpleTimestampFormatter incorrectly interprets pre-Gregorian timestamps
[ https://issues.apache.org/jira/browse/SPARK-31598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31598: Assignee: Bruce Robbins > LegacySimpleTimestampFormatter incorrectly interprets pre-Gregorian timestamps > -- > > Key: SPARK-31598 > URL: https://issues.apache.org/jira/browse/SPARK-31598 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Fix For: 3.0.0, 3.1.0 > > > As per discussion with [~maxgekk]: > {{LegacySimpleTimestampFormatter#parse}} misinterprets pre-Gregorian > timestamps: > {noformat} > scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY") > res0: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> val df1 = Seq("0002-01-01 00:00:00", "1000-01-01 00:00:00", > "1800-01-01 00:00:00").toDF("expected") > df1: org.apache.spark.sql.DataFrame = [expected: string] > scala> val df2 = df1.select('expected, to_timestamp('expected, "yyyy-MM-dd HH:mm:ss").as("actual")) > df2: org.apache.spark.sql.DataFrame = [expected: string, actual: timestamp] > scala> df2.show(truncate=false) > +-------------------+-------------------+ > |expected |actual | > +-------------------+-------------------+ > |0002-01-01 00:00:00|0001-12-30 00:00:00| > |1000-01-01 00:00:00|1000-01-06 00:00:00| > |1800-01-01 00:00:00|1800-01-01 00:00:00| > +-------------------+-------------------+ > scala> > {noformat} > Legacy timestamp parsing with JSON and CSV files is correct, so apparently > {{LegacyFastTimestampFormatter}} does not have this issue (need to double > check). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
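The day shifts in the expected/actual table above fall out of the calendar difference: the legacy formatter parses pre-1582 dates on the Julian calendar, while Spark 3 renders timestamps on the proleptic Gregorian calendar. A minimal sketch of that offset (the helper below is illustrative only, not Spark's actual implementation):

```python
def julian_minus_gregorian_days(year, month):
    """Days to add to a Julian-calendar date to land on the same day in the
    proleptic Gregorian calendar. The difference changes on the Julian leap
    day of century years, hence the January/February adjustment."""
    y = year - 1 if month <= 2 else year
    century = y // 100
    return century - century // 4 - 2

# Offsets matching the table in the issue (pre-cutover dates only):
assert julian_minus_gregorian_days(2, 1) == -2    # 0002-01-01 renders as 0001-12-30
assert julian_minus_gregorian_days(1000, 1) == 5  # 1000-01-01 renders as 1000-01-06
# 1800-01-01 is after the 1582 cutover, where both calendars agree, so no shift.
```

At the Gregorian reform (October 1582) the formula gives the familiar 10-day gap, which is why only the two pre-Gregorian rows in the table shift.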
[jira] [Assigned] (SPARK-31000) Add ability to set table description in the catalog
[ https://issues.apache.org/jira/browse/SPARK-31000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31000: Assignee: Nicholas Chammas > Add ability to set table description in the catalog > --- > > Key: SPARK-31000 > URL: https://issues.apache.org/jira/browse/SPARK-31000 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 3.1.0 > > > It seems that the catalog supports a {{description}} attribute on tables. > https://github.com/apache/spark/blob/86cc907448f0102ad0c185e87fcc897d0a32707f/sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala#L68 > However, the {{createTable()}} interface doesn't provide any way to set that > attribute. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error
[ https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31517: Assignee: Michael Chirico > SparkR::orderBy with multiple columns descending produces error > --- > > Key: SPARK-31517 > URL: https://issues.apache.org/jira/browse/SPARK-31517 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.5 > Environment: Databricks Runtime 6.5 >Reporter: Ross Bowen >Assignee: Michael Chirico >Priority: Major > Fix For: 3.1.0, 3.2.0 > > > When specifying two columns within an `orderBy()` function, to attempt to get > an ordering by two columns in descending order, an error is returned. > {code:java} > library(magrittr) > library(SparkR) > cars <- cbind(model = rownames(mtcars), mtcars) > carsDF <- createDataFrame(cars) > carsDF %>% > mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), > desc(column("mpg")), desc(column("disp"))))) %>% > head() {code} > This returns an error: > {code:java} > Error in ns[[i]] : subscript out of bounds{code} > This seems to be related to the more general issue that the following code, > excluding the use of the `desc()` function, also fails: > {code:java} > carsDF %>% > mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), > column("mpg"), column("disp")))) %>% > head(){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30889) Add version information to the configuration of Worker
[ https://issues.apache.org/jira/browse/SPARK-30889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30889: Assignee: jiaan.geng > Add version information to the configuration of Worker > -- > > Key: SPARK-30889 > URL: https://issues.apache.org/jira/browse/SPARK-30889 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.0.0, 3.1.0 > > > core/src/main/scala/org/apache/spark/internal/config/Worker.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30856) SQLContext retains reference to unusable instance after SparkContext restarted
[ https://issues.apache.org/jira/browse/SPARK-30856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30856: Assignee: Alex Favaro > SQLContext retains reference to unusable instance after SparkContext restarted > -- > > Key: SPARK-30856 > URL: https://issues.apache.org/jira/browse/SPARK-30856 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.5 >Reporter: Alex Favaro >Assignee: Alex Favaro >Priority: Major > Fix For: 3.1.0 > > > When the underlying SQLContext is instantiated for a SparkSession, the > instance is saved as a class attribute and returned from subsequent calls to > SQLContext.getOrCreate(). If the SparkContext is stopped and a new one > started, the SQLContext class attribute is never cleared so any code which > calls SQLContext.getOrCreate() will get a SQLContext with a reference to the > old, unusable SparkContext. > A similar issue was identified and fixed for SparkSession in SPARK-19055, but > the fix did not change SQLContext as well. I ran into this because mllib > still > [uses|https://github.com/apache/spark/blob/master/python/pyspark/mllib/common.py#L105] > SQLContext.getOrCreate() under the hood. > I've already written a fix for this, which I'll be sharing in a PR, that > clears the class attribute on SQLContext when the SparkSession is stopped. > Another option would be to deprecate SQLContext.getOrCreate() entirely since > the corresponding Scala > [method|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SQLContext.html#getOrCreate-org.apache.spark.SparkContext-] > is itself deprecated. That seems like a larger change for a relatively minor > issue, however. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
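The class-attribute caching pitfall described in SPARK-30856 is easy to reproduce with a toy stand-in for the getOrCreate pattern (all names below are invented for illustration; this is not PySpark's actual code):

```python
class Context:
    """Stand-in for a SparkContext-like resource that can be stopped."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class SQLFacade:
    """Stand-in for SQLContext: caches the last instance on the class."""
    _instance = None

    def __init__(self, ctx):
        self.ctx = ctx

    @classmethod
    def get_or_create(cls, ctx):
        if cls._instance is None:
            cls._instance = cls(ctx)
        return cls._instance


old = Context()
SQLFacade.get_or_create(old)
old.stop()
new = Context()
# Bug: without clearing the cache on stop, callers still get the old,
# unusable context back.
assert SQLFacade.get_or_create(new).ctx is old
# The fix sketched in the issue: clear the class attribute when stopping.
SQLFacade._instance = None
assert SQLFacade.get_or_create(new).ctx is new
```

The issue's proposed fix does exactly this clearing from the SparkSession stop path, mirroring what SPARK-19055 did for SparkSession itself.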
[jira] [Assigned] (SPARK-30988) Add more edge-case exercising values to stats tests
[ https://issues.apache.org/jira/browse/SPARK-30988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30988: Assignee: Maxim Gekk > Add more edge-case exercising values to stats tests > --- > > Key: SPARK-30988 > URL: https://issues.apache.org/jira/browse/SPARK-30988 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > Add more edge-cases to StatisticsCollectionTestBase -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30839) Add version information for Spark configuration
[ https://issues.apache.org/jira/browse/SPARK-30839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30839: Assignee: jiaan.geng > Add version information for Spark configuration > --- > > Key: SPARK-30839 > URL: https://issues.apache.org/jira/browse/SPARK-30839 > Project: Spark > Issue Type: Improvement > Components: Documentation, DStreams, Kubernetes, Mesos, Spark Core, > SQL, Structured Streaming, YARN >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.0.0, 3.1.0 > > > Spark ConfigEntry and ConfigBuilder are missing the Spark version in which each > configuration was released. This is inconvenient for Spark users when they visit > the Spark configuration page. > http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30821) Executor pods with multiple containers will not be rescheduled unless all containers fail
[ https://issues.apache.org/jira/browse/SPARK-30821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30821: Assignee: Shiqi Sun (was: Apache Spark) > Executor pods with multiple containers will not be rescheduled unless all > containers fail > - > > Key: SPARK-30821 > URL: https://issues.apache.org/jira/browse/SPARK-30821 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Kevin Hogeland >Assignee: Shiqi Sun >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > Since the restart policy of launched pods is Never, additional handling is > required for pods that may have sidecar containers. The executor should be > considered failed if any containers have terminated and have a non-zero exit > code, but Spark currently only checks the pod phase. The pod phase will > remain "running" as long as _any_ containers are still running. Kubernetes sidecar > support in 1.18/1.19 does not address this situation, as sidecar containers > are excluded from pod phase calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30795) Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()
[ https://issues.apache.org/jira/browse/SPARK-30795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30795: Assignee: Kris Mok > Spark SQL codegen's code() interpolator should treat escapes like Scala's > StringContext.s() > --- > > Key: SPARK-30795 > URL: https://issues.apache.org/jira/browse/SPARK-30795 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 3.1.0 > > > The {{code()}} string interpolator in Spark SQL's code generator should treat > escapes like Scala's builtin {{StringContext.s()}} interpolator, i.e. it > should treat escapes in the code parts, and should not treat escapes in the > input arguments. > For example, > {code} > val arg = "This is an argument." > val str = s"This is string part 1. $arg This is string part 2." > val code = code"This is string part 1. $arg This is string part 2." > assert(code.toString == str) > {code} > We should expect the {{code()}} interpolator to produce the same thing as the > {{StringContext.s()}} interpolator, where only escapes in the string parts > should be treated, while the args should be kept verbatim. > But in the current implementation, due to the eager folding of code parts and > literal input args, the escape treatment is incorrectly done on both code > parts and literal args. > That causes a problem when an arg contains escape sequences and wants to > preserve that in the final produced code string. For example, in {{Like}} > expression's codegen, there's an ugly workaround for this bug: > {code} > // We need double escape to avoid > org.codehaus.commons.compiler.CompileException. > // '\\' will cause exception 'Single quote must be backslash-escaped in > character literal'. > // '\"' will cause exception 'Line break in literal not allowed'.
> val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') { > s"""\\$escapeChar""" > } else { > escapeChar > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
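The intended contract (process escapes in the literal template parts, splice the arguments in verbatim) can be sketched outside Scala. `interpolate` below is an invented helper that illustrates only the semantics, not Spark's actual `code()` interpolator:

```python
import codecs

def interpolate(parts, args):
    """Decode backslash escapes in the literal template parts, as Scala's
    StringContext.s does, while splicing the args in verbatim."""
    out = []
    for i, part in enumerate(parts):
        out.append(codecs.decode(part, "unicode_escape"))
        if i < len(args):
            out.append(str(args[i]))  # args get no escape processing
    return "".join(out)

# An escape sequence in a literal part is processed ("\\t" becomes a tab) ...
assert interpolate(["a\\tb"], []) == "a\tb"
# ... but backslashes inside an argument survive verbatim, which is exactly
# what the Like codegen workaround above is compensating for.
assert interpolate(["escape char: ", ""], ["\\\\"]) == "escape char: \\\\"
```

Under the buggy eager-folding behavior described in the issue, the second assertion would fail because the argument's backslashes would be escape-processed too.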
[jira] [Assigned] (SPARK-30733) Fix SparkR tests per testthat and R version upgrade
[ https://issues.apache.org/jira/browse/SPARK-30733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30733: Assignee: Hyukjin Kwon > Fix SparkR tests per testthat and R version upgrade > --- > > Key: SPARK-30733 > URL: https://issues.apache.org/jira/browse/SPARK-30733 > Project: Spark > Issue Type: Test > Components: SparkR, SQL >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 2.4.6, 3.0.0, 3.1.0 > > > 5 SparkR tests appear to fail after upgrading to testthat 2.0.0 and R 3.5.x > {code} > test_context.R:49: failure: Check masked functions > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 6 - 4 == 2 > test_context.R:53: failure: Check masked functions > sort(maskedCompletely, na.last = TRUE) not equal to > sort(namesOfMaskedCompletely, na.last = TRUE). > 5/6 mismatches > x[2]: "endsWith" > y[2]: "filter" > x[3]: "filter" > y[3]: "not" > x[4]: "not" > y[4]: "sample" > x[5]: "sample" > y[5]: NA > x[6]: "startsWith" > y[6]: NA > {code} > {code} > test_includePackage.R:31: error: include inside function > package or namespace load failed for 'plyr': > package 'plyr' was installed by an R version with different internals; > it needs to be reinstalled for use with this R version > Seems it's a package installation issue. Looks like plyr has to be > re-installed.
> {code} > {code} > test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA > as date and time > Your system is mis-configured: '/etc/localtime' is not a symlink > test_sparkSQL.R:504: warning: SPARK-17811: can create DataFrame containing NA > as date and time > Your system is mis-configured: '/etc/localtime' is not a symlink > {code} > {code} > test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA > as date and time > It is strongly recommended to set envionment variable TZ to > 'America/Los_Angeles' (or equivalent) > test_sparkSQL.R:504: warning: SPARK-17811: can create DataFrame containing NA > as date and time > It is strongly recommended to set envionment variable TZ to > 'America/Los_Angeles' (or equivalent) > {code} > {code} > test_sparkSQL.R:1814: error: string operators > unable to find an inherited method for function 'startsWith' for > signature '"character"' > 1: expect_true(startsWith("Hello World", "Hello")) at > /home/jenkins/workspace/SparkPullRequestBuilder@2/R/pkg/tests/fulltests/test_sparkSQL.R:1814 > 2: quasi_label(enquo(object), label) > 3: eval_bare(get_expr(quo), get_env(quo)) > 4: startsWith("Hello World", "Hello") > 5: (function (classes, fdef, mtable) >{ >methods <- .findInheritedMethods(classes, fdef, mtable) >if (length(methods) == 1L) >return(methods[[1L]]) >else if (length(methods) == 0L) { >cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", > collapse = ", ") >stop(gettextf("unable to find an inherited method for function %s > for signature %s", >sQuote(fdef@generic), sQuote(cnames)), domain = NA) >} >else stop("Internal error in finding inherited methods; didn't return > a unique method", >domain = NA) >})(list("character"), new("nonstandardGenericFunction", .Data = function > (x, prefix) >{ >standardGeneric("startsWith") >}, generic = structure("startsWith", package = "SparkR"), package = > "SparkR", group = list(), >valueClass = character(0),
signature = c("x", "prefix"), default = > NULL, skeleton = (function (x, >prefix) >stop("invalid call in method dispatch to 'startsWith' (no default > method)", domain = NA))(x, >prefix)), ) > 6: stop(gettextf("unable to find an inherited method for function %s for > signature %s", >sQuote(fdef@generic), sQuote(cnames)), domain = NA) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28646) Allow usage of `count` only for parameterless aggregate function
[ https://issues.apache.org/jira/browse/SPARK-28646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28646: Assignee: jiaan.geng (was: Apache Spark) > Allow usage of `count` only for parameterless aggregate function > > > Key: SPARK-28646 > URL: https://issues.apache.org/jira/browse/SPARK-28646 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dylan Guedes >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > Currently, Spark allows calls to `count` even for non parameterless aggregate > function. For example, the following query actually works: > {code:sql}SELECT count() OVER () FROM tenk1;{code} > In PgSQL, on the other hand, the following error is thrown: > {code:sql}ERROR: count(*) must be used to call a parameterless aggregate > function{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats
[ https://issues.apache.org/jira/browse/SPARK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27497: Assignee: Bruce Robbins > Spark wipes out bucket spec in metastore when updating table stats > -- > > Key: SPARK-27497 > URL: https://issues.apache.org/jira/browse/SPARK-27497 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Fix For: 2.4.6, 3.1.0 > > > The bucket spec gets wiped out after Spark writes to a Hive-bucketed table > that has the following characteristics: > - table is created by Hive (or even Spark, if you use HQL DDL) > - table is stored in Parquet format > - table has at least one Hive-created data file already > Also, spark.sql.hive.convertMetastoreParquet has to be set to true (the > default). > For example, do the following in Hive: > {noformat} > hive> create table sourcetable as select 1 a, 3 b, 7 c; > hive> drop table hivebucket1; > hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) > sorted by (a, b asc) into 10 buckets stored as parquet; > hive> insert into hivebucket1 select * from sourcetable; > hive> show create table hivebucket1; > OK > CREATE TABLE `hivebucket1`( > `a` int, > `b` int, > `c` int) > CLUSTERED BY ( > a, > b) > SORTED BY ( > a ASC, > b ASC) > INTO 10 BUCKETS > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION > 'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1' > TBLPROPERTIES ( > 'COLUMN_STATS_ACCURATE'='true', > 'numFiles'='1', > 'numRows'='1', > 'rawDataSize'='3', > 'totalSize'='352', > 'transient_lastDdlTime'='142971') > Time taken: 0.056 seconds, Fetched: 26 row(s) > hive> > {noformat} > Then in spark-shell, do 
the following: > {noformat} > scala> sql("insert into hivebucket1 select 1, 3, 7") > 19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > res0: org.apache.spark.sql.DataFrame = [] > {noformat} > Note: At this point, I would have expected Spark to throw an > {{AnalysisException}} with the message "Output Hive table > `default`.`hivebucket1` is bucketed...". However, I am ignoring that for now > and may open a separate Jira (SPARK-27498). > Return to some Hive CLI and note that the bucket specification is gone from > the table definition: > {noformat} > hive> show create table hivebucket1; > OK > CREATE TABLE `hivebucket1`( > `a` int, > `b` int, > `c` int) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION > '' > TBLPROPERTIES ( > 'COLUMN_STATS_ACCURATE'='false', > 'SORTBUCKETCOLSPREFIX'='TRUE', > 'numFiles'='2', > 'numRows'='-1', > 'rawDataSize'='-1', > 'totalSize'='1144', > 'transient_lastDdlTime'='123374') > Time taken: 1.619 seconds, Fetched: 20 row(s) > hive> > {noformat} > This information is lost when Spark attempts to update table stats. > HiveClientImpl.toHiveTable drops the bucket specification. toHiveTable drops > the bucket information because {{table.provider}} is None instead of "hive". > {{table.provider}} is not "hive" because Spark bypassed the serdes and used > the built-in parquet code path (by default, > spark.sql.hive.convertMetastoreParquet is true). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
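The lossy conversion described above can be modeled with a toy version of the provider check (field names are invented for illustration; this is not HiveClientImpl's real structure, only the failure mode):

```python
def to_hive_table(catalog_table):
    """Toy model of HiveClientImpl.toHiveTable's behavior: the bucket
    spec survives the conversion only when the provider is 'hive'."""
    hive_table = {"name": catalog_table["name"]}
    if catalog_table.get("provider") == "hive":
        hive_table["bucket_spec"] = catalog_table.get("bucket_spec")
    return hive_table

table = {
    "name": "hivebucket1",
    "bucket_spec": {"num_buckets": 10, "cols": ["a", "b"]},
    # provider is None because the write went through the built-in parquet
    # path (spark.sql.hive.convertMetastoreParquet=true), bypassing the serdes.
    "provider": None,
}
assert "bucket_spec" not in to_hive_table(table)  # spec silently dropped

table["provider"] = "hive"
assert to_hive_table(table)["bucket_spec"]["num_buckets"] == 10
```

This mirrors the issue's diagnosis: the stats update itself is harmless, but the round-trip through a conversion that keys on `table.provider` wipes the bucket spec.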
[jira] [Assigned] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28367: Assignee: Gabor Somogyi > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Critical > Fix For: 3.1.0 > > > Spark uses an old and deprecated API named poll(long) which never returns and > stays in live lock if metadata is not updated (for instance when broker > disappears at consumer creation). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33670) Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED
[ https://issues.apache.org/jira/browse/SPARK-33670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33670: Assignee: Maxim Gekk (was: Apache Spark) > Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED > --- > > Key: SPARK-33670 > URL: https://issues.apache.org/jira/browse/SPARK-33670 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Invoke the check verifyPartitionProviderIsHive() from v1 implementation of > SHOW TABLE EXTENDED. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33697) UnionExec should require column ordering in RemoveRedundantProjects
[ https://issues.apache.org/jira/browse/SPARK-33697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33697: Assignee: Allison Wang (was: Apache Spark) > UnionExec should require column ordering in RemoveRedundantProjects > --- > > Key: SPARK-33697 > URL: https://issues.apache.org/jira/browse/SPARK-33697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.1.0 > > > UnionExec requires its children's columns to have the same order in order to > merge the columns. Currently, the physical rule `RemoveRedundantProjects` can > pass through the ordering requirements from its parent and incorrectly remove > the necessary project nodes below a union operation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
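Why the removed project matters: union merges its children by column position, so a child whose native output order differs needs the reordering project to stay in the plan. A schematic sketch (not Spark's physical operators; the row data is made up):

```python
def union_by_position(rows_a, rows_b):
    """Union combines children positionally: column i of one child lines
    up with column i of the other, regardless of column names."""
    return rows_a + rows_b

# Child A produces rows as (id, amount); child B natively produces (amount, id).
a = [(1, 100)]
b_with_project = [(2, 200)]  # a Project above B reorders to (id, amount)
b_raw = [(200, 2)]           # what B emits if the Project is removed

# With the project, the union is consistent:
assert union_by_position(a, b_with_project) == [(1, 100), (2, 200)]
# Without it, ids and amounts are silently mixed in the same columns:
assert union_by_position(a, b_raw) == [(1, 100), (200, 2)]
```

This is why `RemoveRedundantProjects` must treat the project under a union as ordering-sensitive rather than redundant.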
[jira] [Assigned] (SPARK-33803) Sort table properties by key in DESCRIBE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33803: Assignee: Hyukjin Kwon (was: Apache Spark) > Sort table properties by key in DESCRIBE TABLE command > -- > > Key: SPARK-33803 > URL: https://issues.apache.org/jira/browse/SPARK-33803 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.1.0 > > > Currently: > {code} > -- !query > DESC FORMATTED v > -- !query schema > struct > -- !query output > a string > b int > c string > d string > > # Detailed Table Information > Database default > Table v > Created Time [not included in comparison] > Last Access [not included in comparison] > Created By [not included in comparison] > Type VIEW > View Text SELECT * FROM t > View Original Text SELECT * FROM t > View Catalog and Namespace spark_catalog.default > View Query Output Columns [a, b, c, d] > Table Properties [view.catalogAndNamespace.numParts=2, > view.catalogAndNamespace.part.0=spark_catalog, > view.catalogAndNamespace.part.1=default, view.query.out.col.0=a, > view.query.out.col.1=b, view.query.out.col.2=c, view.query.out.col.3=d, > view.query.out.numCols=4, view.referredTempFunctionsNames=[], > view.referredTempViewNames=[]] > {code} > The order of "Table Properties" is nondeterministic, which makes the test above > fail in other environments. It would be best to sort it by key. This is also > consistent with the DSv2 command. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
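Making that output deterministic is just a sort by key before rendering. A minimal sketch of the idea (`render_table_properties` is an invented name, not the actual DESCRIBE implementation):

```python
def render_table_properties(props):
    """Render a properties map as '[k1=v1, k2=v2, ...]' sorted by key,
    so the printed output is stable across environments."""
    items = sorted(props.items())  # deterministic ordering by key
    return "[" + ", ".join(f"{k}={v}" for k, v in items) + "]"

# Insertion order no longer matters once the keys are sorted:
props = {"view.query.out.numCols": "4", "view.query.out.col.0": "a"}
assert render_table_properties(props) == "[view.query.out.col.0=a, view.query.out.numCols=4]"
```

The same rendering then produces identical golden-file output regardless of the underlying map's iteration order.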
[jira] [Assigned] (SPARK-33901) Char and Varchar display error after DDLs
[ https://issues.apache.org/jira/browse/SPARK-33901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33901: Assignee: Kent Yao (was: Apache Spark) > Char and Varchar display error after DDLs > - > > Key: SPARK-33901 > URL: https://issues.apache.org/jira/browse/SPARK-33901 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.1.0 > > > CTAS / CREATE TABLE LIKE/ CVAS/ alter table add columns -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33907) Only prune columns of from_json if parsing options is empty
[ https://issues.apache.org/jira/browse/SPARK-33907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33907: Assignee: L. C. Hsieh (was: Apache Spark) > Only prune columns of from_json if parsing options is empty > --- > > Key: SPARK-33907 > URL: https://issues.apache.org/jira/browse/SPARK-33907 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > For safety, we should only prune columns from from_json expression if the > parsing option is empty. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34181) Update build doc help document
[ https://issues.apache.org/jira/browse/SPARK-34181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34181: Assignee: angerszhu > Update build doc help document > -- > > Key: SPARK-34181 > URL: https://issues.apache.org/jira/browse/SPARK-34181 > Project: Spark > Issue Type: Improvement > Components: docs >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.0.2, 3.1.1 > > > According to https://github.com/jekyll/jekyll/issues/8523 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25872) Add an optimizer tracker for TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-25872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-25872: --- Assignee: wuyi (was: Apache Spark) > Add an optimizer tracker for TPC-DS queries > --- > > Key: SPARK-25872 > URL: https://issues.apache.org/jira/browse/SPARK-25872 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > Used to track the optimized plans of all TPC-DS queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281478#comment-17281478 ] Apache Spark commented on SPARK-34080: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/31531 > Add UnivariateFeatureSelector to deprecate existing selectors > - > > Key: SPARK-34080 > URL: https://issues.apache.org/jira/browse/SPARK-34080 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0, 3.1.1 >Reporter: Xiangrui Meng >Assignee: Huaxin Gao >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > > In SPARK-26111, we introduced a few univariate feature selectors, which share > a common set of params. They are named after the underlying test, which > requires users to understand the test to find the matching scenarios. It would > be nice if we introduced a single class called UnivariateFeatureSelector that > accepts a selection criterion and a score method (string names). Then we can > deprecate all other univariate selectors. > For the params, instead of asking users to provide what score function to use, > it is friendlier to ask users to specify the feature and label types > (continuous or categorical) and we set a default score function for each > combo. We can also detect the types from feature metadata if given. Advanced > users can override it (if there are multiple score functions that are > compatible with the feature type and label type combo). Example (param names > are not finalized): > {code} > selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], > labelCol=["target"], featureType="categorical", labelType="continuous", > select="bestK", k=100) > {code} > cc: [~huaxingao] [~ruifengz] [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32736) Avoid caching the removed decommissioned executors in TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-32736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-32736: --- Assignee: wuyi (was: Apache Spark) > Avoid caching the removed decommissioned executors in TaskSchedulerImpl > --- > > Key: SPARK-32736 > URL: https://issues.apache.org/jira/browse/SPARK-32736 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > We can save the host directly in the ExecutorDecommissionState. Therefore, > when the executor lost, we could unregister the shuffle map status on the > host. Thus, we don't need to hold the cache to wait for FetchFailureException > to do the unregister. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32689) HiveSerDeReadWriteSuite and ScriptTransformationSuite currently fail under the hive1.2 profile in branch-3.0 and master
[ https://issues.apache.org/jira/browse/SPARK-32689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-32689: --- Assignee: L. C. Hsieh > HiveSerDeReadWriteSuite and ScriptTransformationSuite are currently failed > under hive1.2 profile in branch-3.0 and master > - > > Key: SPARK-32689 > URL: https://issues.apache.org/jira/browse/SPARK-32689 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.1, 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > There are three tests which are currently failed under hive1.2 profiles in > branch-3.0 and master branches: > org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.Read/Write Hive > PARQUET serde table > org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.Read/Write Hive > TEXTFILE serde table > org.apache.spark.sql.hive.execution.ScriptTransformationSuite.SPARK-32608: > Script Transform ROW FORMAT DELIMIT value should format value > Please see [https://github.com/apache/spark/pull/29517]. > This test is failed under hive1.2 profiles in master branch: > org.apache.spark.sql.hive.orc.HiveOrcHadoopFsRelationSuite.save()/load() - > partitioned table - simple queries - partition columns in data -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33139) protect setActiveSession and clearActiveSession
[ https://issues.apache.org/jira/browse/SPARK-33139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-33139: --- Assignee: jiahong.li > protect setActiveSession and clearActiveSession > --- > > Key: SPARK-33139 > URL: https://issues.apache.org/jira/browse/SPARK-33139 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: jiahong.li >Priority: Major > Fix For: 3.1.0 > > > This PR is a sub-task of > [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to > make SQLConf.get reliable and stable, we need to make sure users can't pollute > the SQLConf and SparkSession context by calling setActiveSession and > clearActiveSession. > Changes in the PR: > * add a legacy config spark.sql.legacy.allowModifyActiveSession to fall back to > the old behavior if users do need to call these two APIs > * by default, calling these two APIs throws an exception > * add two extra internal, private APIs, setActiveSessionInternal and > clearActiveSessionInternal, for current internal usage > * change all internal references to the new internal APIs, except for > SQLContext.setActive and SQLContext.clearActive -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
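The protection described above is a guarded-setter pattern: the public entry point refuses modification unless a legacy flag is enabled, while a separate internal entry point keeps working for Spark's own code paths. A rough Python sketch under those assumptions (class and method names are illustrative, not Spark's):

```python
class ActiveSessionRegistry:
    """Toy model of the setActiveSession/clearActiveSession protection."""

    def __init__(self, allow_modify_active_session=False):
        # mirrors the spark.sql.legacy.allowModifyActiveSession fallback config
        self.allow_modify = allow_modify_active_session
        self._active = None

    def set_active_session(self, session):
        # public API: throws by default unless the legacy config is enabled
        if not self.allow_modify:
            raise RuntimeError(
                "Modifying the active session is disallowed; enable the "
                "legacy config to restore the old behavior")
        self._set_active_session_internal(session)

    def _set_active_session_internal(self, session):
        # internal API: always allowed, used only by the framework itself
        self._active = session

registry = ActiveSessionRegistry()
registry._set_active_session_internal("internal-session")  # framework path: ok
```

Splitting the API this way is what makes the global state "reliable and stable": every mutation of the active session now goes through a code path the framework controls.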
[jira] [Assigned] (SPARK-33105) Broken installation of source packages on AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-33105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-33105: --- Assignee: Maciej Szymkiewicz > Broken installation of source packages on AppVeyor > -- > > Key: SPARK-33105 > URL: https://issues.apache.org/jira/browse/SPARK-33105 > Project: Spark > Issue Type: Bug > Components: Project Infra, R >Affects Versions: 3.1.0 > Environment: *strong text* >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > It looks like AppVeyor configuration is broken, which leads to failure of > installation of source packages (become a problem when {{rlang}} has been > updated from 0.4.7 and 0.4.8, with latter available only as a source package). > {code} > [00:01:48] trying URL > 'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz' > [00:01:48] Content type 'application/x-gzip' length 847517 bytes (827 KB) > [00:01:48] == > [00:01:48] downloaded 827 KB > [00:01:48] > [00:01:48] Warning in strptime(xx, f, tz = tz) : > [00:01:48] unable to identify current timezone 'C': > [00:01:48] please set environment variable 'TZ' > [00:01:49] * installing *source* package 'rlang' ... 
> [00:01:49] ** package 'rlang' successfully unpacked and MD5 sums checked > [00:01:49] ** using staged installation > [00:01:49] ** libs > [00:01:49] > [00:01:49] *** arch - i386 > [00:01:49] C:/Rtools40/mingw64/bin/gcc -I"C:/R/include" -DNDEBUG > -I./lib/ -O2 -Wall -std=gnu99 -mfpmath=sse -msse2 > -mstackrealign -c capture.c -o capture.o > [00:01:49] C:/Rtools40/mingw64/bin/gcc -I"C:/R/include" -DNDEBUG > -I./lib/ -O2 -Wall -std=gnu99 -mfpmath=sse -msse2 > -mstackrealign -c export.c -o export.o > [00:01:49] C:/Rtools40/mingw64/bin/gcc -I"C:/R/include" -DNDEBUG > -I./lib/ -O2 -Wall -std=gnu99 -mfpmath=sse -msse2 > -mstackrealign -c internal.c -o internal.o > [00:01:50] In file included from ./lib/rlang.h:74, > [00:01:50] from internal/arg.c:1, > [00:01:50] from internal.c:1: > [00:01:50] internal/eval-tidy.c: In function 'rlang_tilde_eval': > [00:01:50] ./lib/env.h:33:10: warning: 'top' may be used uninitialized > in this function [-Wmaybe-uninitialized] > [00:01:50]return ENCLOS(env); > [00:01:50] ^~~ > [00:01:50] In file included from internal.c:8: > [00:01:50] internal/eval-tidy.c:406:9: note: 'top' was declared here > [00:01:50]sexp* top; > [00:01:50] ^~~ > [00:01:50] C:/Rtools40/mingw64/bin/gcc -I"C:/R/include" -DNDEBUG > -I./lib/ -O2 -Wall -std=gnu99 -mfpmath=sse -msse2 > -mstackrealign -c lib.c -o lib.o > [00:01:51] C:/Rtools40/mingw64/bin/gcc -I"C:/R/include" -DNDEBUG > -I./lib/ -O2 -Wall -std=gnu99 -mfpmath=sse -msse2 > -mstackrealign -c version.c -o version.o > [00:01:52] C:/Rtools40/mingw64/bin/gcc -shared -s -static-libgcc -o > rlang.dll tmp.def capture.o export.o internal.o lib.o version.o > -LC:/R/bin/i386 -lR > [00:01:52] > c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > skipping incompatible C:/R/bin/i386/R.dll when searching for -lR > [00:01:52] > c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > skipping incompatible C:/R/bin/i386/R.dll 
when searching for -lR > [00:01:52] > c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > cannot find -lR > [00:01:52] collect2.exe: error: ld returned 1 exit status > [00:01:52] no DLL was created > [00:01:52] ERROR: compilation failed for package 'rlang' > [00:01:52] * removing 'C:/RLibrary/rlang' > [00:01:52] > [00:01:52] The downloaded source packages are in > [00:01:52] > 'C:\Users\appveyor\AppData\Local\Temp\1\Rtmp8qrryA\downloaded_packages' > [00:01:52] Warning message: > [00:01:52] In install.packages(c("knitr", "rmarkdown", "testthat", > "e1071", : > [00:01:52] installation of package 'rlang' had non-zero exit status > {code} > This leads to failures to install {{devtools}} and generate Rd files and, as > a result, CRAN check failure. > There are some discrepancies in the > {{dev/appveyor-install-dependencies.ps1}}, but the direct source of this > issue seems to be {{$env:BINPREF}}, which forces usage of 64 bit mingw, even > if packages are compiled for 32 bit. > Modifying the variable to include current architecture: > {code} > $env:BINPREF=$RtoolsDrive + '/Rtools40/mingw$(WIN)/bin/' > {code} > (as proposed [here|https://stackoverflow.com/a/44035904] by R Yoda) looks > like a valid fix, though we might want to clean remaining issues as well. -- This message was sent by Atlassian Jira
[jira] [Assigned] (SPARK-33169) Check propagation of datasource options to underlying file system for built-in file-based datasources
[ https://issues.apache.org/jira/browse/SPARK-33169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-33169: --- Assignee: Maxim Gekk > Check propagation of datasource options to underlying file system for > built-in file-based datasources > - > > Key: SPARK-33169 > URL: https://issues.apache.org/jira/browse/SPARK-33169 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > Add a common trait with a test to check that datasource options are > propagated to underlying file systems. Individual tests were already added by > SPARK-33094 and SPARK-33089. The ticket aims to de-duplicate the tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33165) Remove dependencies(scalatest,scalactic) from Benchmark
[ https://issues.apache.org/jira/browse/SPARK-33165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-33165: --- Assignee: Takeshi Yamamuro > Remove dependencies(scalatest,scalactic) from Benchmark > --- > > Key: SPARK-33165 > URL: https://issues.apache.org/jira/browse/SPARK-33165 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.1, 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > This ticket aims at removing `assert` from `Benchmark` for making it easier > to run benchmark codes via `spark-submit`. > Since the current `Benchmark` (`master` and `branch-3.0`) has `assert`, we > need to pass the proper jars of `scalatest` and `scalactic`; > - scalatest-core_2.12-3.2.0.jar > - scalatest-compatible-3.2.0.jar > - scalactic_2.12-3.0.jar > {code} > ./bin/spark-submit --jars > scalatest-core_2.12-3.2.0.jar,scalatest-compatible-3.2.0.jar,scalactic_2.12-3.0.jar,./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar > --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark > ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location > /tmp/tpcds-sf1 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33265) Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-33265: --- Assignee: Kousuke Saruta (was: Apache Spark) > Rename classOf[Seq] to classOf[scala.collection.Seq] in > PostgresIntegrationSuite for Scala 2.13 > --- > > Key: SPARK-33265 > URL: https://issues.apache.org/jira/browse/SPARK-33265 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.1.0 > > > In PostgresIntegrationSuite, evaluation of classOf[Seq].isAssignableFrom > fails due to a ClassCastException. > The reason is the same as what was resolved in SPARK-29292, but this happens at > test time, not compile time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33860) Make CatalystTypeConverters.convertToCatalyst match special Array value
[ https://issues.apache.org/jira/browse/SPARK-33860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-33860: --- Assignee: ulysses you > Make CatalystTypeConverters.convertToCatalyst match special Array value > --- > > Key: SPARK-33860 > URL: https://issues.apache.org/jira/browse/SPARK-33860 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.1.0 > > > Array[Any] doesn't match Array[Int]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34116) Separate state store numKeys metric test
[ https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-34116: --- Assignee: L. C. Hsieh (was: Apache Spark) > Separate state store numKeys metric test > > > Key: SPARK-34116 > URL: https://issues.apache.org/jira/browse/SPARK-34116 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Affects Versions: 3.2.0, 3.1.1 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.1.1 > > > Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed > with numKeys metric test. I found it is flaky when I was testing with other > StateStore implementation. Specifically, we also are able to check these > metrics after state store is updated (committed). So I think we can refactor > the test a little bit to make it easier to incorporate other StateStore > externally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34027) ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
[ https://issues.apache.org/jira/browse/SPARK-34027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-34027: --- Assignee: Maxim Gekk (was: Apache Spark) > ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache > --- > > Key: SPARK-34027 > URL: https://issues.apache.org/jira/browse/SPARK-34027 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1, 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: correctness > Fix For: 3.0.2, 3.2.0, 3.1.1 > > > Here is the example to reproduce the issue: > {code:sql} > spark-sql> create table tbl (col int, part int) using parquet partitioned by > (part); > spark-sql> insert into tbl partition (part=0) select 0; > spark-sql> cache table tbl; > spark-sql> select * from tbl; > 0 0 > spark-sql> show table extended like 'tbl' partition(part=0); > default tbl false Partition Values: [part=0] > Location: > file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 > ... > {code} > Add new partition by copying the existing one: > {code} > cp -r > /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 > > /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1 > {code} > Recover and select the table: > {code} > spark-sql> alter table tbl recover partitions; > spark-sql> select * from tbl; > 0 0 > {code} > We see only old data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
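The bug above is a classic cache-invalidation miss: `RECOVER PARTITIONS` changes the set of partitions backing the table, but the cached scan result is never refreshed, so reads keep returning the old data. A toy Python model of the failure mode and the fix (all names are illustrative; this is not Spark's caching code):

```python
class PartitionedTable:
    """Toy model: a partitioned table whose full-scan result is cached."""

    def __init__(self):
        self.partitions = {0: [0]}   # partition value -> rows
        self._cache = None

    def select_all(self):
        # serve from cache when present, like a cached table scan
        if self._cache is None:
            self._cache = [r for rows in self.partitions.values() for r in rows]
        return self._cache

    def recover_partitions(self, discovered):
        # register partitions found on disk, as RECOVER PARTITIONS does
        self.partitions.update(discovered)
        self._cache = None  # the fix: invalidate the cached result

tbl = PartitionedTable()
tbl.select_all()               # populates the cache with the single row
tbl.recover_partitions({1: [0]})
rows_after = tbl.select_all()  # without the invalidation this would miss part=1
```

Dropping the `self._cache = None` line reproduces the reported behavior: the newly recovered partition exists in the metastore but never shows up in query results until the cache is cleared by other means.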
[jira] [Resolved] (SPARK-34374) Use standard methods to extract keys or values from a Map.
[ https://issues.apache.org/jira/browse/SPARK-34374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34374. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31484 [https://github.com/apache/spark/pull/31484] > Use standard methods to extract keys or values from a Map. > -- > > Key: SPARK-34374 > URL: https://issues.apache.org/jira/browse/SPARK-34374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > Fix For: 3.2.0 > > > For keys: > *before* > {code:scala} > map.map(_._1) > {code} > *after* > {code:java} > map.keys > {code} > For values: > {code:scala} > map.map(_._2) > {code} > *after* > {code:java} > map.values > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34374) Use standard methods to extract keys or values from a Map.
[ https://issues.apache.org/jira/browse/SPARK-34374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-34374: Assignee: Yang Jie > Use standard methods to extract keys or values from a Map. > -- > > Key: SPARK-34374 > URL: https://issues.apache.org/jira/browse/SPARK-34374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > > For keys: > *before* > {code:scala} > map.map(_._1) > {code} > *after* > {code:java} > map.keys > {code} > For values: > {code:scala} > map.map(_._2) > {code} > *after* > {code:java} > map.values > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-34194. -- Resolution: Won't Fix > Queries that only touch partition columns shouldn't scan through all files > -- > > Key: SPARK-34194 > URL: https://issues.apache.org/jira/browse/SPARK-34194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > When querying only the partition columns of a partitioned table, it seems > that Spark nonetheless scans through all files in the table, even though it > doesn't need to. > Here's an example: > {code:python} > >>> data = spark.read.option('mergeSchema', > >>> 'false').parquet('s3a://some/dataset') > [Stage 0:==> (407 + 12) / > 1158] > {code} > Note the 1158 tasks. This matches the number of partitions in the table, > which is partitioned on a single field named {{file_date}}: > {code:sh} > $ aws s3 ls s3://some/dataset | head -n 3 >PRE file_date=2017-05-01/ >PRE file_date=2017-05-02/ >PRE file_date=2017-05-03/ > $ aws s3 ls s3://some/dataset | wc -l > 1158 > {code} > The table itself has over 138K files, though: > {code:sh} > $ aws s3 ls --recursive --human --summarize s3://some/dataset > ... > Total Objects: 138708 >Total Size: 3.7 TiB > {code} > Now let's try to query just the {{file_date}} field and see what Spark does. 
> {code:python} > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).explain() > == Physical Plan == > TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], > output=[file_date#11]) > +- *(1) ColumnarToRow >+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: > Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct<> > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).show() > [Stage 2:> (179 + 12) / > 41011] > {code} > Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the > job progresses? I'm not sure. > What I do know is that this operation takes a long time (~20 min) running > from my laptop, whereas to list the top-level {{file_date}} partitions via > the AWS CLI take a second or two. > Spark appears to be going through all the files in the table, when it just > needs to list the partitions captured in the S3 "directory" structure. The > query is only touching {{file_date}}, after all. > The current workaround for this performance problem / optimizer wastefulness, > is to [query the catalog > directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot > of extra work compared to the elegant query against {{file_date}} that users > actually intend. > Spark should somehow know when it is only querying partition fields and skip > iterating through all the individual files in a table. > Tested on Spark 3.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281363#comment-17281363 ] Cheng Su commented on SPARK-34194: -- [~nchammas] - I think metadata-only query on partition column cannot be correct reliably with current design, and it's hard to fix. Here's an example - for Parquet/ORC table, there's only 0-row files under partition directory (this can happen). There's no easy way to return result with only metadata, as we don't know whether the file has 0 row or not, until opening the file to observe the file metadata. If we end up opening files, there's not so much performance improvement we can get, compared to actually run the query. For your specific case if you are sure that your data do not have 0-row files problem, I suggest to set config `spark.sql.optimizer.metadataOnly` to true to unblock yourself. > Queries that only touch partition columns shouldn't scan through all files > -- > > Key: SPARK-34194 > URL: https://issues.apache.org/jira/browse/SPARK-34194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > When querying only the partition columns of a partitioned table, it seems > that Spark nonetheless scans through all files in the table, even though it > doesn't need to. > Here's an example: > {code:python} > >>> data = spark.read.option('mergeSchema', > >>> 'false').parquet('s3a://some/dataset') > [Stage 0:==> (407 + 12) / > 1158] > {code} > Note the 1158 tasks. This matches the number of partitions in the table, > which is partitioned on a single field named {{file_date}}: > {code:sh} > $ aws s3 ls s3://some/dataset | head -n 3 >PRE file_date=2017-05-01/ >PRE file_date=2017-05-02/ >PRE file_date=2017-05-03/ > $ aws s3 ls s3://some/dataset | wc -l > 1158 > {code} > The table itself has over 138K files, though: > {code:sh} > $ aws s3 ls --recursive --human --summarize s3://some/dataset > ... 
> Total Objects: 138708 >Total Size: 3.7 TiB > {code} > Now let's try to query just the {{file_date}} field and see what Spark does. > {code:python} > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).explain() > == Physical Plan == > TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], > output=[file_date#11]) > +- *(1) ColumnarToRow >+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: > Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct<> > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).show() > [Stage 2:> (179 + 12) / > 41011] > {code} > Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the > job progresses? I'm not sure. > What I do know is that this operation takes a long time (~20 min) running > from my laptop, whereas to list the top-level {{file_date}} partitions via > the AWS CLI take a second or two. > Spark appears to be going through all the files in the table, when it just > needs to list the partitions captured in the S3 "directory" structure. The > query is only touching {{file_date}}, after all. > The current workaround for this performance problem / optimizer wastefulness, > is to [query the catalog > directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot > of extra work compared to the elegant query against {{file_date}} that users > actually intend. > Spark should somehow know when it is only querying partition fields and skip > iterating through all the individual files in a table. > Tested on Spark 3.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read
[ https://issues.apache.org/jira/browse/SPARK-34404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281348#comment-17281348 ] Apache Spark commented on SPARK-34404: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31529 > Support Avro datasource options to control datetime rebasing in read > > > Key: SPARK-34404 > URL: https://issues.apache.org/jira/browse/SPARK-34404 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new Avro option similar to the SQL configs > {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}{{.}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read
[ https://issues.apache.org/jira/browse/SPARK-34404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34404: Assignee: Apache Spark (was: Maxim Gekk) > Support Avro datasource options to control datetime rebasing in read > > > Key: SPARK-34404 > URL: https://issues.apache.org/jira/browse/SPARK-34404 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.2.0 > > > Add new Avro option similar to the SQL configs > {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}{{.}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read
[ https://issues.apache.org/jira/browse/SPARK-34404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34404: Assignee: Maxim Gekk (was: Apache Spark) > Support Avro datasource options to control datetime rebasing in read > > > Key: SPARK-34404 > URL: https://issues.apache.org/jira/browse/SPARK-34404 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new Avro option similar to the SQL configs > {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}{{.}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read
[ https://issues.apache.org/jira/browse/SPARK-34404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-34404: --- Description: Add new Avro option similar to the SQL configs {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}{{.}} (was: Add new parquet options similar to the SQL configs {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}} and {{spark.sql.legacy.parquet.int96RebaseModeInRead.}}) > Support Avro datasource options to control datetime rebasing in read > > > Key: SPARK-34404 > URL: https://issues.apache.org/jira/browse/SPARK-34404 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new Avro option similar to the SQL configs > {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}{{.}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read
Maxim Gekk created SPARK-34404: -- Summary: Support Avro datasource options to control datetime rebasing in read Key: SPARK-34404 URL: https://issues.apache.org/jira/browse/SPARK-34404 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.2.0 Add new parquet options similar to the SQL configs {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}} and {{spark.sql.legacy.parquet.int96RebaseModeInRead.}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
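For context, the "datetime rebasing" these options control is a calendar conversion: Spark 2.x wrote ancient dates/timestamps in the hybrid Julian calendar, while Spark 3.x uses the proleptic Gregorian calendar, so values before the 1582 Gregorian reform must be shifted on read. A self-contained sketch of that conversion via Julian Day Numbers (illustrative only; Spark's internal implementation differs):

```python
from datetime import date

def rebase_julian_to_gregorian(year, month, day):
    """Convert a Julian-calendar date to the proleptic Gregorian calendar.

    Uses the standard Julian Day Number formula for the Julian calendar,
    then maps the JDN onto Python's proleptic Gregorian `date`.
    """
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # JDN of proleptic Gregorian 0001-01-01 is 1721426, whose toordinal() is 1
    return date.fromordinal(jdn - 1721425)

# The Gregorian reform skipped 10 days: Julian 1582-10-05 is Gregorian 1582-10-15.
reform_day = rebase_julian_to_gregorian(1582, 10, 5)
```

A per-source read option (rather than only a session-wide SQL config) matters because a single job may read both legacy files that need this shift and new files that must not get it.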
[jira] [Updated] (SPARK-34168) Support DPP in AQE when the join is a broadcast hash join before applying the AQE rules
[ https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34168: -- Parent: SPARK-33828 Issue Type: Sub-task (was: Improvement) > Support DPP in AQE when the join is a broadcast hash join before applying the > AQE rules > - > > Key: SPARK-34168 > URL: https://issues.apache.org/jira/browse/SPARK-34168 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.2.0 > > > Currently, AQE and DPP cannot both be applied at the same time. This PR enables AQE > and DPP together when the join is a broadcast hash join at the beginning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281269#comment-17281269 ] Nicholas Chammas commented on SPARK-34194: -- It's not clear to me whether SPARK-26709 is describing an inherent design issue that has no fix, or whether SPARK-26709 simply captures a bug in the past implementation of {{OptimizeMetadataOnlyQuery}} which could conceivably be fixed in the future. If it's something that could be fixed and reintroduced, this issue should stay open. If we know for design reasons that metadata-only queries cannot be made reliably correct, then this issue should be closed with a clear explanation to that effect. > Queries that only touch partition columns shouldn't scan through all files > -- > > Key: SPARK-34194 > URL: https://issues.apache.org/jira/browse/SPARK-34194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > When querying only the partition columns of a partitioned table, it seems > that Spark nonetheless scans through all files in the table, even though it > doesn't need to. > Here's an example: > {code:python} > >>> data = spark.read.option('mergeSchema', > >>> 'false').parquet('s3a://some/dataset') > [Stage 0:==> (407 + 12) / > 1158] > {code} > Note the 1158 tasks. This matches the number of partitions in the table, > which is partitioned on a single field named {{file_date}}: > {code:sh} > $ aws s3 ls s3://some/dataset | head -n 3 >PRE file_date=2017-05-01/ >PRE file_date=2017-05-02/ >PRE file_date=2017-05-03/ > $ aws s3 ls s3://some/dataset | wc -l > 1158 > {code} > The table itself has over 138K files, though: > {code:sh} > $ aws s3 ls --recursive --human --summarize s3://some/dataset > ... > Total Objects: 138708 >Total Size: 3.7 TiB > {code} > Now let's try to query just the {{file_date}} field and see what Spark does. 
> {code:python} > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).explain() > == Physical Plan == > TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], > output=[file_date#11]) > +- *(1) ColumnarToRow >+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: > Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct<> > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).show() > [Stage 2:> (179 + 12) / > 41011] > {code} > Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the > job progresses? I'm not sure. > What I do know is that this operation takes a long time (~20 min) running > from my laptop, whereas listing the top-level {{file_date}} partitions via > the AWS CLI takes a second or two. > Spark appears to be going through all the files in the table, when it just > needs to list the partitions captured in the S3 "directory" structure. The > query is only touching {{file_date}}, after all. > The current workaround for this performance problem / optimizer wastefulness > is to [query the catalog > directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot > of extra work compared to the elegant query against {{file_date}} that users > actually intend. > Spark should somehow know when it is only querying partition fields and skip > iterating through all the individual files in a table. > Tested on Spark 3.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
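The core observation in the report above, that the latest {{file_date}} is recoverable from the Hive-style directory names alone without touching any data file, can be sketched in plain Python. This is an illustration, not Spark or AWS code: the `prefixes` list stands in for the top-level "directory" prefixes an S3 list call with `delimiter='/'` would return for this table.

```python
# Illustration only: 'prefixes' stands in for the top-level "directory"
# names an S3 listing with delimiter='/' would return for the table root.
prefixes = [
    "file_date=2017-05-01/",
    "file_date=2017-05-02/",
    "file_date=2017-05-03/",
]

def partition_values(prefixes, column):
    """Extract values from Hive-style 'column=value/' partition prefixes."""
    values = []
    for prefix in prefixes:
        name, sep, value = prefix.rstrip("/").partition("=")
        if sep and name == column:
            values.append(value)
    return values

# The latest partition falls out of the directory listing alone;
# no per-file scan of the 138K+ objects is required.
print(max(partition_values(prefixes, "file_date")))
```

This is essentially what the catalog-based workaround linked above does via the metastore: answer the query from partition metadata instead of file contents.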
[jira] [Updated] (SPARK-34344) Have functionality to trace back Spark SQL queries from the application ID that got submitted on YARN
[ https://issues.apache.org/jira/browse/SPARK-34344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpan Bhandari updated SPARK-34344: --- Component/s: SQL > Have functionality to trace back Spark SQL queries from the application ID > that got submitted on YARN > - > > Key: SPARK-34344 > URL: https://issues.apache.org/jira/browse/SPARK-34344 > Project: Spark > Issue Type: New Feature > Components: Spark Shell, SQL >Affects Versions: 1.6.3, 2.3.0, 2.4.5 >Reporter: Arpan Bhandari >Priority: Major > > We need the application ID from the resource manager mapped to the specific > Spark SQL query that was executed under that application ID, so that > back-tracing is possible. > For example, if I run a query using spark-shell: > spark.sql("select dt.d_year,item.i_brand_id brand_id,item.i_brand > brand,sum(ss_ext_sales_price) sum_agg from date_dim dt,store_sales,item where > dt.d_date_sk = store_sales.ss_sold_date_sk and store_sales.ss_item_sk = > item.i_item_sk and item.i_manufact_id = 436 and dt.d_moy=12 group by > dt.d_year,item.i_brand,item.i_brand_id order by dt.d_year,sum_agg > desc,brand_id limit 100").show(); > When I look at the event logs or the history server, I don't see the query > anywhere; the query plan is there, but it becomes difficult to trace back > what query actually got submitted (if I have to map it to a specific > application ID on YARN).
[jira] [Updated] (SPARK-34344) Have functionality to trace back Spark SQL queries from the application ID that got submitted on YARN
[ https://issues.apache.org/jira/browse/SPARK-34344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpan Bhandari updated SPARK-34344: --- Component/s: (was: Spark Submit) > Have functionality to trace back Spark SQL queries from the application ID > that got submitted on YARN > - > > Key: SPARK-34344 > URL: https://issues.apache.org/jira/browse/SPARK-34344 > Project: Spark > Issue Type: New Feature > Components: Spark Shell >Affects Versions: 1.6.3, 2.3.0, 2.4.5 >Reporter: Arpan Bhandari >Priority: Major > > We need the application ID from the resource manager mapped to the specific > Spark SQL query that was executed under that application ID, so that > back-tracing is possible. > For example, if I run a query using spark-shell: > spark.sql("select dt.d_year,item.i_brand_id brand_id,item.i_brand > brand,sum(ss_ext_sales_price) sum_agg from date_dim dt,store_sales,item where > dt.d_date_sk = store_sales.ss_sold_date_sk and store_sales.ss_item_sk = > item.i_item_sk and item.i_manufact_id = 436 and dt.d_moy=12 group by > dt.d_year,item.i_brand,item.i_brand_id order by dt.d_year,sum_agg > desc,brand_id limit 100").show(); > When I look at the event logs or the history server, I don't see the query > anywhere; the query plan is there, but it becomes difficult to trace back > what query actually got submitted (if I have to map it to a specific > application ID on YARN).
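What the requested feature amounts to can be illustrated with a small user-side shim: record each SQL string keyed by the YARN application ID before submitting it, so the text can be looked up later. This is a hypothetical sketch in plain Python; the class and its methods are not a Spark API.

```python
class QueryAuditLog:
    """Hypothetical in-process map from application ID to submitted SQL text."""

    def __init__(self):
        self._queries = {}

    def record(self, app_id, sql_text):
        # Remember the raw SQL under its application ID before running it.
        self._queries.setdefault(app_id, []).append(sql_text)

    def queries_for(self, app_id):
        # Trace back: every SQL string submitted under this application ID.
        return list(self._queries.get(app_id, []))


log = QueryAuditLog()
log.record("application_1612345678901_0042", "select d_year ... limit 100")
print(log.queries_for("application_1612345678901_0042"))
```

In a real job the ID would come from `spark.sparkContext.applicationId`; a coarser built-in option is `sc.setJobDescription(...)`, which at least attaches a human-readable label to the jobs shown in the UI and event logs.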
[jira] [Assigned] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules
[ https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34168: --- Assignee: Ke Jia > Support DPP in AQE When the join is Broadcast hash join before applying the > AQE rules > - > > Key: SPARK-34168 > URL: https://issues.apache.org/jira/browse/SPARK-34168 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > > Currently, AQE and DPP cannot be applied at the same time. This PR enables > AQE and DPP together when the join is already a broadcast hash join before > the AQE rules are applied.
[jira] [Resolved] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules
[ https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34168. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31258 [https://github.com/apache/spark/pull/31258] > Support DPP in AQE When the join is Broadcast hash join before applying the > AQE rules > - > > Key: SPARK-34168 > URL: https://issues.apache.org/jira/browse/SPARK-34168 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.2.0 > > > Currently, AQE and DPP cannot be applied at the same time. This PR enables > AQE and DPP together when the join is already a broadcast hash join before > the AQE rules are applied.
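For readers who want to try the combination, the two optimizations are governed by a pair of SQL conf keys. The key names below are real Spark SQL configuration entries; whether DPP actually fires under AQE still depends on the plan being a broadcast hash join, as the issue describes. A minimal sketch:

```python
# SQL conf keys controlling the two features discussed in this issue.
# Pass them via --conf on spark-submit, or SparkSession.builder.config(...).
confs = {
    "spark.sql.adaptive.enabled": "true",                           # AQE
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",  # DPP
}

for key, value in confs.items():
    print(f"--conf {key}={value}")
```

Note that the defaults vary by release (DPP has been on by default since 3.0; AQE only from 3.2), so setting both explicitly is the safest way to reproduce the behaviour.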
[jira] [Assigned] (SPARK-34403) Remove dependency to commons-httpclient, is not used and has vulnerabilities.
[ https://issues.apache.org/jira/browse/SPARK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34403: Assignee: (was: Apache Spark) > Remove dependency to commons-httpclient, is not used and has vulnerabilities. > - > > Key: SPARK-34403 > URL: https://issues.apache.org/jira/browse/SPARK-34403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Sergio Sainz >Priority: Major > > > {code:xml} > <dependency> > <groupId>commons-httpclient</groupId> > <artifactId>commons-httpclient</artifactId> > </dependency> > {code} > > It has the following vulnerabilities: > > CVE-2012-6153 > CVE-2012-5783 > > Also, after removing it, running `mvn compile test` under `spark/sql/hive` > still results in SUCCESS.
[jira] [Assigned] (SPARK-34403) Remove dependency to commons-httpclient, is not used and has vulnerabilities.
[ https://issues.apache.org/jira/browse/SPARK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34403: Assignee: Apache Spark > Remove dependency to commons-httpclient, is not used and has vulnerabilities. > - > > Key: SPARK-34403 > URL: https://issues.apache.org/jira/browse/SPARK-34403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Sergio Sainz >Assignee: Apache Spark >Priority: Major > > > {code:xml} > <dependency> > <groupId>commons-httpclient</groupId> > <artifactId>commons-httpclient</artifactId> > </dependency> > {code} > > It has the following vulnerabilities: > > CVE-2012-6153 > CVE-2012-5783 > > Also, after removing it, running `mvn compile test` under `spark/sql/hive` > still results in SUCCESS.
[jira] [Commented] (SPARK-34403) Remove dependency to commons-httpclient, is not used and has vulnerabilities.
[ https://issues.apache.org/jira/browse/SPARK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281180#comment-17281180 ] Apache Spark commented on SPARK-34403: -- User 'ssainz' has created a pull request for this issue: https://github.com/apache/spark/pull/31528 > Remove dependency to commons-httpclient, is not used and has vulnerabilities. > - > > Key: SPARK-34403 > URL: https://issues.apache.org/jira/browse/SPARK-34403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Sergio Sainz >Priority: Major > > > {code:xml} > <dependency> > <groupId>commons-httpclient</groupId> > <artifactId>commons-httpclient</artifactId> > </dependency> > {code} > > It has the following vulnerabilities: > > CVE-2012-6153 > CVE-2012-5783 > > Also, after removing it, running `mvn compile test` under `spark/sql/hive` > still results in SUCCESS.
[jira] [Commented] (SPARK-34403) Remove dependency to commons-httpclient, is not used and has vulnerabilities.
[ https://issues.apache.org/jira/browse/SPARK-34403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281179#comment-17281179 ] Sergio Sainz commented on SPARK-34403: -- PR: https://github.com/apache/spark/pull/31528 > Remove dependency to commons-httpclient, is not used and has vulnerabilities. > - > > Key: SPARK-34403 > URL: https://issues.apache.org/jira/browse/SPARK-34403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Sergio Sainz >Priority: Major > > > {code:xml} > <dependency> > <groupId>commons-httpclient</groupId> > <artifactId>commons-httpclient</artifactId> > </dependency> > {code} > > It has the following vulnerabilities: > > CVE-2012-6153 > CVE-2012-5783 > > Also, after removing it, running `mvn compile test` under `spark/sql/hive` > still results in SUCCESS.