[jira] [Assigned] (SPARK-45287) Add Java 21 benchmark result
[ https://issues.apache.org/jira/browse/SPARK-45287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45287: - Assignee: Dongjoon Hyun > Add Java 21 benchmark result > > > Key: SPARK-45287 > URL: https://issues.apache.org/jira/browse/SPARK-45287 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45287) Add Java 21 benchmark result
[ https://issues.apache.org/jira/browse/SPARK-45287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45287. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43065 [https://github.com/apache/spark/pull/43065] > Add Java 21 benchmark result > > > Key: SPARK-45287 > URL: https://issues.apache.org/jira/browse/SPARK-45287 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44119) Drop K8s v1.25 and lower version support
[ https://issues.apache.org/jira/browse/SPARK-44119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44119. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43069 [https://github.com/apache/spark/pull/43069] > Drop K8s v1.25 and lower version support > > > Key: SPARK-44119 > URL: https://issues.apache.org/jira/browse/SPARK-44119 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > *1. Default K8s Version in Public Cloud environments* > The default K8s versions of public cloud providers are already K8s 1.27+. > - EKS: v1.27 (Default) > - GKE: v1.27 (Stable), v1.27 (Regular), v1.27 (Rapid) > *2. End Of Support* > In addition, K8s 1.25 and older will reach EOL by the time Apache Spark > 4.0.0 arrives in June 2024. K8s 1.26 will also reach EOL in June. > || K8s || AKS || GKE || EKS || > | 1.27 | 2024-07 | 2024-08 | 2024-07 | > | 1.26 | 2024-03 | 2024-06 | 2024-06 | > | 1.25 | 2023-12 | 2024-02 | 2024-05 | > | 1.24 | 2023-07 | 2023-10 | 2024-01 | > - [AKS EOL > Schedule](https://docs.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar) > - [GKE EOL > Schedule](https://cloud.google.com/kubernetes-engine/docs/release-schedule) > - [EKS EOL > Schedule](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44119) Drop K8s v1.25 and lower version support
[ https://issues.apache.org/jira/browse/SPARK-44119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44119: - Assignee: Dongjoon Hyun > Drop K8s v1.25 and lower version support > > > Key: SPARK-44119 > URL: https://issues.apache.org/jira/browse/SPARK-44119 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > *1. Default K8s Version in Public Cloud environments* > The default K8s versions of public cloud providers are already K8s 1.27+. > - EKS: v1.27 (Default) > - GKE: v1.27 (Stable), v1.27 (Regular), v1.27 (Rapid) > *2. End Of Support* > In addition, K8s 1.25 and older will reach EOL by the time Apache Spark > 4.0.0 arrives in June 2024. K8s 1.26 will also reach EOL in June. > || K8s || AKS || GKE || EKS || > | 1.27 | 2024-07 | 2024-08 | 2024-07 | > | 1.26 | 2024-03 | 2024-06 | 2024-06 | > | 1.25 | 2023-12 | 2024-02 | 2024-05 | > | 1.24 | 2023-07 | 2023-10 | 2024-01 | > - [AKS EOL > Schedule](https://docs.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar) > - [GKE EOL > Schedule](https://cloud.google.com/kubernetes-engine/docs/release-schedule) > - [EKS EOL > Schedule](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44118) Support K8s scheduling gates
[ https://issues.apache.org/jira/browse/SPARK-44118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44118: - Assignee: (was: Dongjoon Hyun) > Support K8s scheduling gates > > > Key: SPARK-44118 > URL: https://issues.apache.org/jira/browse/SPARK-44118 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/ > - Kubernetes v1.26 [alpha] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44118) Support K8s scheduling gates
[ https://issues.apache.org/jira/browse/SPARK-44118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44118: - Assignee: Dongjoon Hyun > Support K8s scheduling gates > > > Key: SPARK-44118 > URL: https://issues.apache.org/jira/browse/SPARK-44118 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/ > - Kubernetes v1.26 [alpha] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44119) Drop K8s v1.25 and lower version support
[ https://issues.apache.org/jira/browse/SPARK-44119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44119: -- Description: *1. Default K8s Version in Public Cloud environments* The default K8s versions of public cloud providers are already K8s 1.27+. - EKS: v1.27 (Default) - GKE: v1.27 (Stable), v1.27 (Regular), v1.27 (Rapid) *2. End Of Support* In addition, K8s 1.25 and older will reach EOL by the time Apache Spark 4.0.0 arrives in June 2024. K8s 1.26 will also reach EOL in June. || K8s || AKS || GKE || EKS || | 1.27 | 2024-07 | 2024-08 | 2024-07 | | 1.26 | 2024-03 | 2024-06 | 2024-06 | | 1.25 | 2023-12 | 2024-02 | 2024-05 | | 1.24 | 2023-07 | 2023-10 | 2024-01 | - [AKS EOL Schedule](https://docs.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar) - [GKE EOL Schedule](https://cloud.google.com/kubernetes-engine/docs/release-schedule) - [EKS EOL Schedule](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar) was:EKS K8s v1.25 will reach the End-Of-Support on May 2024. > Drop K8s v1.25 and lower version support > > > Key: SPARK-44119 > URL: https://issues.apache.org/jira/browse/SPARK-44119 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > *1. Default K8s Version in Public Cloud environments* > The default K8s versions of public cloud providers are already K8s 1.27+. > - EKS: v1.27 (Default) > - GKE: v1.27 (Stable), v1.27 (Regular), v1.27 (Rapid) > *2. End Of Support* > In addition, K8s 1.25 and older will reach EOL by the time Apache Spark > 4.0.0 arrives in June 2024. K8s 1.26 will also reach EOL in June. 
> || K8s || AKS || GKE || EKS || > | 1.27 | 2024-07 | 2024-08 | 2024-07 | > | 1.26 | 2024-03 | 2024-06 | 2024-06 | > | 1.25 | 2023-12 | 2024-02 | 2024-05 | > | 1.24 | 2023-07 | 2023-10 | 2024-01 | > - [AKS EOL > Schedule](https://docs.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar) > - [GKE EOL > Schedule](https://cloud.google.com/kubernetes-engine/docs/release-schedule) > - [EKS EOL > Schedule](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45288) Remove outdated benchmark result files `jdk1[17]*results.txt`
[ https://issues.apache.org/jira/browse/SPARK-45288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-45288. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43066 [https://github.com/apache/spark/pull/43066] > Remove outdated benchmark result files `jdk1[17]*results.txt` > - > > Key: SPARK-45288 > URL: https://issues.apache.org/jira/browse/SPARK-45288 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45288) Remove outdated benchmark result files `jdk1[17]*results.txt`
[ https://issues.apache.org/jira/browse/SPARK-45288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-45288: Assignee: Dongjoon Hyun > Remove outdated benchmark result files `jdk1[17]*results.txt` > - > > Key: SPARK-45288 > URL: https://issues.apache.org/jira/browse/SPARK-45288 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44119) Drop K8s v1.25 and lower version support
[ https://issues.apache.org/jira/browse/SPARK-44119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44119: --- Labels: pull-request-available (was: ) > Drop K8s v1.25 and lower version support > > > Key: SPARK-44119 > URL: https://issues.apache.org/jira/browse/SPARK-44119 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > EKS K8s v1.25 will reach the End-Of-Support on May 2024. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45274) Implementation of a new DAG drawing approach to avoid fork
[ https://issues.apache.org/jira/browse/SPARK-45274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45274. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43053 [https://github.com/apache/spark/pull/43053] > Implementation of a new DAG drawing approach to avoid fork > --- > > Key: SPARK-45274 > URL: https://issues.apache.org/jira/browse/SPARK-45274 > Project: Spark > Issue Type: Improvement > Components: UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45274) Implementation of a new DAG drawing approach to avoid fork
[ https://issues.apache.org/jira/browse/SPARK-45274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45274: - Assignee: Kent Yao > Implementation of a new DAG drawing approach to avoid fork > --- > > Key: SPARK-45274 > URL: https://issues.apache.org/jira/browse/SPARK-45274 > Project: Spark > Issue Type: Improvement > Components: UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44550) Wrong semantics for null IN (empty list)
[ https://issues.apache.org/jira/browse/SPARK-44550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44550: --- Labels: pull-request-available (was: ) > Wrong semantics for null IN (empty list) > > > Key: SPARK-44550 > URL: https://issues.apache.org/jira/browse/SPARK-44550 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Labels: pull-request-available > > {{null IN (empty list)}} incorrectly evaluates to null, when it should > evaluate to false. (The reason it should be false is because a IN (b1, b2) is > defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR > which is false. This is specified by ANSI SQL.) > Many places in Spark execution (In, InSet, InSubquery) and optimization > (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that > the Spark behavior for the null IN (empty list) is inconsistent in some > places - literal IN lists generally return null (incorrect), while IN/NOT IN > subqueries mostly return false/true, respectively (correct) in this case. > This is a longstanding correctness issue which has existed since null support > for IN expressions was first added to Spark. > Doc with more details: > [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
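The ANSI definition quoted in the ticket above (a IN (b1, b2) is defined as a = b1 OR a = b2, and an empty OR is false) can be sketched in plain Python three-valued logic. This is an illustrative model of the ANSI rule, not Spark's actual implementation; None stands in for SQL NULL:

```python
def sql_in(a, in_list):
    """Model of ANSI SQL `a IN (b1, ..., bn)` under three-valued logic.

    None plays the role of SQL NULL; the result is True, False,
    or None (UNKNOWN).
    """
    result = False  # identity of OR: an empty IN list yields False
    for b in in_list:
        # NULL compared with anything is UNKNOWN (None)
        eq = None if a is None or b is None else (a == b)
        if eq is True:
            return True  # TRUE dominates OR
        if eq is None:
            result = None  # remember we saw an UNKNOWN
    return result

print(sql_in(None, []))      # False: empty IN list, even for a NULL left side
print(sql_in(None, [1, 2]))  # None (UNKNOWN)
print(sql_in(1, [1, None]))  # True
```

Under this model, null IN () is false while null IN (1, 2) is UNKNOWN, which is exactly the distinction the ticket says the literal-IN-list code paths got wrong.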
[jira] [Updated] (SPARK-42669) Short circuit local relation rpcs
[ https://issues.apache.org/jira/browse/SPARK-42669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42669: --- Labels: pull-request-available (was: ) > Short circuit local relation rpcs > - > > Key: SPARK-42669 > URL: https://issues.apache.org/jira/browse/SPARK-42669 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > Labels: pull-request-available > > Operations on LocalRelation can mostly be done locally (without sending > rpcs). We should leverage this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42830) Link skipped stages on Spark UI
[ https://issues.apache.org/jira/browse/SPARK-42830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42830: --- Labels: pull-request-available (was: ) > Link skipped stages on Spark UI > --- > > Key: SPARK-42830 > URL: https://issues.apache.org/jira/browse/SPARK-42830 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.3.2 >Reporter: Yian Liou >Priority: Major > Labels: pull-request-available > > Add a link to the skipped Spark stages so that it's easier to find the > execution details on the UI. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42890) Add Identifier to the InMemoryTableScan node on the SQL page
[ https://issues.apache.org/jira/browse/SPARK-42890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42890: --- Labels: pull-request-available (was: ) > Add Identifier to the InMemoryTableScan node on the SQL page > > > Key: SPARK-42890 > URL: https://issues.apache.org/jira/browse/SPARK-42890 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.3.2 >Reporter: Yian Liou >Priority: Major > Labels: pull-request-available > > On the SQL page in the Web UI, there is no distinction for which > InMemoryTableScan is being used at a specific point in the DAG. This Jira > aims to add a repeat identifier to distinguish which InMemoryTableScan is > being used at a certain location. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45057) Deadlock caused by rdd replication level of 2
[ https://issues.apache.org/jira/browse/SPARK-45057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45057: --- Labels: pull-request-available (was: ) > Deadlock caused by rdd replication level of 2 > - > > Key: SPARK-45057 > URL: https://issues.apache.org/jira/browse/SPARK-45057 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Zhongwei Zhu >Priority: Major > Labels: pull-request-available > > > When 2 tasks try to compute the same RDD with a replication level of 2 while running > on only 2 executors, a deadlock will happen. > A task only releases its write lock after writing the block to the local machine and replicating it to the > remote executor. > > ||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task > Thread T3)||Exe 2 (Shuffle Server Thread T4)|| > |T0|write lock of rdd| | | | > |T1| | |write lock of rdd| | > |T2|replicate -> UploadBlockSync (blocked by T4)| | | | > |T3| | | |Received UploadBlock request from T1 (blocked by T4)| > |T4| | |replicate -> UploadBlockSync (blocked by T2)| | > |T5| |Received UploadBlock request from T3 (blocked by T1)| | | > |T6|Deadlock|Deadlock|Deadlock|Deadlock| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
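The timeline table in the ticket above describes a circular wait: each of the four threads is blocked on a resource held by the next (reading the "(blocked by T4)" entry in T4's own column as a typo for T3, the holder of Exe 2's write lock). A cycle in the wait-for graph is the classic deadlock condition; a minimal sketch using the table's thread names (this models the reported scenario, it is not Spark code):

```python
# Wait-for edges as read from the table: "X": "Y" means X is blocked on Y.
waits_for = {
    "T1": "T4",  # Exe 1 task thread waits on Exe 2's shuffle server (UploadBlockSync)
    "T4": "T3",  # which waits on Exe 2's rdd write lock, held by T3
    "T3": "T2",  # which waits on Exe 1's shuffle server (UploadBlockSync)
    "T2": "T1",  # which waits on Exe 1's rdd write lock, held by T1
}

def find_cycle(graph, start):
    """Follow wait-for edges from `start`; return the cycle if a node repeats."""
    seen, node = [], start
    while node in graph and node not in seen:
        seen.append(node)
        node = graph[node]
    return seen if node in seen else None

print(find_cycle(waits_for, "T1"))  # all four threads form one cycle -> deadlock
```

Because every thread in the cycle holds something another one needs and none can proceed, releasing the write lock before replication (or breaking the cycle some other way) is what a fix has to achieve.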
[jira] [Assigned] (SPARK-45285) Remove deprecated `Runtime.getRuntime.exec(String)` API usage
[ https://issues.apache.org/jira/browse/SPARK-45285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45285: - Assignee: Dongjoon Hyun > Remove deprecated `Runtime.getRuntime.exec(String)` API usage > - > > Key: SPARK-45285 > URL: https://issues.apache.org/jira/browse/SPARK-45285 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45285) Remove deprecated `Runtime.getRuntime.exec(String)` API usage
[ https://issues.apache.org/jira/browse/SPARK-45285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45285. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43062 [https://github.com/apache/spark/pull/43062] > Remove deprecated `Runtime.getRuntime.exec(String)` API usage > - > > Key: SPARK-45285 > URL: https://issues.apache.org/jira/browse/SPARK-45285 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45284) Update SparkR minimum SystemRequirements to Java 17
[ https://issues.apache.org/jira/browse/SPARK-45284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45284: - Assignee: Dongjoon Hyun > Update SparkR minimum SystemRequirements to Java 17 > --- > > Key: SPARK-45284 > URL: https://issues.apache.org/jira/browse/SPARK-45284 > Project: Spark > Issue Type: Sub-task > Components: R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45284) Update SparkR minimum SystemRequirements to Java 17
[ https://issues.apache.org/jira/browse/SPARK-45284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45284. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43060 [https://github.com/apache/spark/pull/43060] > Update SparkR minimum SystemRequirements to Java 17 > --- > > Key: SPARK-45284 > URL: https://issues.apache.org/jira/browse/SPARK-45284 > Project: Spark > Issue Type: Sub-task > Components: R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45265) Support Hive 4.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-45265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45265: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Bug) > Support Hive 4.0 metastore > -- > > Key: SPARK-45265 > URL: https://issues.apache.org/jira/browse/SPARK-45265 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Labels: pull-request-available > > Although Hive 4.0.0 is still in beta, I would like to work on this as Hive 4.0.0 > will support the pushdown of partition column filters with > VARCHAR/CHAR types. > For details please see HIVE-26661: Support partition filter for char and > varchar types on Hive metastore -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45288) Remove outdated benchmark result files `jdk1[17]*results.txt`
[ https://issues.apache.org/jira/browse/SPARK-45288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45288: --- Labels: pull-request-available (was: ) > Remove outdated benchmark result files `jdk1[17]*results.txt` > - > > Key: SPARK-45288 > URL: https://issues.apache.org/jira/browse/SPARK-45288 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45288) Remove outdated benchmark result files `jdk1[17]*results.txt`
Dongjoon Hyun created SPARK-45288: - Summary: Remove outdated benchmark result files `jdk1[17]*results.txt` Key: SPARK-45288 URL: https://issues.apache.org/jira/browse/SPARK-45288 Project: Spark Issue Type: Test Components: Tests Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45287) Add Java 21 benchmark result
[ https://issues.apache.org/jira/browse/SPARK-45287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45287: --- Labels: pull-request-available (was: ) > Add Java 21 benchmark result > > > Key: SPARK-45287 > URL: https://issues.apache.org/jira/browse/SPARK-45287 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45287) Add Java 21 benchmark result
Dongjoon Hyun created SPARK-45287: - Summary: Add Java 21 benchmark result Key: SPARK-45287 URL: https://issues.apache.org/jira/browse/SPARK-45287 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45265) Support Hive 4.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-45265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45265: --- Labels: pull-request-available (was: ) > Support Hive 4.0 metastore > -- > > Key: SPARK-45265 > URL: https://issues.apache.org/jira/browse/SPARK-45265 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Labels: pull-request-available > > Although Hive 4.0.0 is still in beta, I would like to work on this as Hive 4.0.0 > will support the pushdown of partition column filters with > VARCHAR/CHAR types. > For details please see HIVE-26661: Support partition filter for char and > varchar types on Hive metastore -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE
[ https://issues.apache.org/jira/browse/SPARK-43288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43288: --- Labels: pull-request-available (was: ) > DataSourceV2: CREATE TABLE LIKE > --- > > Key: SPARK-43288 > URL: https://issues.apache.org/jira/browse/SPARK-43288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: John Zhuge >Priority: Major > Labels: pull-request-available > > Support CREATE TABLE LIKE in DSv2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39822) Provides a good error during create Index with different dtype elements
[ https://issues.apache.org/jira/browse/SPARK-39822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-39822: --- Labels: pull-request-available (was: ) > Provides a good error during create Index with different dtype elements > --- > > Key: SPARK-39822 > URL: https://issues.apache.org/jira/browse/SPARK-39822 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.2 >Reporter: bo zhao >Priority: Minor > Labels: pull-request-available > > PANDAS > > {code:java} > >>> import pandas as pd >>> pd.Index([1,2,'3',4]) Index([1, 2, '3', 4], > >>> dtype='object') >>> > {code} > PYSPARK > > > {code:java} > Using Python version 3.8.13 (default, Jun 29 2022 11:50:19) > Spark context Web UI available at http://172.25.179.45:4042 > Spark context available as 'sc' (master = local[*], app id = > local-1658301116572). > SparkSession available as 'spark'. > >>> from pyspark import pandas as ps > WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It > is required to set this environment variable to '1' in both driver and > executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you > but it does not work if there is a Spark context already launched. 
> >>> ps.Index([1,2,'3',4]) > Traceback (most recent call last): > File "", line 1, in > File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, > in __new__ > ps.from_pandas( > File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in > from_pandas > return DataFrame(pd.DataFrame(index=pobj)).index > File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in > __init__ > internal = InternalFrame.from_pandas(pdf) > File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in > from_pandas > ) = InternalFrame.prepare_pandas_frame(pdf, > prefer_timestamp_ntz=prefer_timestamp_ntz) > File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in > prepare_pandas_frame > spark_type = infer_pd_series_spark_type(reset_index[col], dtype, > prefer_timestamp_ntz) > File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line > 360, in infer_pd_series_spark_type > return from_arrow_type(pa.Array.from_pandas(pser).type, > prefer_timestamp_ntz) > File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas > File "pyarrow/array.pxi", line 312, in pyarrow.lib.array > File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array > File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to > convert to int64 > {code} > I understand that pyspark pandas needs the dtype to be the same, but we need a > good error message that tells the user how to avoid this. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
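The contrast in the ticket above is that pandas silently falls back to dtype='object' for mixed elements, while pandas-on-Spark must infer a single Spark type (via Arrow), which is where the opaque ArrowInvalid surfaces. A friendlier error could name the offending types up front; a plain-Python sketch of such a pre-check (the function name and message are hypothetical, not the actual PySpark fix):

```python
def check_homogeneous(values):
    """Raise a descriptive error if elements have mixed types (None ignored)."""
    types = {type(v) for v in values if v is not None}
    if len(types) > 1:
        names = sorted(t.__name__ for t in types)
        raise TypeError(
            f"Index elements must share one type for Spark type inference; "
            f"got mixed types {names}. Cast them explicitly, e.g. all to str."
        )
    return values

check_homogeneous([1, 2, 4])  # homogeneous: passes through unchanged
try:
    check_homogeneous([1, 2, '3', 4])
except TypeError as e:
    print(e)  # names the mixed types instead of an Arrow conversion error
```

The point of the sketch is only where the check runs: validating before handing the data to Arrow lets the message speak in the user's terms (mixed element types) rather than Arrow's (failed int64 conversion).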
[jira] [Updated] (SPARK-45286) Add back Matomo analytics to release docs
[ https://issues.apache.org/jira/browse/SPARK-45286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-45286: - Target Version/s: 3.4.2, 4.0.0, 3.5.1 (was: 4.0.0) > Add back Matomo analytics to release docs > - > > Key: SPARK-45286 > URL: https://issues.apache.org/jira/browse/SPARK-45286 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > Labels: pull-request-available > > We had previously removed Google Analytics from the website and release docs, > per ASF policy: https://github.com/apache/spark/pull/36310 > We just restored analytics using the ASF-hosted Matomo service on the website: > https://github.com/apache/spark-website/commit/a1548627b48a62c2e51870d1488ca3e09397bd30 > This change would put the same new tracking code back into the release docs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45286) Add back Matomo analytics to release docs
[ https://issues.apache.org/jira/browse/SPARK-45286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45286: --- Labels: pull-request-available (was: ) > Add back Matomo analytics to release docs > - > > Key: SPARK-45286 > URL: https://issues.apache.org/jira/browse/SPARK-45286 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > Labels: pull-request-available > > We had previously removed Google Analytics from the website and release docs, > per ASF policy: https://github.com/apache/spark/pull/36310 > We just restored analytics using the ASF-hosted Matomo service on the website: > https://github.com/apache/spark-website/commit/a1548627b48a62c2e51870d1488ca3e09397bd30 > This change would put the same new tracking code back into the release docs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45273) Http header Attack【HttpSecurityFilter】
[ https://issues.apache.org/jira/browse/SPARK-45273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768144#comment-17768144 ] Sean R. Owen commented on SPARK-45273: -- Yep we typically evaluate security reports on priv...@spark.apache.org first, not here > Http header Attack【HttpSecurityFilter】 > -- > > Key: SPARK-45273 > URL: https://issues.apache.org/jira/browse/SPARK-45273 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: chenyu >Priority: Major > > There is an HTTP host header attack vulnerability in the target URL -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45273) Http header Attack【HttpSecurityFilter】
[ https://issues.apache.org/jira/browse/SPARK-45273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768140#comment-17768140 ] Bjørn Jørgensen commented on SPARK-45273: - Hi, [~chenyu-opensource] can you take this on mail to secur...@spark.apache.org CC [~srowen] > Http header Attack【HttpSecurityFilter】 > -- > > Key: SPARK-45273 > URL: https://issues.apache.org/jira/browse/SPARK-45273 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: chenyu >Priority: Major > > There is an HTTP host header attack vulnerability in the target URL -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45286) Add back Matomo analytics to release docs
Sean R. Owen created SPARK-45286: Summary: Add back Matomo analytics to release docs Key: SPARK-45286 URL: https://issues.apache.org/jira/browse/SPARK-45286 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Sean R. Owen Assignee: Sean R. Owen We had previously removed Google Analytics from the website and release docs, per ASF policy: https://github.com/apache/spark/pull/36310 We just restored analytics using the ASF-hosted Matomo service on the website: https://github.com/apache/spark-website/commit/a1548627b48a62c2e51870d1488ca3e09397bd30 This change would put the same new tracking code back into the release docs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768123#comment-17768123 ] koert kuipers edited comment on SPARK-45282 at 9/22/23 7:04 PM: after reverting SPARK-41048 the issue went away. was (Author: koert): after reverting SPARK-41048 the issue went away. so i think this is the cause. > Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or > databricks 13.3 >Reporter: koert kuipers >Priority: Major > Labels: CorrectnessBug, correctness > > we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is > not present on spark 3.3.1. > it only shows up in distributed environment. i cannot replicate in unit test. > however i did get it to show up on hadoop cluster, kubernetes, and on > databricks 13.3 > the issue is that records are dropped when two cached dataframes are joined. > it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an > optimization while in spark 3.3.1 these Exhanges are still present. it seems > to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. > to reproduce on distributed cluster these settings needed are: > {code:java} > spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 > spark.sql.adaptive.coalescePartitions.parallelismFirst false > spark.sql.adaptive.enabled true > spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} > code using scala to reproduce is: > {code:java} > import java.util.UUID > import org.apache.spark.sql.functions.col > import spark.implicits._ > val data = (1 to 100).toDS().map(i => > UUID.randomUUID().toString).persist() > val left = data.map(k => (k, 1)) > val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! 
> println("number of left " + left.count()) > println("number of right " + right.count()) > println("number of (left join right) " + > left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count() > ) > val left1 = left > .toDF("key", "value1") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of left1 " + left1.count()) > val right1 = right > .toDF("key", "value2") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of right1 " + right1.count()) > println("number of (left1 join right1) " + left1.join(right1, > "key").count()) // this gives incorrect result{code} > this produces the following output: > {code:java} > number of left 100 > number of right 100 > number of (left join right) 100 > number of left1 100 > number of right1 100 > number of (left1 join right1) 859531 {code} > note that the last number (the incorrect one) actually varies depending on > settings and cluster size etc. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
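[Editor's note] The mechanism the report describes — removing a shuffle Exchange while the two cached sides are not actually co-partitioned — can be simulated without Spark. A toy sketch (pure Python, all names hypothetical, not Spark internals): a partition-wise join that pairs partition i of one side with partition i of the other silently loses matches whenever the two sides were partitioned differently.

```python
def partition(rows, num_parts, hash_fn):
    """Hash-partition (key, value) rows into num_parts buckets."""
    parts = [[] for _ in range(num_parts)]
    for key, val in rows:
        parts[hash_fn(key) % num_parts].append((key, val))
    return parts

def zip_join(parts_a, parts_b):
    """Join partition i of A only against partition i of B — what a plan
    that has dropped its Exchange effectively assumes it can do."""
    out = []
    for pa, pb in zip(parts_a, parts_b):
        lookup = dict(pb)
        out += [(k, v, lookup[k]) for k, v in pa if k in lookup]
    return out

rows = [(f"k{i}", i) for i in range(100)]
# Same partitioner on both sides: co-partitioned, join is correct.
same = zip_join(partition(rows, 4, hash), partition(rows, 4, hash))
# Mismatched partitioner on the right side: every key lands one bucket over,
# so the partition-wise join finds no matches at all.
diff = zip_join(partition(rows, 4, hash),
                partition(rows, 4, lambda k: hash(k) + 1))
print(len(same), len(diff))  # 100 0
```

This is why the Exchange elision is only safe when the optimizer's belief about the cached plan's output partitioning matches what was actually materialized.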
[jira] [Updated] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-45282: -- Description: we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in distributed environment. i cannot replicate in unit test. however i did get it to show up on hadoop cluster, kubernetes, and on databricks 13.3 the issue is that records are dropped when two cached dataframes are joined. it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an optimization while in spark 3.3.1 these Exhanges are still present. it seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. to reproduce on distributed cluster these settings needed are: {code:java} spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 spark.sql.adaptive.coalescePartitions.parallelismFirst false spark.sql.adaptive.enabled true spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} code using scala to reproduce is: {code:java} import java.util.UUID import org.apache.spark.sql.functions.col import spark.implicits._ val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist() val left = data.map(k => (k, 1)) val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! 
println("number of left " + left.count()) println("number of right " + right.count()) println("number of (left join right) " + left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count() ) val left1 = left .toDF("key", "value1") .repartition(col("key")) // comment out this line to make it work .persist() println("number of left1 " + left1.count()) val right1 = right .toDF("key", "value2") .repartition(col("key")) // comment out this line to make it work .persist() println("number of right1 " + right1.count()) println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result{code} this produces the following output: {code:java} number of left 100 number of right 100 number of (left join right) 100 number of left1 100 number of right1 100 number of (left1 join right1) 859531 {code} note that the last number (the incorrect one) actually varies depending on settings and cluster size etc. was: we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in distributed environment. i cannot replicate in unit test. however i did get it to show up on hadoop cluster, kubernetes, and on databricks 13.3 the issue is that records are dropped when two cached dataframes are joined. it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an optimization while in spark 3.3.1 these Exhanges are still present. it seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. 
to reproduce on distributed cluster these settings needed are: {code:java} spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 spark.sql.adaptive.coalescePartitions.parallelismFirst false spark.sql.adaptive.enabled true spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} code using scala to reproduce is: {code:java} import java.util.UUID import org.apache.spark.sql.functions.col import spark.implicits._ val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist() val left = data.map(k => (k, 1)) val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! println("number of left " + left.count()) println("number of right " + right.count()) println("number of (left join right) " + left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count() ) val left1 = left .toDF("key", "vertex") .repartition(col("key")) // comment out this line to make it work .persist() println("number of left1 " + left1.count()) val right1 = right .toDF("key", "state") .repartition(col("key")) // comment out this line to make it work .persist() println("number of right1 " + right1.count()) println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result{code} this produces the following output: {code:java} number of left 100 number of right 100 number of (left join right) 100 number of left1 100 number of right1 100 number of (left1 join right1) 859531 {code} note that the last number (the incorrect one) actually varies depending on settings and cluster size etc. > Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: spark 3.4.1 on apache hadoop 3.3.6 or
[jira] [Updated] (SPARK-45285) Remove deprecated `Runtime.getRuntime.exec(String)` API usage
[ https://issues.apache.org/jira/browse/SPARK-45285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45285: --- Labels: pull-request-available (was: ) > Remove deprecated `Runtime.getRuntime.exec(String)` API usage > - > > Key: SPARK-45285 > URL: https://issues.apache.org/jira/browse/SPARK-45285 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45285) Remove deprecated `Runtime.getRuntime.exec(String)` API usage
Dongjoon Hyun created SPARK-45285: - Summary: Remove deprecated `Runtime.getRuntime.exec(String)` API usage Key: SPARK-45285 URL: https://issues.apache.org/jira/browse/SPARK-45285 Project: Spark Issue Type: Improvement Components: Spark Core, Tests Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
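[Editor's note] For context on the cleanup above: `Runtime.exec(String)` splits the command with a naive `StringTokenizer` (quoting is not respected) and the JDK deprecated those overloads in favor of `exec(String[])`, where the caller builds the argument vector explicitly. Python's `subprocess` has the same split between a shell string and an argument list; a small illustration of the principle (not Spark code):

```python
import shlex
import subprocess

# Fragile: naive whitespace splitting treats quotes as literal characters,
# which is essentially what Runtime.exec(String) does in Java.
naive = "echo 'hello world'".split()
# -> ["echo", "'hello", "world'"] — the quoted argument is mangled

# Robust: pass an explicit argument vector (the exec(String[]) analogue),
# or use a real tokenizer like shlex.split for strings you must accept.
good = subprocess.run(["echo", "hello world"], capture_output=True, text=True)
also_good = shlex.split("echo 'hello world'")   # -> ["echo", "hello world"]
print(good.stdout)
```

Building the argument array explicitly removes the ambiguity, which is why the deprecated single-string overload is worth purging from the codebase and tests.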
[jira] [Updated] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-45282: -- Description: we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in distributed environment. i cannot replicate in unit test. however i did get it to show up on hadoop cluster, kubernetes, and on databricks 13.3 the issue is that records are dropped when two cached dataframes are joined. it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an optimization while in spark 3.3.1 these Exhanges are still present. it seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. to reproduce on distributed cluster these settings needed are: {code:java} spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 spark.sql.adaptive.coalescePartitions.parallelismFirst false spark.sql.adaptive.enabled true spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} code using scala to reproduce is: {code:java} import java.util.UUID import org.apache.spark.sql.functions.col import spark.implicits._ val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist() val left = data.map(k => (k, 1)) val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! 
println("number of left " + left.count()) println("number of right " + right.count()) println("number of (left join right) " + left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count() ) val left1 = left .toDF("key", "vertex") .repartition(col("key")) // comment out this line to make it work .persist() println("number of left1 " + left1.count()) val right1 = right .toDF("key", "state") .repartition(col("key")) // comment out this line to make it work .persist() println("number of right1 " + right1.count()) println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result{code} this produces the following output: {code:java} number of left 100 number of right 100 number of (left join right) 100 number of left1 100 number of right1 100 number of (left1 join right1) 859531 {code} note that the last number (the incorrect one) actually varies depending on settings and cluster size etc. was: we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in distributed environment. i cannot replicate in unit test. however i did get it to show up on hadoop cluster, kubernetes, and on databricks 13.3 the issue is that records are dropped when two cached dataframes are joined. it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an optimization while in spark 3.3.1 these Exhanges are still present. it seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. 
to reproduce on distributed cluster these settings needed are: {code:java} spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 spark.sql.adaptive.coalescePartitions.parallelismFirst false spark.sql.adaptive.enabled true spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} code using scala 2.13 to reproduce is: {code:java} import java.util.UUID import org.apache.spark.sql.functions.col import spark.implicits._ val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist() val left = data.map(k => (k, 1)) val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! println("number of left " + left.count()) println("number of right " + right.count()) println("number of (left join right) " + left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count() ) val left1 = left .toDF("key", "vertex") .repartition(col("key")) // comment out this line to make it work .persist() println("number of left1 " + left1.count()) val right1 = right .toDF("key", "state") .repartition(col("key")) // comment out this line to make it work .persist() println("number of right1 " + right1.count()) println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result{code} this produces the following output: {code:java} number of left 100 number of right 100 number of (left join right) 100 number of left1 100 number of right1 100 number of (left1 join right1) 859531 {code} note that the last number (the incorrect one) actually varies depending on settings and cluster size etc. > Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: spark 3.4.1 on apache hadoop 3.3.6 or
[jira] [Commented] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768123#comment-17768123 ] koert kuipers commented on SPARK-45282: --- after reverting SPARK-41048 the issue went away. so i think this is the cause. > Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or > databricks 13.3 >Reporter: koert kuipers >Priority: Major > Labels: CorrectnessBug, correctness > > we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is > not present on spark 3.3.1. > it only shows up in distributed environment. i cannot replicate in unit test. > however i did get it to show up on hadoop cluster, kubernetes, and on > databricks 13.3 > the issue is that records are dropped when two cached dataframes are joined. > it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an > optimization while in spark 3.3.1 these Exhanges are still present. it seems > to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. > to reproduce on distributed cluster these settings needed are: > {code:java} > spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 > spark.sql.adaptive.coalescePartitions.parallelismFirst false > spark.sql.adaptive.enabled true > spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} > code using scala 2.13 to reproduce is: > {code:java} > import java.util.UUID > import org.apache.spark.sql.functions.col > import spark.implicits._ > val data = (1 to 100).toDS().map(i => > UUID.randomUUID().toString).persist() > val left = data.map(k => (k, 1)) > val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! 
> println("number of left " + left.count()) > println("number of right " + right.count()) > println("number of (left join right) " + > left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count() > ) > val left1 = left > .toDF("key", "vertex") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of left1 " + left1.count()) > val right1 = right > .toDF("key", "state") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of right1 " + right1.count()) > println("number of (left1 join right1) " + left1.join(right1, > "key").count()) // this gives incorrect result{code} > this produces the following output: > {code:java} > number of left 100 > number of right 100 > number of (left join right) 100 > number of left1 100 > number of right1 100 > number of (left1 join right1) 859531 {code} > note that the last number (the incorrect one) actually varies depending on > settings and cluster size etc. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42768) Enable cached plan apply AQE by default
[ https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42768: --- Labels: pull-request-available (was: ) > Enable cached plan apply AQE by default > --- > > Key: SPARK-42768 > URL: https://issues.apache.org/jira/browse/SPARK-42768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45284) Update SparkR minimum SystemRequirements to Java 17
[ https://issues.apache.org/jira/browse/SPARK-45284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45284: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Improvement) > Update SparkR minimum SystemRequirements to Java 17 > --- > > Key: SPARK-45284 > URL: https://issues.apache.org/jira/browse/SPARK-45284 > Project: Spark > Issue Type: Sub-task > Components: R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
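[Editor's note] For readers unfamiliar with R packaging: the `SystemRequirements` field lives in the package's `DESCRIPTION` file and is free-text metadata that CRAN and users read to learn about non-R dependencies. The change above amounts to a one-line edit along these lines (illustrative fragment; the exact surrounding fields and prior version bound are assumptions, not copied from SparkR's actual file):

```
Package: SparkR
Title: R Front End for Apache Spark
SystemRequirements: Java (>= 17)
```

Because the field is advisory metadata rather than an enforced constraint, SparkR's own runtime Java-version check still has to agree with it.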
[jira] [Updated] (SPARK-45284) Update SparkR minimum SystemRequirements to Java 17
[ https://issues.apache.org/jira/browse/SPARK-45284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45284: --- Labels: pull-request-available (was: ) > Update SparkR minimum SystemRequirements to Java 17 > --- > > Key: SPARK-45284 > URL: https://issues.apache.org/jira/browse/SPARK-45284 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45077) Upgrade dagre-d3.js from 0.4.3 to 0.6.4
[ https://issues.apache.org/jira/browse/SPARK-45077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45077: --- Labels: pull-request-available (was: ) > Upgrade dagre-d3.js from 0.4.3 to 0.6.4 > --- > > Key: SPARK-45077 > URL: https://issues.apache.org/jira/browse/SPARK-45077 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45284) Update SparkR minimum +SystemRequirements to Java 17
Dongjoon Hyun created SPARK-45284: - Summary: Update SparkR minimum +SystemRequirements to Java 17 Key: SPARK-45284 URL: https://issues.apache.org/jira/browse/SPARK-45284 Project: Spark Issue Type: Improvement Components: R Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45284) Update SparkR minimum SystemRequirements to Java 17
[ https://issues.apache.org/jira/browse/SPARK-45284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45284: -- Summary: Update SparkR minimum SystemRequirements to Java 17 (was: Update SparkR minimum +SystemRequirements to Java 17) > Update SparkR minimum SystemRequirements to Java 17 > --- > > Key: SPARK-45284 > URL: https://issues.apache.org/jira/browse/SPARK-45284 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-45282: -- Description: we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in distributed environment. i cannot replicate in unit test. however i did get it to show up on hadoop cluster, kubernetes, and on databricks 13.3 the issue is that records are dropped when two cached dataframes are joined. it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an optimization while in spark 3.3.1 these Exhanges are still present. it seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. to reproduce on distributed cluster these settings needed are: {code:java} spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 spark.sql.adaptive.coalescePartitions.parallelismFirst false spark.sql.adaptive.enabled true spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} code using scala 2.13 to reproduce is: {code:java} import java.util.UUID import org.apache.spark.sql.functions.col import spark.implicits._ val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist() val left = data.map(k => (k, 1)) val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! 
println("number of left " + left.count()) println("number of right " + right.count()) println("number of (left join right) " + left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count() ) val left1 = left .toDF("key", "vertex") .repartition(col("key")) // comment out this line to make it work .persist() println("number of left1 " + left1.count()) val right1 = right .toDF("key", "state") .repartition(col("key")) // comment out this line to make it work .persist() println("number of right1 " + right1.count()) println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result{code} this produces the following output: {code:java} number of left 100 number of right 100 number of (left join right) 100 number of left1 100 number of right1 100 number of (left1 join right1) 859531 {code} note that the last number (the incorrect one) actually varies depending on settings and cluster size etc. was: we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in distributed environment. i cannot replicate in unit test. however i did get it to show up on hadoop cluster, kubernetes, and on databricks 13.3 the issue is that records are dropped when two cached dataframes are joined. it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an optimization while in spark 3.3.1 these Exhanges are still present. it seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. 
to reproduce on distributed cluster these settings needed are: {code:java} spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 spark.sql.adaptive.coalescePartitions.parallelismFirst false spark.sql.adaptive.enabled true spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} code to reproduce is: {code:java} import java.util.UUID import org.apache.spark.sql.functions.col import spark.implicits._ val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist() val left = data.map(k => (k, 1)) val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! println("number of left " + left.count()) println("number of right " + right.count()) println("number of (left join right) " + left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count() ) val left1 = left .toDF("key", "vertex") .repartition(col("key")) // comment out this line to make it work .persist() println("number of left1 " + left1.count()) val right1 = right .toDF("key", "state") .repartition(col("key")) // comment out this line to make it work .persist() println("number of right1 " + right1.count()) println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result{code} this produces the following output: {code:java} number of left 100 number of right 100 number of (left join right) 100 number of left1 100 number of right1 100 number of (left1 join right1) 859531 {code} note that the last number (the incorrect one) actually varies depending on settings and cluster size etc. > Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes
[jira] [Updated] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-45282: -- Description: we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in a distributed environment; i cannot replicate it in a unit test. however, i did get it to show up on a hadoop cluster, on kubernetes, and on databricks 13.3. the issue is that records are dropped when two cached dataframes are joined. it seems that in spark 3.4.1 some Exchanges are dropped from the query plan as an optimization, while in spark 3.3.1 these Exchanges are still present. it seems to be an issue with AQE when canChangeCachedPlanOutputPartitioning=true. to reproduce on a distributed cluster, these settings are needed:
{code:java}
spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432
spark.sql.adaptive.coalescePartitions.parallelismFirst false
spark.sql.adaptive.enabled true
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true
{code}
code to reproduce:
{code:java}
import java.util.UUID
import org.apache.spark.sql.functions.col
import spark.implicits._

val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist()
val left = data.map(k => (k, 1))
val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works!

println("number of left " + left.count())
println("number of right " + right.count())
println("number of (left join right) " + left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count())

val left1 = left
  .toDF("key", "vertex")
  .repartition(col("key")) // comment out this line to make it work
  .persist()
println("number of left1 " + left1.count())

val right1 = right
  .toDF("key", "state")
  .repartition(col("key")) // comment out this line to make it work
  .persist()
println("number of right1 " + right1.count())

println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result
{code}
this produces the following output:
{code:java}
number of left 100
number of right 100
number of (left join right) 100
number of left1 100
number of right1 100
number of (left1 join right1) 859531
{code}
note that the last number (the incorrect one) varies depending on settings, cluster size, etc.

was: we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in distributed environment. i cannot replicate in unit test. however i did get it to show up on hadoop cluster, kubernetes, and on databricks 13.3 the issue is that records are dropped when two cached dataframes are joined. it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an optimization while in spark 3.3.1 these Exchanges are still present. it seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. to reproduce on distributed cluster these settings needed are: {code:java} spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 spark.sql.adaptive.coalescePartitions.parallelismFirst false spark.sql.adaptive.enabled true spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} code to reproduce is: {code:java} import java.util.UUID import org.apache.spark.sql.functions.col import spark.implicits._ val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist() val left = data.map(k => (k, 1)) val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! println("number of left " + left.count()) println("number of right " + right.count()) println("number of (left join right) " + left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count() ) val left1 = left .toDF("key", "vertex") .repartition(col("key")) // comment out this line to make it work .persist() println("number of left1 " + left1.count()) val right1 = right .toDF("key", "state") .repartition(col("key")) // comment out this line to make it work .persist() println("number of right1 " + right1.count()) println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result{code}

> Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or > databricks 13.3 >Reporter: koert kuipers >Priority: Major > Labels: CorrectnessBug, correctness > > we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is > not present on spark 3.3.1. > it only shows up in distributed environment. i cannot replicate in unit test. > however
[jira] [Created] (SPARK-45283) Make StatusTrackerSuite less fragile
Bo Xiong created SPARK-45283: Summary: Make StatusTrackerSuite less fragile Key: SPARK-45283 URL: https://issues.apache.org/jira/browse/SPARK-45283 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 3.5.0, 4.0.0 Reporter: Bo Xiong It was discovered from [Github Actions|https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767] that StatusTrackerSuite can run into random failures because FutureAction.jobIds is not a sorted sequence (by design), as shown in the following stack trace (highlighted in red). The proposed fix is to update the unit test to remove the nondeterministic behavior.
{quote}
[info] StatusTrackerSuite:
[info] - basic status API usage (99 milliseconds)
[info] - getJobIdsForGroup() (56 milliseconds)
[info] - getJobIdsForGroup() with takeAsync() (48 milliseconds)
[info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 milliseconds)
[info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds)
{color:#FF}[info] The code passed to eventually never returned normally. Attempted 651 times over 10.00505994401 seconds. Last failure message: Set(3, 2, 1) was not equal to Set(1, 2).
(StatusTrackerSuite.scala:148){color} [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219) [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226) [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:348) [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:347) [info] at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457) [info] at org.apache.spark.StatusTrackerSuite.$anonfun$new$21(StatusTrackerSuite.scala:148) [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127) [info] at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282) [info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231) [info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230) [info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69) [info] at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) [info] at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:333) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) [info] at org.scalatest.Suite.run(Suite.scala:1114) [info] at org.scalatest.Suite.run$(Suite.scala:1096) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) [info] at
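The proposed fix — asserting on job IDs order-insensitively instead of as an ordered sequence — can be sketched as follows. This is an illustrative Java model of the flakiness, not the actual Scala test code; `observed` stands in for a hypothetical arrival order of `FutureAction.jobIds`:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class JobIdCompare {
    public static void main(String[] args) {
        // jobIds is not a sorted sequence by design, so asserting on it as an
        // ordered collection depends on scheduling and is flaky.
        List<Integer> observed = Arrays.asList(3, 1, 2); // hypothetical arrival order
        List<Integer> expected = Arrays.asList(1, 2, 3);
        // Order-sensitive comparison can fail nondeterministically:
        System.out.println(observed.equals(expected)); // false here
        // Order-insensitive comparison is deterministic:
        System.out.println(new HashSet<>(observed).equals(new HashSet<>(expected))); // true
    }
}
```

Comparing as sets (or sorting before comparing) removes the dependence on job-submission order without weakening what the test checks.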
[jira] [Updated] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-45282: -- Description: we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in a distributed environment; i cannot replicate it in a unit test. however, i did get it to show up on a hadoop cluster, on kubernetes, and on databricks 13.3. the issue is that records are dropped when two cached dataframes are joined. it seems that in spark 3.4.1 some Exchanges are dropped from the query plan as an optimization, while in spark 3.3.1 these Exchanges are still present. it seems to be an issue with AQE when canChangeCachedPlanOutputPartitioning=true. to reproduce on a distributed cluster, these settings are needed:
{code:java}
spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432
spark.sql.adaptive.coalescePartitions.parallelismFirst false
spark.sql.adaptive.enabled true
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true
{code}
code to reproduce:
{code:java}
import java.util.UUID
import org.apache.spark.sql.functions.col
import spark.implicits._

val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist()
val left = data.map(k => (k, 1))
val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works!

println("number of left " + left.count())
println("number of right " + right.count())
println("number of (left join right) " + left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count())

val left1 = left
  .toDF("key", "vertex")
  .repartition(col("key")) // comment out this line to make it work
  .persist()
println("number of left1 " + left1.count())

val right1 = right
  .toDF("key", "state")
  .repartition(col("key")) // comment out this line to make it work
  .persist()
println("number of right1 " + right1.count())

println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result
{code}

was: we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in distributed environment. i cannot replicate in unit test. however i did get it to show up on hadoop cluster, kubernetes, and on databricks 13.3 the issue is that records are dropped when two cached dataframes are joined. it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an optimization while in spark 3.3.1 these Exchanges are still present. it seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. to reproduce on distributed cluster these settings needed are: {code:java} spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 spark.sql.adaptive.coalescePartitions.parallelismFirst false spark.sql.adaptive.enabled true spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} code to reproduce is: {code:java} import java.util.UUID import org.apache.spark.sql.functions.col import spark.implicits._ val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist() val left = data.map(k => (k, 1)) val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! println("number of left " + left.count()) println("number of right " + right.count()) println("number of (left join right) " + left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count() ) val left1 = left .toDF("key", "vertex") .repartition(col("key")) // comment out this line to make it work .persist() println("number of left1 " + left1.count()) val right1 = right .toDF("key", "state") .repartition(col("key")) // comment out this line to make it work .persist() println("number of right1 " + right1.count()) println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result{code}

> Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or > databricks 13.3 >Reporter: koert kuipers >Priority: Major > Labels: CorrectnessBug, correctness > > we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is > not present on spark 3.3.1. > it only shows up in distributed environment. i cannot replicate in unit test. > however i did get it to show up on hadoop cluster, kubernetes, and on > databricks 13.3 > the issue is that records are dropped when two cached dataframes are joined. > it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an > optimization while in spark 3.3.1 these Exchanges are still present. it seems > to
[jira] [Created] (SPARK-45282) Join loses records for cached datasets
koert kuipers created SPARK-45282: - Summary: Join loses records for cached datasets Key: SPARK-45282 URL: https://issues.apache.org/jira/browse/SPARK-45282 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.4.1 Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or databricks 13.3 Reporter: koert kuipers we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is not present on spark 3.3.1. it only shows up in a distributed environment; i cannot replicate it in a unit test. however, i did get it to show up on a hadoop cluster, on kubernetes, and on databricks 13.3. the issue is that records are dropped when two cached dataframes are joined. it seems that in spark 3.4.1 some Exchanges are dropped from the query plan as an optimization, while in spark 3.3.1 these Exchanges are still present. it seems to be an issue with AQE when canChangeCachedPlanOutputPartitioning=true. to reproduce on a distributed cluster, these settings are needed:
{code:java}
spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432
spark.sql.adaptive.coalescePartitions.parallelismFirst false
spark.sql.adaptive.enabled true
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true
{code}
code to reproduce:
{code:java}
import java.util.UUID
import org.apache.spark.sql.functions.col
import spark.implicits._

val data = (1 to 100).toDS().map(i => UUID.randomUUID().toString).persist()
val left = data.map(k => (k, 1))
val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works!

println("number of left " + left.count())
println("number of right " + right.count())
println("number of (left join right) " + left.toDF("key", "vertex").join(right.toDF("key", "state"), "key").count())

val left1 = left
  .toDF("key", "vertex")
  .repartition(col("key")) // comment out this line to make it work
  .persist()
println("number of left1 " + left1.count())

val right1 = right
  .toDF("key", "state")
  .repartition(col("key")) // comment out this line to make it work
  .persist()
println("number of right1 " + right1.count())

println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45281) Update BenchmarkBase to use Java 17 as the base version
[ https://issues.apache.org/jira/browse/SPARK-45281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45281. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43059 [https://github.com/apache/spark/pull/43059] > Update BenchmarkBase to use Java 17 as the base version > --- > > Key: SPARK-45281 > URL: https://issues.apache.org/jira/browse/SPARK-45281 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45281) Update BenchmarkBase to use Java 17 as the base version
[ https://issues.apache.org/jira/browse/SPARK-45281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45281: - Assignee: Dongjoon Hyun > Update BenchmarkBase to use Java 17 as the base version > --- > > Key: SPARK-45281 > URL: https://issues.apache.org/jira/browse/SPARK-45281 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45277) Install Java 17 for Windows SparkR test
[ https://issues.apache.org/jira/browse/SPARK-45277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45277: - Assignee: Yang Jie > Install Java 17 for Windows SparkR test > --- > > Key: SPARK-45277 > URL: https://issues.apache.org/jira/browse/SPARK-45277 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45277) Install Java 17 for Windows SparkR test
[ https://issues.apache.org/jira/browse/SPARK-45277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45277. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43056 [https://github.com/apache/spark/pull/43056] > Install Java 17 for Windows SparkR test > --- > > Key: SPARK-45277 > URL: https://issues.apache.org/jira/browse/SPARK-45277 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45281) Update BenchmarkBase to use Java 17 as the base version
[ https://issues.apache.org/jira/browse/SPARK-45281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45281: --- Labels: pull-request-available (was: ) > Update BenchmarkBase to use Java 17 as the base version > --- > > Key: SPARK-45281 > URL: https://issues.apache.org/jira/browse/SPARK-45281 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45281) Update BenchmarkBase to use Java 17 as the base version
Dongjoon Hyun created SPARK-45281: - Summary: Update BenchmarkBase to use Java 17 as the base version Key: SPARK-45281 URL: https://issues.apache.org/jira/browse/SPARK-45281 Project: Spark Issue Type: Improvement Components: Spark Core, Tests Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45256) Arrow DurationWriter fails when vector is at capacity
[ https://issues.apache.org/jira/browse/SPARK-45256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45256. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43035 [https://github.com/apache/spark/pull/43035] > Arrow DurationWriter fails when vector is at capacity > - > > Key: SPARK-45256 > URL: https://issues.apache.org/jira/browse/SPARK-45256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1 >Reporter: Sander Goos >Assignee: Sander Goos >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > The DurationWriter fails if more values are written than the initial capacity > of the DurationVector (4032). Fix by using `setSafe` instead of `set` method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
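The `set` vs `setSafe` distinction behind this fix can be illustrated with a toy model. This is not the Arrow API — `GrowableLongVector` is a hypothetical stand-in — but it mirrors the contract: `set` assumes capacity was already allocated, while `setSafe` grows the backing buffer before writing, which is why DurationWriter failed once it wrote past the vector's initial capacity (4032):

```java
import java.util.Arrays;

// Toy model (not Arrow code) of a value vector's set()/setSafe() contract.
class GrowableLongVector {
    private long[] data;

    GrowableLongVector(int initialCapacity) {
        data = new long[initialCapacity];
    }

    void set(int index, long value) {
        data[index] = value; // throws past capacity, like Arrow's set()
    }

    void setSafe(int index, long value) {
        if (index >= data.length) { // grow first, like Arrow's setSafe()
            data = Arrays.copyOf(data, Math.max(index + 1, data.length * 2));
        }
        data[index] = value;
    }

    long get(int index) {
        return data[index];
    }
}

public class SetSafeDemo {
    public static void main(String[] args) {
        GrowableLongVector v = new GrowableLongVector(4);
        v.setSafe(100, 42L); // works: the buffer grows on demand
        System.out.println(v.get(100)); // 42
        try {
            new GrowableLongVector(4).set(100, 42L);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("set() past capacity fails");
        }
    }
}
```

Switching the writer to the safe variant trades a capacity check (and occasional reallocation) per write for correctness beyond the initial allocation.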
[jira] [Assigned] (SPARK-45256) Arrow DurationWriter fails when vector is at capacity
[ https://issues.apache.org/jira/browse/SPARK-45256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45256: - Assignee: Sander Goos > Arrow DurationWriter fails when vector is at capacity > - > > Key: SPARK-45256 > URL: https://issues.apache.org/jira/browse/SPARK-45256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1 >Reporter: Sander Goos >Assignee: Sander Goos >Priority: Major > Labels: pull-request-available > > The DurationWriter fails if more values are written than the initial capacity > of the DurationVector (4032). Fix by using `setSafe` instead of `set` method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45280) Change Maven daily test to use Java 17 for testing
[ https://issues.apache.org/jira/browse/SPARK-45280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45280: - Assignee: Yang Jie > Change Maven daily test use Java 17 for testing. > > > Key: SPARK-45280 > URL: https://issues.apache.org/jira/browse/SPARK-45280 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45280) Change Maven daily test to use Java 17 for testing
[ https://issues.apache.org/jira/browse/SPARK-45280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45280. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43057 [https://github.com/apache/spark/pull/43057] > Change Maven daily test use Java 17 for testing. > > > Key: SPARK-45280 > URL: https://issues.apache.org/jira/browse/SPARK-45280 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36321) Do not fail application in kubernetes if name is too long
[ https://issues.apache.org/jira/browse/SPARK-36321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-36321: --- Labels: pull-request-available (was: ) > Do not fail application in kubernetes if name is too long > - > > Key: SPARK-36321 > URL: https://issues.apache.org/jira/browse/SPARK-36321 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > Labels: pull-request-available > > If we have a long spark app name and start it with a k8s master, we will get the > following exception. > {code:java} > java.lang.IllegalArgumentException: > 'a-89fe2f7ae71c3570' in > spark.kubernetes.executor.podNamePrefix is invalid. must conform > https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names > and the value length <= 47 > at > org.apache.spark.internal.config.TypedConfigBuilder.$anonfun$checkValue$1(ConfigBuilder.scala:108) > at > org.apache.spark.internal.config.TypedConfigBuilder.$anonfun$transform$1(ConfigBuilder.scala:101) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.internal.config.OptionalConfigEntry.readFrom(ConfigEntry.scala:239) > at > org.apache.spark.internal.config.OptionalConfigEntry.readFrom(ConfigEntry.scala:214) > at org.apache.spark.SparkConf.get(SparkConf.scala:261) > at > org.apache.spark.deploy.k8s.KubernetesConf.get(KubernetesConf.scala:67) > at > org.apache.spark.deploy.k8s.KubernetesExecutorConf.<init>(KubernetesConf.scala:147) > at > org.apache.spark.deploy.k8s.KubernetesConf$.createExecutorConf(KubernetesConf.scala:231) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$2(ExecutorPodsAllocator.scala:367) > {code} > Using the app name as the executor pod name is Spark-internal behavior, and it > should not cause application failure.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
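One way to address the failure above is to derive a conforming prefix from the app name instead of rejecting it. The sketch below is hypothetical — `toPodNamePrefix` is an illustrative helper, not Spark's actual fix — and encodes the constraint from the exception: a DNS label of lowercase alphanumerics and `-`, starting and ending alphanumeric, with length <= 47:

```java
public class PodNamePrefix {
    // Hypothetical helper: sanitize an app name into a valid executor pod
    // name prefix rather than failing config validation.
    static String toPodNamePrefix(String appName) {
        String s = appName.toLowerCase().replaceAll("[^a-z0-9-]", "-");
        s = s.replaceAll("^-+", "");    // must not start with '-'
        if (s.length() > 47) {
            s = s.substring(0, 47);     // enforce the length limit
        }
        s = s.replaceAll("-+$", "");    // must not end with '-'
        return s;
    }

    public static void main(String[] args) {
        String longName = "My Very Long Spark App Name-89fe2f7ae71c3570-with-extra-suffix";
        String prefix = toPodNamePrefix(longName);
        System.out.println(prefix.length() <= 47);                       // true
        System.out.println(prefix.matches("[a-z0-9]([-a-z0-9]*[a-z0-9])?")); // true
    }
}
```

The trade-off is that truncation can make two distinct app names collide, so a real implementation would likely keep a unique suffix (such as the application ID) intact while truncating only the human-readable part.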
[jira] [Comment Edited] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768040#comment-17768040 ] Faiz Halde edited comment on SPARK-45255 at 9/22/23 2:31 PM: - to get past the error org/sparkproject/connect/client/com/google/common/cache/CacheLoader even after adding guava library, you need to copy their shading rules ``` (assembly / assemblyShadeRules) := Seq( ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.client.io.grpc.@1").inAll, ShadeRule.rename("com.google.**" -> "org.sparkproject.connect.client.com.google.@1").inAll, ShadeRule.rename("io.netty.**" -> "org.sparkproject.connect.client.io.netty.@1").inAll, ShadeRule.rename("org.checkerframework.**" -> "org.sparkproject.connect.client.org.checkerframework.@1").inAll, ShadeRule.rename("javax.annotation.**" -> "org.sparkproject.connect.client.javax.annotation.@1").inAll, ShadeRule.rename("io.perfmark.**" -> "org.sparkproject.connect.client.io.perfmark.@1").inAll, ShadeRule.rename("org.codehaus.**" -> "org.sparkproject.connect.client.org.codehaus.@1").inAll, ShadeRule.rename("android.annotation.**" -> "org.sparkproject.connect.client.android.annotation.@1").inAll ), ``` was (Author: JIRAUSER300204): to get pas the error org/sparkproject/connect/client/com/google/common/cache/CacheLoader even after adding guava library, you need to copy their shading rules ``` (assembly / assemblyShadeRules) := Seq( ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.client.io.grpc.@1").inAll, ShadeRule.rename("com.google.**" -> "org.sparkproject.connect.client.com.google.@1").inAll, ShadeRule.rename("io.netty.**" -> "org.sparkproject.connect.client.io.netty.@1").inAll, ShadeRule.rename("org.checkerframework.**" -> "org.sparkproject.connect.client.org.checkerframework.@1").inAll, ShadeRule.rename("javax.annotation.**" -> "org.sparkproject.connect.client.javax.annotation.@1").inAll, ShadeRule.rename("io.perfmark.**" -> 
"org.sparkproject.connect.client.io.perfmark.@1").inAll, ShadeRule.rename("org.codehaus.**" -> "org.sparkproject.connect.client.org.codehaus.@1").inAll, ShadeRule.rename("android.annotation.**" -> "org.sparkproject.connect.client.android.annotation.@1").inAll ), ``` > Spark connect client failing with java.lang.NoClassDefFoundError > > > Key: SPARK-45255 > URL: https://issues.apache.org/jira/browse/SPARK-45255 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > java 1.8, sbt 1.9, scala 2.12 > > I have a very simple repo with the following dependency in `build.sbt` > ``` > {{libraryDependencies ++= Seq("org.apache.spark" %% > "spark-connect-client-jvm" % "3.5.0")}} > ``` > A simple application > ``` > {{object Main extends App {}} > {{ val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{ s.read.json("/tmp/input.json").repartition(10).show(false)}} > {{}}} > ``` > But when I run it, I get the following error > > ``` > {{Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/connect/client/com/google/common/cache/CacheLoader}} > {{ at Main$.delayedEndpoint$Main$1(Main.scala:4)}} > {{ at Main$delayedInit$body.apply(Main.scala:3)}} > {{ at scala.Function0.apply$mcV$sp(Function0.scala:39)}} > {{ at scala.Function0.apply$mcV$sp$(Function0.scala:39)}} > {{ at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}} > {{ at scala.App.$anonfun$main$1$adapted(App.scala:80)}} > {{ at scala.collection.immutable.List.foreach(List.scala:431)}} > {{ at scala.App.main(App.scala:80)}} > {{ at scala.App.main$(App.scala:78)}} > {{ at Main$.main(Main.scala:3)}} > {{ at Main.main(Main.scala)}} > {{Caused by: java.lang.ClassNotFoundException: > org.sparkproject.connect.client.com.google.common.cache.CacheLoader}} > {{ at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}} > {{ 
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}} > {{ ... 11 more}} > ``` > I know the connect client does a bunch of shading during assembly so it could be > related to that. This application is not started via spark-submit or > anything, nor is it run under a `SPARK_HOME` (I guess that's the > whole point of the connect client). > > EDIT > Not sure if it's the right mitigation, but explicitly adding guava worked; > now I am in the 2nd territory of error > {{Sep 21, 2023 8:21:59 PM >
[jira] [Comment Edited] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768040#comment-17768040 ] Faiz Halde edited comment on SPARK-45255 at 9/22/23 2:31 PM: - to get past the error `org/sparkproject/connect/client/com/google/common/cache/CacheLoader` even after adding guava library, you need to copy their shading rules ``` (assembly / assemblyShadeRules) := Seq( ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.client.io.grpc.@1").inAll, ShadeRule.rename("com.google.**" -> "org.sparkproject.connect.client.com.google.@1").inAll, ShadeRule.rename("io.netty.**" -> "org.sparkproject.connect.client.io.netty.@1").inAll, ShadeRule.rename("org.checkerframework.**" -> "org.sparkproject.connect.client.org.checkerframework.@1").inAll, ShadeRule.rename("javax.annotation.**" -> "org.sparkproject.connect.client.javax.annotation.@1").inAll, ShadeRule.rename("io.perfmark.**" -> "org.sparkproject.connect.client.io.perfmark.@1").inAll, ShadeRule.rename("org.codehaus.**" -> "org.sparkproject.connect.client.org.codehaus.@1").inAll, ShadeRule.rename("android.annotation.**" -> "org.sparkproject.connect.client.android.annotation.@1").inAll ), ``` was (Author: JIRAUSER300204): to get past the error org/sparkproject/connect/client/com/google/common/cache/CacheLoader even after adding guava library, you need to copy their shading rules ``` (assembly / assemblyShadeRules) := Seq( ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.client.io.grpc.@1").inAll, ShadeRule.rename("com.google.**" -> "org.sparkproject.connect.client.com.google.@1").inAll, ShadeRule.rename("io.netty.**" -> "org.sparkproject.connect.client.io.netty.@1").inAll, ShadeRule.rename("org.checkerframework.**" -> "org.sparkproject.connect.client.org.checkerframework.@1").inAll, ShadeRule.rename("javax.annotation.**" -> "org.sparkproject.connect.client.javax.annotation.@1").inAll, ShadeRule.rename("io.perfmark.**" -> 
"org.sparkproject.connect.client.io.perfmark.@1").inAll, ShadeRule.rename("org.codehaus.**" -> "org.sparkproject.connect.client.org.codehaus.@1").inAll, ShadeRule.rename("android.annotation.**" -> "org.sparkproject.connect.client.android.annotation.@1").inAll ), ``` > Spark connect client failing with java.lang.NoClassDefFoundError > > > Key: SPARK-45255 > URL: https://issues.apache.org/jira/browse/SPARK-45255 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > java 1.8, sbt 1.9, scala 2.12 > > I have a very simple repo with the following dependency in `build.sbt` > ``` > {{libraryDependencies ++= Seq("org.apache.spark" %% > "spark-connect-client-jvm" % "3.5.0")}} > ``` > A simple application > ``` > {{object Main extends App {}} > {{ val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{ s.read.json("/tmp/input.json").repartition(10).show(false)}} > {{}}} > ``` > But when I run it, I get the following error > > ``` > {{Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/connect/client/com/google/common/cache/CacheLoader}} > {{ at Main$.delayedEndpoint$Main$1(Main.scala:4)}} > {{ at Main$delayedInit$body.apply(Main.scala:3)}} > {{ at scala.Function0.apply$mcV$sp(Function0.scala:39)}} > {{ at scala.Function0.apply$mcV$sp$(Function0.scala:39)}} > {{ at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}} > {{ at scala.App.$anonfun$main$1$adapted(App.scala:80)}} > {{ at scala.collection.immutable.List.foreach(List.scala:431)}} > {{ at scala.App.main(App.scala:80)}} > {{ at scala.App.main$(App.scala:78)}} > {{ at Main$.main(Main.scala:3)}} > {{ at Main.main(Main.scala)}} > {{Caused by: java.lang.ClassNotFoundException: > org.sparkproject.connect.client.com.google.common.cache.CacheLoader}} > {{ at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}} > {{ 
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}} > {{ ... 11 more}} > ``` > I know the connect does a bunch of shading during assembly so it could be > related to that. This application is not started via spark-submit or > anything. It's not run neither under a `SPARK_HOME` ( I guess that's the > whole point of connect client ) > > EDIT > Not sure if it's the right mitigation but explicitly adding guava worked but > now I am in the 2nd territory of error > {{Sep 21, 2023 8:21:59 PM >
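[Editor's note] For readers assembling a client jar with sbt, the pieces quoted in this thread fit together roughly as the build.sbt fragment below. This is only a sketch, not an official recipe: it assumes the sbt-assembly plugin is already enabled in project/plugins.sbt, and the rename prefixes are copied verbatim from the comment above (they must match the prefixes baked into the spark-connect-client-jvm artifact you actually depend on).
```scala
// build.sbt (sketch): dependency + shading rules mentioned in this thread.
// Assumes sbt-assembly is enabled in project/plugins.sbt.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.0"
)

// Rename prefixes copied from the comment above; they mirror the shading
// the Spark Connect client performs during its own assembly.
(assembly / assemblyShadeRules) := Seq(
  ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.client.io.grpc.@1").inAll,
  ShadeRule.rename("com.google.**" -> "org.sparkproject.connect.client.com.google.@1").inAll,
  ShadeRule.rename("io.netty.**" -> "org.sparkproject.connect.client.io.netty.@1").inAll,
  ShadeRule.rename("org.checkerframework.**" -> "org.sparkproject.connect.client.org.checkerframework.@1").inAll,
  ShadeRule.rename("javax.annotation.**" -> "org.sparkproject.connect.client.javax.annotation.@1").inAll,
  ShadeRule.rename("io.perfmark.**" -> "org.sparkproject.connect.client.io.perfmark.@1").inAll,
  ShadeRule.rename("org.codehaus.**" -> "org.sparkproject.connect.client.org.codehaus.@1").inAll,
  ShadeRule.rename("android.annotation.**" -> "org.sparkproject.connect.client.android.annotation.@1").inAll
)
```
Note that the shade rules only apply to the jar produced by `sbt assembly`, not to `sbt run`, so the application must be launched from the assembled fat jar.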
[jira] [Commented] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768040#comment-17768040 ] Faiz Halde commented on SPARK-45255: to get past the error org/sparkproject/connect/client/com/google/common/cache/CacheLoader even after adding the guava library, you need to copy their shading rules ``` (assembly / assemblyShadeRules) := Seq( ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.client.io.grpc.@1").inAll, ShadeRule.rename("com.google.**" -> "org.sparkproject.connect.client.com.google.@1").inAll, ShadeRule.rename("io.netty.**" -> "org.sparkproject.connect.client.io.netty.@1").inAll, ShadeRule.rename("org.checkerframework.**" -> "org.sparkproject.connect.client.org.checkerframework.@1").inAll, ShadeRule.rename("javax.annotation.**" -> "org.sparkproject.connect.client.javax.annotation.@1").inAll, ShadeRule.rename("io.perfmark.**" -> "org.sparkproject.connect.client.io.perfmark.@1").inAll, ShadeRule.rename("org.codehaus.**" -> "org.sparkproject.connect.client.org.codehaus.@1").inAll, ShadeRule.rename("android.annotation.**" -> "org.sparkproject.connect.client.android.annotation.@1").inAll ), ``` > Spark connect client failing with java.lang.NoClassDefFoundError > > > Key: SPARK-45255 > URL: https://issues.apache.org/jira/browse/SPARK-45255 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > java 1.8, sbt 1.9, scala 2.12 > > I have a very simple repo with the following dependency in `build.sbt` > ``` > {{libraryDependencies ++= Seq("org.apache.spark" %% > "spark-connect-client-jvm" % "3.5.0")}} > ``` > A simple application > ``` > {{object Main extends App {}} > {{ val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{ s.read.json("/tmp/input.json").repartition(10).show(false)}} > {{}}} > ``` > But when I run it, I get the following error > > ``` > {{Exception in thread "main" java.lang.NoClassDefFoundError: >
org/sparkproject/connect/client/com/google/common/cache/CacheLoader}} > {{ at Main$.delayedEndpoint$Main$1(Main.scala:4)}} > {{ at Main$delayedInit$body.apply(Main.scala:3)}} > {{ at scala.Function0.apply$mcV$sp(Function0.scala:39)}} > {{ at scala.Function0.apply$mcV$sp$(Function0.scala:39)}} > {{ at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}} > {{ at scala.App.$anonfun$main$1$adapted(App.scala:80)}} > {{ at scala.collection.immutable.List.foreach(List.scala:431)}} > {{ at scala.App.main(App.scala:80)}} > {{ at scala.App.main$(App.scala:78)}} > {{ at Main$.main(Main.scala:3)}} > {{ at Main.main(Main.scala)}} > {{Caused by: java.lang.ClassNotFoundException: > org.sparkproject.connect.client.com.google.common.cache.CacheLoader}} > {{ at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}} > {{ at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}} > {{ ... 11 more}} > ``` > I know the connect does a bunch of shading during assembly so it could be > related to that. This application is not started via spark-submit or > anything. It's not run neither under a `SPARK_HOME` ( I guess that's the > whole point of connect client ) > > EDIT > Not sure if it's the right mitigation but explicitly adding guava worked but > now I am in the 2nd territory of error > {{Sep 21, 2023 8:21:59 PM > org.sparkproject.connect.client.io.grpc.NameResolverRegistry > getDefaultRegistry}} > {{WARNING: No NameResolverProviders found via ServiceLoader, including for > DNS. This is probably due to a broken build. 
If using ProGuard, check your > configuration}} > {{Exception in thread "main" > org.sparkproject.connect.client.com.google.common.util.concurrent.UncheckedExecutionException: > > org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry$ProviderNotFoundException: > No functional channel service provider found. Try adding a dependency on the > grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2085)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache.get(LocalCache.java:4011)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)}} > {{ at >
[jira] [Commented] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768037#comment-17768037 ] Faiz Halde commented on SPARK-45255: For now, I unblocked myself by manually building spark connect {{build/mvn -Pconnect -DskipTests clean package}} {{and then running}} {{mkdir connect-jars}} {{./bin/spark-connect-scala-client-classpath | tr ':' '\n' | xargs -I{} cp {} connect-jars}} {{Then, in your client application, have the connect-jars directory in your classpath. Not sure if this is the right way though}} > Spark connect client failing with java.lang.NoClassDefFoundError > > > Key: SPARK-45255 > URL: https://issues.apache.org/jira/browse/SPARK-45255 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > java 1.8, sbt 1.9, scala 2.12 > > I have a very simple repo with the following dependency in `build.sbt` > ``` > {{libraryDependencies ++= Seq("org.apache.spark" %% > "spark-connect-client-jvm" % "3.5.0")}} > ``` > A simple application > ``` > {{object Main extends App {}} > {{ val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{ s.read.json("/tmp/input.json").repartition(10).show(false)}} > {{}}} > ``` > But when I run it, I get the following error > > ``` > {{Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/connect/client/com/google/common/cache/CacheLoader}} > {{ at Main$.delayedEndpoint$Main$1(Main.scala:4)}} > {{ at Main$delayedInit$body.apply(Main.scala:3)}} > {{ at scala.Function0.apply$mcV$sp(Function0.scala:39)}} > {{ at scala.Function0.apply$mcV$sp$(Function0.scala:39)}} > {{ at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}} > {{ at scala.App.$anonfun$main$1$adapted(App.scala:80)}} > {{ at scala.collection.immutable.List.foreach(List.scala:431)}} > {{ at scala.App.main(App.scala:80)}} > {{ at scala.App.main$(App.scala:78)}} > {{ at 
Main$.main(Main.scala:3)}} > {{ at Main.main(Main.scala)}} > {{Caused by: java.lang.ClassNotFoundException: > org.sparkproject.connect.client.com.google.common.cache.CacheLoader}} > {{ at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}} > {{ at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}} > {{ ... 11 more}} > ``` > I know the connect does a bunch of shading during assembly so it could be > related to that. This application is not started via spark-submit or > anything. It's not run neither under a `SPARK_HOME` ( I guess that's the > whole point of connect client ) > > EDIT > Not sure if it's the right mitigation but explicitly adding guava worked but > now I am in the 2nd territory of error > {{Sep 21, 2023 8:21:59 PM > org.sparkproject.connect.client.io.grpc.NameResolverRegistry > getDefaultRegistry}} > {{WARNING: No NameResolverProviders found via ServiceLoader, including for > DNS. This is probably due to a broken build. If using ProGuard, check your > configuration}} > {{Exception in thread "main" > org.sparkproject.connect.client.com.google.common.util.concurrent.UncheckedExecutionException: > > org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry$ProviderNotFoundException: > No functional channel service provider found. 
Try adding a dependency on the > grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2085)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache.get(LocalCache.java:4011)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)}} > {{ at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:945)}} > {{ at scala.Option.getOrElse(Option.scala:189)}} > {{ at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:945)}} > {{ at Main$.delayedEndpoint$Main$1(Main.scala:4)}} > {{ at Main$delayedInit$body.apply(Main.scala:3)}} > {{ at scala.Function0.apply$mcV$sp(Function0.scala:39)}} > {{ at scala.Function0.apply$mcV$sp$(Function0.scala:39)}} > {{ at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}} > {{ at scala.App.$anonfun$main$1$adapted(App.scala:80)}} > {{ at scala.collection.immutable.List.foreach(List.scala:431)}} > {{ at scala.App.main(App.scala:80)}} > {{
[jira] [Comment Edited] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768037#comment-17768037 ] Faiz Halde edited comment on SPARK-45255 at 9/22/23 2:29 PM: - For now, I unblocked myself by manually building spark connect {{build/mvn -Pconnect -DskipTests clean package}} {{and then running}} {{mkdir connect-jars}} {{./bin/spark-connect-scala-client-classpath | tr ':' '\n' | xargs -I{} cp {} connect-jars}} {{Then, when starting your client application, have the connect-jars directory in your classpath. Not sure if this is the right way though}} was (Author: JIRAUSER300204): For now, I unblocked myself by manually building spark connect {{build/mvn -Pconnect -DskipTests clean package}} {{and then running}} {{mkdir connect-jars}} {{./bin/spark-connect-scala-client-classpath | tr ':' '\n' | xargs -I{} cp {} connect-jars}} {{Then, in your client application, have the connect-jars directory in your classpath. Not sure if this is the right way though}} > Spark connect client failing with java.lang.NoClassDefFoundError > > > Key: SPARK-45255 > URL: https://issues.apache.org/jira/browse/SPARK-45255 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > java 1.8, sbt 1.9, scala 2.12 > > I have a very simple repo with the following dependency in `build.sbt` > ``` > {{libraryDependencies ++= Seq("org.apache.spark" %% > "spark-connect-client-jvm" % "3.5.0")}} > ``` > A simple application > ``` > {{object Main extends App {}} > {{ val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{ s.read.json("/tmp/input.json").repartition(10).show(false)}} > {{}}} > ``` > But when I run it, I get the following error > > ``` > {{Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/connect/client/com/google/common/cache/CacheLoader}} > {{ at Main$.delayedEndpoint$Main$1(Main.scala:4)}} > {{ at 
Main$delayedInit$body.apply(Main.scala:3)}} > {{ at scala.Function0.apply$mcV$sp(Function0.scala:39)}} > {{ at scala.Function0.apply$mcV$sp$(Function0.scala:39)}} > {{ at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}} > {{ at scala.App.$anonfun$main$1$adapted(App.scala:80)}} > {{ at scala.collection.immutable.List.foreach(List.scala:431)}} > {{ at scala.App.main(App.scala:80)}} > {{ at scala.App.main$(App.scala:78)}} > {{ at Main$.main(Main.scala:3)}} > {{ at Main.main(Main.scala)}} > {{Caused by: java.lang.ClassNotFoundException: > org.sparkproject.connect.client.com.google.common.cache.CacheLoader}} > {{ at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}} > {{ at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}} > {{ ... 11 more}} > ``` > I know the connect does a bunch of shading during assembly so it could be > related to that. This application is not started via spark-submit or > anything. It's not run neither under a `SPARK_HOME` ( I guess that's the > whole point of connect client ) > > EDIT > Not sure if it's the right mitigation but explicitly adding guava worked but > now I am in the 2nd territory of error > {{Sep 21, 2023 8:21:59 PM > org.sparkproject.connect.client.io.grpc.NameResolverRegistry > getDefaultRegistry}} > {{WARNING: No NameResolverProviders found via ServiceLoader, including for > DNS. This is probably due to a broken build. If using ProGuard, check your > configuration}} > {{Exception in thread "main" > org.sparkproject.connect.client.com.google.common.util.concurrent.UncheckedExecutionException: > > org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry$ProviderNotFoundException: > No functional channel service provider found. 
Try adding a dependency on the > grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2085)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache.get(LocalCache.java:4011)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)}} > {{ at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:945)}} > {{ at scala.Option.getOrElse(Option.scala:189)}} > {{ at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:945)}} > {{ at
[jira] [Updated] (SPARK-45280) Change Maven daily test use Java 17 for testing.
[ https://issues.apache.org/jira/browse/SPARK-45280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45280: --- Labels: pull-request-available (was: ) > Change Maven daily test use Java 17 for testing. > > > Key: SPARK-45280 > URL: https://issues.apache.org/jira/browse/SPARK-45280 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45280) Change Maven daily test use Java 17 for testing.
Yang Jie created SPARK-45280: Summary: Change Maven daily test use Java 17 for testing. Key: SPARK-45280 URL: https://issues.apache.org/jira/browse/SPARK-45280 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768005#comment-17768005 ] Aleksandr Aleksandrov commented on SPARK-45255: --- I have the same issue. But adding guava dependency didn't help me {code:java} Exception in thread "main" java.lang.NoClassDefFoundError: org/sparkproject/connect/client/com/google/common/cache/CacheLoader at ... Caused by: java.lang.ClassNotFoundException: org.sparkproject.connect.client.com.google.common.cache.CacheLoader at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522) ... 2 more{code} > Spark connect client failing with java.lang.NoClassDefFoundError > > > Key: SPARK-45255 > URL: https://issues.apache.org/jira/browse/SPARK-45255 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > java 1.8, sbt 1.9, scala 2.12 > > I have a very simple repo with the following dependency in `build.sbt` > ``` > {{libraryDependencies ++= Seq("org.apache.spark" %% > "spark-connect-client-jvm" % "3.5.0")}} > ``` > A simple application > ``` > {{object Main extends App {}} > {{ val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{ s.read.json("/tmp/input.json").repartition(10).show(false)}} > {{}}} > ``` > But when I run it, I get the following error > > ``` > {{Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/connect/client/com/google/common/cache/CacheLoader}} > {{ at Main$.delayedEndpoint$Main$1(Main.scala:4)}} > {{ at Main$delayedInit$body.apply(Main.scala:3)}} > {{ at scala.Function0.apply$mcV$sp(Function0.scala:39)}} > {{ at scala.Function0.apply$mcV$sp$(Function0.scala:39)}} > {{ at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}} > 
{{ at scala.App.$anonfun$main$1$adapted(App.scala:80)}} > {{ at scala.collection.immutable.List.foreach(List.scala:431)}} > {{ at scala.App.main(App.scala:80)}} > {{ at scala.App.main$(App.scala:78)}} > {{ at Main$.main(Main.scala:3)}} > {{ at Main.main(Main.scala)}} > {{Caused by: java.lang.ClassNotFoundException: > org.sparkproject.connect.client.com.google.common.cache.CacheLoader}} > {{ at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}} > {{ at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}} > {{ at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}} > {{ ... 11 more}} > ``` > I know the connect does a bunch of shading during assembly so it could be > related to that. This application is not started via spark-submit or > anything. It's not run neither under a `SPARK_HOME` ( I guess that's the > whole point of connect client ) > > EDIT > Not sure if it's the right mitigation but explicitly adding guava worked but > now I am in the 2nd territory of error > {{Sep 21, 2023 8:21:59 PM > org.sparkproject.connect.client.io.grpc.NameResolverRegistry > getDefaultRegistry}} > {{WARNING: No NameResolverProviders found via ServiceLoader, including for > DNS. This is probably due to a broken build. If using ProGuard, check your > configuration}} > {{Exception in thread "main" > org.sparkproject.connect.client.com.google.common.util.concurrent.UncheckedExecutionException: > > org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry$ProviderNotFoundException: > No functional channel service provider found. 
Try adding a dependency on the > grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2085)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache.get(LocalCache.java:4011)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034)}} > {{ at > org.sparkproject.connect.client.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)}} > {{ at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:945)}} > {{ at scala.Option.getOrElse(Option.scala:189)}} > {{ at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:945)}} > {{ at Main$.delayedEndpoint$Main$1(Main.scala:4)}} > {{ at Main$delayedInit$body.apply(Main.scala:3)}} > {{ at scala.Function0.apply$mcV$sp(Function0.scala:39)}} > {{ at scala.Function0.apply$mcV$sp$(Function0.scala:39)}} > {{ at >
[jira] [Updated] (SPARK-45277) Install Java 17 for Windows SparkR test
[ https://issues.apache.org/jira/browse/SPARK-45277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-45277: - Summary: Install Java 17 for Windows SparkR test (was: Install a Java 17 for windows SparkR test) > Install Java 17 for Windows SparkR test > --- > > Key: SPARK-45277 > URL: https://issues.apache.org/jira/browse/SPARK-45277 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45277) Install Java 17 for Windows SparkR test
[ https://issues.apache.org/jira/browse/SPARK-45277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45277: --- Labels: pull-request-available (was: ) > Install Java 17 for Windows SparkR test > --- > > Key: SPARK-45277 > URL: https://issues.apache.org/jira/browse/SPARK-45277 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45247) Upgrade Pandas to 2.1.1
[ https://issues.apache.org/jira/browse/SPARK-45247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45247: - Assignee: Haejoon Lee > Upgrade Pandas to 2.1.1 > --- > > Key: SPARK-45247 > URL: https://issues.apache.org/jira/browse/SPARK-45247 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > > https://pandas.pydata.org/pandas-docs/dev/whatsnew/v2.1.1.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45247) Upgrade Pandas to 2.1.1
[ https://issues.apache.org/jira/browse/SPARK-45247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45247. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43025 [https://github.com/apache/spark/pull/43025] > Upgrade Pandas to 2.1.1 > --- > > Key: SPARK-45247 > URL: https://issues.apache.org/jira/browse/SPARK-45247 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > https://pandas.pydata.org/pandas-docs/dev/whatsnew/v2.1.1.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44112) Drop Java 8 and 11 support
[ https://issues.apache.org/jira/browse/SPARK-44112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44112: - Assignee: Yang Jie > Drop Java 8 and 11 support > -- > > Key: SPARK-44112 > URL: https://issues.apache.org/jira/browse/SPARK-44112 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Attachments: image-2023-09-20-10-52-59-327.png, > image-2023-09-20-10-53-34-956.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44112) Drop Java 8 and 11 support
[ https://issues.apache.org/jira/browse/SPARK-44112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44112. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43005 [https://github.com/apache/spark/pull/43005] > Drop Java 8 and 11 support > -- > > Key: SPARK-44112 > URL: https://issues.apache.org/jira/browse/SPARK-44112 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: image-2023-09-20-10-52-59-327.png, > image-2023-09-20-10-53-34-956.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43655) Enable NamespaceParityTests.test_get_index_map
[ https://issues.apache.org/jira/browse/SPARK-43655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43655. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43052 [https://github.com/apache/spark/pull/43052] > Enable NamespaceParityTests.test_get_index_map > -- > > Key: SPARK-43655 > URL: https://issues.apache.org/jira/browse/SPARK-43655 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Enable NamespaceParityTests.test_get_index_map -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43655) Enable NamespaceParityTests.test_get_index_map
[ https://issues.apache.org/jira/browse/SPARK-43655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43655: - Assignee: Haejoon Lee > Enable NamespaceParityTests.test_get_index_map > -- > > Key: SPARK-43655 > URL: https://issues.apache.org/jira/browse/SPARK-43655 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Enable NamespaceParityTests.test_get_index_map -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45278) Make Yarn executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-45278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hendra Saputra updated SPARK-45278: --- Fix Version/s: (was: 4.0.0) (was: 3.5.1) > Make Yarn executor's bindAddress configurable > - > > Key: SPARK-45278 > URL: https://issues.apache.org/jira/browse/SPARK-45278 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Hendra Saputra >Assignee: Nishchal Venkataramana >Priority: Major > Labels: pull-request-available > > An improvement was made in SPARK-24203 so that the executor's bind address > is configurable. Unfortunately, this configuration hasn't been implemented in > Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to bind > the executor to the loopback interface or all interfaces. This Jira is to > allow Yarn to bind the executor to either the pod IP, the loopback interface, > or all interfaces, to allow mesh integrations like Istio with the cluster. > Another linked Jira, SPARK-42411, explained *Allowing binding to all IPs* > very well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45278) Make Yarn executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-45278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hendra Saputra updated SPARK-45278: --- Fix Version/s: 4.0.0 3.5.1 (was: 3.0.0) Affects Version/s: 3.5.0 (was: 4.0.0) (was: 3.5.1) > Make Yarn executor's bindAddress configurable > - > > Key: SPARK-45278 > URL: https://issues.apache.org/jira/browse/SPARK-45278 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Hendra Saputra >Assignee: Nishchal Venkataramana >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.1 > > > An improvement was made in SPARK-24203 so that the executor's bind address > is configurable. Unfortunately, this configuration hasn't been implemented in > Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to bind > the executor to the loopback interface or all interfaces. This Jira is to > allow Yarn to bind the executor to either the pod IP, the loopback interface, > or all interfaces, to allow mesh integrations like Istio with the cluster. > Another linked Jira, SPARK-42411, explained *Allowing binding to all IPs* > very well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45279) Attach plan_id for all logical plan
[ https://issues.apache.org/jira/browse/SPARK-45279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45279: --- Labels: pull-request-available (was: ) > Attach plan_id for all logical plan > --- > > Key: SPARK-45279 > URL: https://issues.apache.org/jira/browse/SPARK-45279 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45278) Make Yarn executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-45278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767932#comment-17767932 ] Hendra Saputra commented on SPARK-45278: PR is up for review [https://github.com/apache/spark/pull/42870]. Thanks > Make Yarn executor's bindAddress configurable > - > > Key: SPARK-45278 > URL: https://issues.apache.org/jira/browse/SPARK-45278 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1 >Reporter: Hendra Saputra >Assignee: Nishchal Venkataramana >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > > An improvement was made in SPARK-24203 so that the executor's bind address > is configurable. Unfortunately, this configuration hasn't been implemented in > Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to bind > the executor to the loopback interface or all interfaces. This Jira is to > allow Yarn to bind the executor to either the pod IP, the loopback interface, > or all interfaces, to allow mesh integrations like Istio with the cluster. > Another linked Jira, SPARK-42411, explained *Allowing binding to all IPs* > very well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45279) Attach plan_id for all logical plan
Ruifeng Zheng created SPARK-45279: - Summary: Attach plan_id for all logical plan Key: SPARK-45279 URL: https://issues.apache.org/jira/browse/SPARK-45279 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45278) Make Yarn executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-45278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hendra Saputra updated SPARK-45278: --- Description: An improvement was made in SPARK-24203 so that the executor's bind address is configurable. Unfortunately, this configuration has not been implemented for Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to bind the executor to the loopback interface or to all interfaces. This Jira is to allow Yarn to bind the executor to the pod IP, the loopback interface, or all interfaces, enabling service-mesh integration such as Istio with the cluster. Another linked Jira, SPARK-42411, explained *Allowing binding to all IPs* very well. was: An improvement was made in SPARK-24203 so that the executor's bind address is configurable. Unfortunately, this configuration has not been implemented for Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to bind the executor to the loopback interface or to all interfaces. This Jira is to allow Yarn to bind the executor to the pod IP, the loopback interface, or all interfaces, enabling service-mesh integration such as Istio with the cluster. Another linked Jira, SPARK-42411, explained this very well. > Make Yarn executor's bindAddress configurable > - > > Key: SPARK-45278 > URL: https://issues.apache.org/jira/browse/SPARK-45278 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1 >Reporter: Hendra Saputra >Assignee: Nishchal Venkataramana >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > > An improvement was made in SPARK-24203 so that the executor's bind address > is configurable. Unfortunately, this configuration has not been implemented > for Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to > bind the executor to the loopback interface or to all interfaces. This Jira > is to allow Yarn to bind the executor to the pod IP, the loopback interface, > or all interfaces, enabling service-mesh integration such as Istio with the > cluster. 
> Another linked Jira, SPARK-42411, explained *Allowing binding to all IPs* > very well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45278) Make Yarn executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-45278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hendra Saputra updated SPARK-45278: --- Description: An improvement was made in SPARK-24203 so that the executor's bind address is configurable. Unfortunately, this configuration has not been implemented for Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to bind the executor to the loopback interface or to all interfaces. This Jira is to allow Yarn to bind the executor to the pod IP, the loopback interface, or all interfaces, enabling service-mesh integration such as Istio with the cluster. Another linked Jira, SPARK-42411, explained this very well. was:An improvement was made in SPARK-24203 so that the executor's bind address is configurable. Unfortunately, this configuration has not been implemented for Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to bind the executor to the loopback interface or to all interfaces. This Jira is to allow Yarn to bind the executor to the pod IP, the loopback interface, or all interfaces, enabling service-mesh integration such as Istio with the cluster. > Make Yarn executor's bindAddress configurable > - > > Key: SPARK-45278 > URL: https://issues.apache.org/jira/browse/SPARK-45278 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1 >Reporter: Hendra Saputra >Assignee: Nishchal Venkataramana >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > > An improvement was made in SPARK-24203 so that the executor's bind address > is configurable. Unfortunately, this configuration has not been implemented > for Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to > bind the executor to the loopback interface or to all interfaces. This Jira > is to allow Yarn to bind the executor to the pod IP, the loopback interface, > or all interfaces, enabling service-mesh integration such as Istio with the > cluster. 
> Another linked Jira, SPARK-42411, explained this very well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45278) Make Yarn executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-45278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hendra Saputra updated SPARK-45278: --- Description: An improvement was made in SPARK-24203 so that the executor's bind address is configurable. Unfortunately, this configuration has not been implemented for Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to bind the executor to the loopback interface or to all interfaces. This Jira is to allow Yarn to bind the executor to the pod IP, the loopback interface, or all interfaces, enabling service-mesh integration such as Istio with the cluster (was: Previous improvement is made that now Executor bind address is configurable in ) > Make Yarn executor's bindAddress configurable > - > > Key: SPARK-45278 > URL: https://issues.apache.org/jira/browse/SPARK-45278 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1 >Reporter: Hendra Saputra >Assignee: Nishchal Venkataramana >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > > An improvement was made in SPARK-24203 so that the executor's bind address > is configurable. Unfortunately, this configuration has not been implemented > for Yarn. When a Yarn cluster is deployed in Kubernetes, it is preferable to > bind the executor to the loopback interface or to all interfaces. This Jira > is to allow Yarn to bind the executor to the pod IP, the loopback interface, > or all interfaces, enabling service-mesh integration such as Istio with the > cluster -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45278) Make Yarn executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-45278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hendra Saputra updated SPARK-45278: --- Affects Version/s: 4.0.0 3.5.1 (was: 2.1.1) > Make Yarn executor's bindAddress configurable > - > > Key: SPARK-45278 > URL: https://issues.apache.org/jira/browse/SPARK-45278 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1 >Reporter: Hendra Saputra >Assignee: Nishchal Venkataramana >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45278) Make Yarn executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-45278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hendra Saputra updated SPARK-45278: --- Description: Previous improvement is made that now Executor bind address is configurable in > Make Yarn executor's bindAddress configurable > - > > Key: SPARK-45278 > URL: https://issues.apache.org/jira/browse/SPARK-45278 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1 >Reporter: Hendra Saputra >Assignee: Nishchal Venkataramana >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > > Previous improvement is made that now Executor bind address is configurable > in -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45278) Make Yarn executor's bindAddress configurable
Hendra Saputra created SPARK-45278: -- Summary: Make Yarn executor's bindAddress configurable Key: SPARK-45278 URL: https://issues.apache.org/jira/browse/SPARK-45278 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.1 Reporter: Hendra Saputra Assignee: Nishchal Venkataramana Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44112) Drop Java 8 and 11 support
[ https://issues.apache.org/jira/browse/SPARK-44112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-44112: -- Assignee: (was: Apache Spark) > Drop Java 8 and 11 support > -- > > Key: SPARK-44112 > URL: https://issues.apache.org/jira/browse/SPARK-44112 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Attachments: image-2023-09-20-10-52-59-327.png, > image-2023-09-20-10-53-34-956.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44112) Drop Java 8 and 11 support
[ https://issues.apache.org/jira/browse/SPARK-44112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-44112: -- Assignee: Apache Spark > Drop Java 8 and 11 support > -- > > Key: SPARK-44112 > URL: https://issues.apache.org/jira/browse/SPARK-44112 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > Attachments: image-2023-09-20-10-52-59-327.png, > image-2023-09-20-10-53-34-956.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45277) Install a Java 17 for windows SparkR test
Yang Jie created SPARK-45277: Summary: Install a Java 17 for windows SparkR test Key: SPARK-45277 URL: https://issues.apache.org/jira/browse/SPARK-45277 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45276) Replace Java 8 and Java 11 installed in the Dockerfile with Java
Yang Jie created SPARK-45276: Summary: Replace Java 8 and Java 11 installed in the Dockerfile with Java Key: SPARK-45276 URL: https://issues.apache.org/jira/browse/SPARK-45276 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yang Jie This includes dev/create-release/spark-rm/Dockerfile and connector/docker/spark-test/base/Dockerfile. There might be others as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45175) download krb5.conf from remote storage in spark-submit on k8s
[ https://issues.apache.org/jira/browse/SPARK-45175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767908#comment-17767908 ] Qian Sun commented on SPARK-45175: -- In multi-tenant scenarios, I find that Apache Spark provides *{{spark.kubernetes.kerberos.krb5.configMapName}}* to mount a ConfigMap containing the {{*krb5.conf*}} file; we could manage these files by creating a separate ConfigMap per tenant. > download krb5.conf from remote storage in spark-submit on k8s > - > > Key: SPARK-45175 > URL: https://issues.apache.org/jira/browse/SPARK-45175 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.1 >Reporter: Qian Sun >Priority: Minor > Labels: pull-request-available > > krb5.conf currently only supports the local file format. Tenants would like > to save this file on their own servers and download it during the > spark-submit phase for better implementation of multi-tenant scenarios. The > proposed solution is to use the *downloadFile* function[1], similar to the > configuration of *spark.kubernetes.driver/executor.podTemplateFile* > > [1]https://github.com/apache/spark/blob/822f58f0d26b7d760469151a65eaf9ee863a07a1/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/PodTemplateConfigMapStep.scala#L82C24-L82C24 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
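A sketch of the ConfigMap-per-tenant workaround mentioned in the comment (the ConfigMap name `tenant-a-krb5`, file path, and API server address are hypothetical; `spark.kubernetes.kerberos.krb5.configMapName` is the documented Spark-on-K8s property):

```shell
# Create one ConfigMap per tenant from that tenant's krb5.conf
# (names and paths below are placeholder examples).
kubectl create configmap tenant-a-krb5 \
  --from-file=krb5.conf=/etc/tenants/tenant-a/krb5.conf

# Point spark-submit at the tenant's ConfigMap instead of a local file.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.kerberos.krb5.configMapName=tenant-a-krb5 \
  ...
```

This avoids shipping krb5.conf with the submission itself, though unlike the proposed `downloadFile`-based approach it requires each tenant's file to be pre-loaded into the cluster.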
[jira] [Resolved] (SPARK-45242) Use DataFrame ID to semantically validate CollectMetrics
[ https://issues.apache.org/jira/browse/SPARK-45242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang resolved SPARK-45242. -- Fix Version/s: 4.0.0 Resolution: Fixed https://github.com/apache/spark/pull/43010 > Use DataFrame ID to semantically validate CollectMetrics > - > > Key: SPARK-45242 > URL: https://issues.apache.org/jira/browse/SPARK-45242 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45275) replace function fails to handle null replace param
[ https://issues.apache.org/jira/browse/SPARK-45275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diogo Marques updated SPARK-45275: -- Attachment: replace_bug.png > replace function fails to handle null replace param > --- > > Key: SPARK-45275 > URL: https://issues.apache.org/jira/browse/SPARK-45275 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Diogo Marques >Priority: Major > Attachments: replace_bug.png > > > [replace |https://spark.apache.org/docs/latest/api/sql/#replace]function > fails to handle null replace param, example below: > > df.withColumn('test',F.expr('replace(col1, "nUll", 1)')).show() > || ||col1||col2||test|| > ||0|person1|0.0|person1| > ||1|person1|2.0|person1| > ||2|person1|3.0|person1| > ||3|person2|1.0|person2| > ||4|None|2.0|None| > ||5|nUll|None|1| > > df.withColumn('test',F.expr('replace(col1, "nUll", null)')).show() > || ||col1||col2||test|| > ||0|person1|0.0|None| > ||1|person1|2.0|None| > ||2|person1|3.0|None| > ||3|person2|1.0|None| > ||4|None|2.0|None| > ||5|nUll|None|None| > > > This function has been ported over to 3.5.0 but I've not been able to test it > on that yet -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45275) replace function fails to handle null replace param
[ https://issues.apache.org/jira/browse/SPARK-45275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diogo Marques updated SPARK-45275: -- Description: [replace |https://spark.apache.org/docs/latest/api/sql/#replace]function fails to handle null replace param, example below: df.withColumn('test',F.expr('replace(col1, "nUll", 1)')).show() || ||col1||col2||test|| ||0|person1|0.0|person1| ||1|person1|2.0|person1| ||2|person1|3.0|person1| ||3|person2|1.0|person2| ||4|None|2.0|None| ||5|nUll|None|1| df.withColumn('test',F.expr('replace(col1, "nUll", null)')).show() || ||col1||col2||test|| ||0|person1|0.0|None| ||1|person1|2.0|None| ||2|person1|3.0|None| ||3|person2|1.0|None| ||4|None|2.0|None| ||5|nUll|None|None| This function has been ported over to 3.5.0 but I've not been able to test it on that yet was: [replace |https://spark.apache.org/docs/latest/api/sql/#replace]function fails to handle null replace param, example below: df.withColumn('test',F.expr('replace(col1, "nUll", 1)')).show() || ||col1||col2||test|| ||0|person1|0.0|person1| ||1|person1|2.0|person1| ||2|person1|3.0|person1| ||3|person2|1.0|person2| ||4|None|2.0|None| ||5|nUll|None|1| df.withColumn('test',F.expr('replace(col1, "nUll", null)')).show() || ||col1||col2||test|| ||0|person1|0.0|None| ||1|person1|2.0|None| ||2|person1|3.0|None| ||3|person2|1.0|None| ||4|None|2.0|None| ||5|nUll|None|None| > replace function fails to handle null replace param > --- > > Key: SPARK-45275 > URL: https://issues.apache.org/jira/browse/SPARK-45275 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Diogo Marques >Priority: Major > > [replace |https://spark.apache.org/docs/latest/api/sql/#replace]function > fails to handle null replace param, example below: > > df.withColumn('test',F.expr('replace(col1, "nUll", 1)')).show() > || ||col1||col2||test|| > ||0|person1|0.0|person1| > ||1|person1|2.0|person1| > ||2|person1|3.0|person1| > ||3|person2|1.0|person2| > 
||4|None|2.0|None| > ||5|nUll|None|1| > > df.withColumn('test',F.expr('replace(col1, "nUll", null)')).show() > || ||col1||col2||test|| > ||0|person1|0.0|None| > ||1|person1|2.0|None| > ||2|person1|3.0|None| > ||3|person2|1.0|None| > ||4|None|2.0|None| > ||5|nUll|None|None| > > > This function has been ported over to 3.5.0 but I've not been able to test it > on that yet -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45275) replace function fails to handle null replace param
[ https://issues.apache.org/jira/browse/SPARK-45275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diogo Marques updated SPARK-45275: -- Priority: Trivial (was: Major) > replace function fails to handle null replace param > --- > > Key: SPARK-45275 > URL: https://issues.apache.org/jira/browse/SPARK-45275 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Diogo Marques >Priority: Trivial > > [replace |https://spark.apache.org/docs/latest/api/sql/#replace]function > fails to handle null replace param, example below: > > df.withColumn('test',F.expr('replace(col1, "nUll", 1)')).show() > || ||col1||col2||test|| > ||0|person1|0.0|person1| > ||1|person1|2.0|person1| > ||2|person1|3.0|person1| > ||3|person2|1.0|person2| > ||4|None|2.0|None| > ||5|nUll|None|1| > > df.withColumn('test',F.expr('replace(col1, "nUll", null)')).show() > || ||col1||col2||test|| > ||0|person1|0.0|None| > ||1|person1|2.0|None| > ||2|person1|3.0|None| > ||3|person2|1.0|None| > ||4|None|2.0|None| > ||5|nUll|None|None| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org